
Transcript - Provenance and Social Science data


This is the transcript of the first webinar in the “Making Data Social” webinar series, which will discuss data issues of specific interest to the Social Sciences.

Published in: Education
[Unclear] words are denoted in square brackets.

Webinar: Provenance and Social Science data
15 March 2017
Video & slides available from ANDS website

START OF TRANSCRIPT

Kate LeMay: Today we're going to be speaking about provenance and social science data. So you should be able to see on our screen we're showing our data provenance community page and we have a data provenance interest group and if you're interested in that you can contact us through the contacts on that page. We have our speakers here. I'm Kate LeMay, I'm from ANDS and I'm one of the research data specialists at ANDS. We have George Alter, Steve McEachern and Nicholas Car. We'll give each of them a little bit of an intro when we get to their point in speaking. So as I mentioned this is part of a series, today's our first one.

So I'd like to introduce Steve and Nick who will be speaking first. So Steve is the Director of the Australian Data Archive at the Australian National University. He holds a PhD in industrial relations and a Graduate Diploma in Management Information Systems and has research interests in data management and archiving, community and social attitude surveys, new data collection methods and reproducible research methods. Steve has been involved in various professional associations in survey research and data archiving over the last 10 years and is currently chair of the executive board of the Data Documentation Initiative.

And Nick, Nicholas Car is the Data Architect for Geoscience Australia, GA. In that role he designs and helps build enterprise data platforms. GA is particularly interested in the transparency and repeatability of its science and the data products it delivers. For these reasons, Nick implements provenance modelling and management systems in order to represent and store information about data lineage: what was done, who did it and what they used to do it. Previous to working at GA, Nick was an experimental scientist at CSIRO and researched metadata systems, provenance, data management and linked data. He currently co-chairs the International Research Data Alliance's Research Data Provenance Interest Group, which the ANDS Provenance [unclear] Group works with, and through that and other groups assists organisations with provenance management.

Nicholas Car: Okay, thanks Kate. All right so this is a very quick introduction to PROV, so PROV is a provenance standard and what you see on that first slide there is a very, very simple diagram of a little provenance network and I'll discuss some of that as we go. So it's not just a frivolous diagram, it's actually - it has some meaning.

Okay so the outline for today, so what is PROV? I'm just going to mention that very quickly and then I'm going to get to how do I actually use this thing in a couple of different ways. So first I'll talk about modelling. Then I'll talk about how do I actually manage the data once I've collected or made provenance data and then I'll talk about using PROV with other systems.

So what is PROV? PROV is a W3C recommendation. So W3C is the World Wide Web Consortium. So it's one of the governing bodies of internet standards. They don't issue any documents called standards. They issue documents called recommendations. So PROV is a recommendation.
It's top level of standard I suppose. Other standards by the W3C are things like HTML. I'm sure everyone is familiar with HTML at least to some extent. PROV itself was completed in 2013 and sort of formalised by the end of that year. So it's only a couple of years old and a large number of authors were involved in PROV.

There were several initiatives to make provenance standards before PROV over the last perhaps 20 years and many of the authors involved in those standards such as PML and OPM, I'm not going to elaborate [unclear] so if you're interested in those previous standards just Google them. Many of the authors involved in those initiatives were involved with PROV. So PROV really does know about those other initiatives and it's simpler than those precursors because it's trying to do sort of a high level standard. It doesn't do as many of the tasks as those precursors do, but it certainly represents the very important bits that they come up with.

Another thing to say about PROV is that there's no version two planned any time soon. Why am I bringing this up now? Well it's a pain for people to have to deal with standards and then versions twos and threes and fours of standards. PROV doesn't quite operate like that and I'll explain how. It is what it is and there are ways to extend it and use it in different circumstances, but it's unlikely that we're going to see any version change in the next few years I would think.

It's seen good adoption. PROV is really the only international broad scale provenance standard and as a result people are happy to - I think, happy to adopt it in lieu of really anything else.

Right, so PROV is actually a collection of documents and I've just listed them there. I'm not going to go through them all in great detail, but there is an overview document and then certain bits and pieces which are actual recommendations or standards and additional things that just help you use the PROV thing.
Now the main document is the PROV-DM, the data model. That tells you what PROV contains, how its classes operate and so on. Then there's a series of documents like an XML version of PROV, an OWL ontology version and special notations and so on. The only other one I'll mention is the PROV-CONSTRAINTS which is a list of things, of rules that PROV-compliant chunks of data must adhere to and that works across any formulation of PROV. I'll provide a link there to the collection of standards - of documents.

So how do I use PROV? This is a modelling - how do I actually model something using PROV to do the core of provenance representation? Well I'm starting off with some negatives, so don't do it like this. Don't take a document for something, perhaps a metadata catalogue entry and expect to shove a bunch of information into some field within that document. So ISO 19115 is a standard for spatial datasets and it's got a field called lineage and some people expect to take provenance information and stick it in that lineage field. Don't do that. PROV doesn't let you do that. I'll explain why in a second. So that's one thing not to do. We're not going to see a single item's metadata record containing a bunch of provenance information. You could do that but not recommended.

What else should I not do? So this diagram here is the class model of DCAT, the Data Catalogue Vocabulary, which is a very generic metadata model. It's used in relation to things like Dublin Core and various catalogue style things and we're not going to link a dataset or any other object in DCAT or Dublin Core or other standards like that to a class of provenance information. This is true for Steve's DDI initiative as well. We're not going to take objects in DDI and link to a provenance object that tells you the provenance of that object. That's an anti-pattern right there.

So what are we going to do? We don't even do this using Dublin Core's provenance properties. So Dublin Core vocabulary has a property called provenance and the wording for that says, use this to describe lineage history. PROV doesn't want you to do that exactly like that.

What does PROV want you to do? PROV wants you to think of everything that you're interested in in terms of three general classes of objects. So in the scenario, the things that you're interested in, are they things, are they entities? Are they occurrences? Are they processes? Are they activities? Or are they causative people or organisations which [unclear] cause an agent? So PROV says model everything you know about using those three classes and then link them together and that's what PROV's all about.

So how does GA use PROV? So we often process chunks of data at GA. So we have a very simple model that's using the provenance ontology and it looks like this. There's some process, the process generates outputs, the outputs are entities, the process itself is an activity and then there's data and code and configuration and so on that feed into that process and those are also entities. Finally the process and the entities might be related to a system and even a person who operates that system. So that's the model we use.

Okay so how do I actually manage the data that I get in provenance or that I get according to PROV? Well you can create reports. So you can go and do something. A human or a system could log what they had done and they could store that information in some kind of database according to the PROV model. Then you can - it's a document database but you can query that thing. So we often have systems that sort of send reports every time they run. You might have a form that looks like any other metadata entry form where you fill in details and you hit enter and that sends off provenance information, but again it's not storing it with respect to one specific object, it's linking existing objects together. So some dataset that is produced from another dataset is going to link those two things together.
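The three-class model and the GA processing pattern described above can be sketched in a few lines of Python. This is a minimal illustration only, not GA's system: the `ProvGraph` class and every `ex:` identifier are hypothetical, and only the relation names (`used`, `wasGeneratedBy`, `wasAssociatedWith`, `wasDerivedFrom`, `actedOnBehalfOf`) are taken from the PROV data model.

```python
# Core PROV relations and the (subject kind, object kind) each one connects.
ALLOWED = {
    "used": ("activity", "entity"),
    "wasGeneratedBy": ("entity", "activity"),
    "wasAssociatedWith": ("activity", "agent"),
    "wasDerivedFrom": ("entity", "entity"),
    "actedOnBehalfOf": ("agent", "agent"),
}


class ProvGraph:
    """Hypothetical in-memory store: typed nodes plus links between them."""

    def __init__(self):
        self.kinds = {}       # identifier -> "entity" | "activity" | "agent"
        self.relations = []   # (subject, relation, object) triples

    def add(self, ident, kind):
        if kind not in ("entity", "activity", "agent"):
            raise ValueError(f"not a PROV class: {kind}")
        self.kinds[ident] = kind
        return ident

    def relate(self, subject, relation, obj):
        # Reject links whose ends are not the classes the relation expects.
        expected = ALLOWED[relation]
        actual = (self.kinds[subject], self.kinds[obj])
        if actual != expected:
            raise TypeError(f"{relation} links {expected}, got {actual}")
        self.relations.append((subject, relation, obj))

    def inputs_of(self, entity):
        """Entities used by whichever activity generated `entity`."""
        activities = {o for s, r, o in self.relations
                      if r == "wasGeneratedBy" and s == entity}
        return {o for s, r, o in self.relations
                if r == "used" and s in activities}


# One processing run: data, code and configuration go in, an output comes out.
g = ProvGraph()
for name in ("ex:input-data", "ex:code", "ex:config", "ex:output"):
    g.add(name, "entity")
g.add("ex:processing-run", "activity")
g.add("ex:processing-system", "agent")
g.add("ex:operator", "agent")

for item in ("ex:input-data", "ex:code", "ex:config"):
    g.relate("ex:processing-run", "used", item)
g.relate("ex:output", "wasGeneratedBy", "ex:processing-run")
g.relate("ex:processing-run", "wasAssociatedWith", "ex:processing-system")
g.relate("ex:processing-system", "actedOnBehalfOf", "ex:operator")
g.relate("ex:output", "wasDerivedFrom", "ex:input-data")
```

Note that nothing is stored "inside" the output's own record: the provenance lives in the links, so a question such as `g.inputs_of("ex:output")` is answered by walking the relations.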
For catalogue things we can link things again. If we have a catalogue that has a dataset A or X and a dataset Y and we want to show there's a linking, we can say dataset Y was derived from dataset X and record that information somewhere. Now dataset Y may record I come from dataset X but that's just a very simple, little bit of provenance information. It's not a whole glob of provenance information stored within dataset Y.

We can ensure that any system that has information that is provenance information like who the creator of a dataset was, does so in accordance with the PROV model. So in this case if we had a dataset that had a creator, we would say the dataset was associated with an agent and the agent had a role to play and that role, in this case, was creator. That's now a PROV expression of that relationship.

For databases it can be very difficult. I can't explain it in depth here, but there's many ways in which databases could store provenance or PROV related provenance information, but they would need to be able to show that they can actually export their content, their provenance content according to the PROV data model. You actually have to prove that if you want to say that you are compliant with a standard.

So fairly quickly, how do I get PROV to work with other systems? Well we can fully align our system, whatever this system is. So I've used a theoretical example of Metadata System X. How do I align Metadata System X with PROV? I could classify all of the things in Metadata System X according to PROV. It requires a metadata model for Metadata System X - sorry a data model. Not just encoding formats. We can't just deal with XMLs and so on. We actually have to have a conceptual model and then we can say, this class of thing in Metadata System X is the same as this class of thing in PROV. Now PROV's only got a few classes, so that's usually pretty easy to do.
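As a rough sketch of what "this class of thing in Metadata System X is the same as this class of thing in PROV" could look like in practice, the Python below maps some invented Metadata System X classes onto the three PROV classes and exports the aligned records as PROV-N-style statements. The class names `Dataset`, `ProcessingRun`, `Custodian` and `Keyword`, the `x:` identifiers and the `classify` helper are all assumptions made for illustration; only the PROV class names and the `wasDerivedFrom` relation come from the standard.

```python
# Hypothetical conceptual classes of "Metadata System X" mapped to PROV classes.
ALIGNMENT = {
    "Dataset": "entity",
    "ProcessingRun": "activity",
    "Custodian": "agent",
}


def classify(records, alignment=ALIGNMENT):
    """Return (PROV-N-style statements, ids of records with no PROV class)."""
    statements, unaligned = [], []
    for record in records:
        prov_class = alignment.get(record["class"])
        if prov_class is None:
            unaligned.append(record["id"])   # nothing in PROV claimed for it
        else:
            statements.append(f"{prov_class}({record['id']})")
    return statements, unaligned


records = [
    {"id": "x:dataset-X", "class": "Dataset"},
    {"id": "x:dataset-Y", "class": "Dataset"},
    {"id": "x:run-1", "class": "ProcessingRun"},
    {"id": "x:archive", "class": "Custodian"},
    {"id": "x:kw-42", "class": "Keyword"},
]

statements, unaligned = classify(records)

# The catalogue link from the talk: dataset Y was derived from dataset X.
statements.append("wasDerivedFrom(x:dataset-Y, x:dataset-X)")
```

The mapping is only possible because the records carry a conceptual class, which is the point made above about needing a data model rather than just an encoding format.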
But it will definitely prompt you to do things that you wouldn't normally do. You may have to tease apart some of the objects that you know and love into things that PROV recognises as different objects.

You could do a partial alignment. You could take your Metadata System X and only acknowledge that some of the things in that scenario are PROV understood things. So maybe you've got a metadata model that talks about all kinds of stuff and one of the things it talks about is a dataset. You say your dataset is the same as what PROV thinks of as an entity and maybe you ignore all the other things. You would still need to demonstrate that you could extract valid PROV out of that and not all the other stuff, but that would be one way to do it. You could also link to things not in your own data model if you also classified those things according to PROV.

The last scenario you could think about is to just deprecate your obviously not as good systems and use PROV. That would require you perhaps to make either a new dataset of provenance information or a data store and put that information somewhere and that's it.

Kate LeMay: Thank you very much Nick. So we'll move onto Steve.

Steve McEachern: Nick's talked about the sort of general PROV model that is increasingly getting used in various different spaces. I'm going to talk specifically about the various ways of thinking about provenance in what we're doing in the social sciences, particularly using - within the standard that we utilise and I'm not the Director for the Data Documentation Initiative. Part of the reason we've sort of connected these two together is we're now looking at how we can leverage the PROV standard inside DDI employer [facts]. So Nick and I and a group of others have been working on how we might go about this. I'm not going to touch too much on that but I'll return to that at the end.
I sort of want to talk more generally about how we might think about provenance at different stages of the data lifecycle, different stages in the research or in the data management experience and how we progressed thinking about provenance over that time. Just to give you a sense of, well what sort of things we can do already and how can we increasingly embed, capture provenance in what we do.

Okay, I'm quickly going to - for those who don't know. The Australian Data Archive, we've had various names over time. I'm going to do a quick introduction. We've been around for a little while now based here at the Research School of Social Sciences at ANU. Our mission is to collect and preserve Australian social science data on behalf of the social science research community in Australia and internationally. Now we've sort of developed a collection of over 5000 datasets now, over 1500 different studies as we call them or projects. Lots of different sources, lots of different provenance from various different locations, academic, government and private sector.

So as our holdings have developed, our understanding of provenance has developed probably alongside that. Maybe we didn't call it that at the time but after 35 years I think that's always been sort of underpinning a lot of what we've done. There's helping researchers who might be the secondary users of our data to know where did this come from, what was it used for and how might I use it in the future is really the emphasis there.

For those who don't really know what we're talking about when I use the term data archive, we're using the term a trusted system out of a project done by the Social Science & Humanities Research Council of Canada. They're kind of the equivalent in Canada for the ARC. “An accessible and comprehensive service empowering researchers to locate, request, retrieve and use data resources…” - so you've got to be able to find it and understand it - “…in a simple, seamless and cost effective way, while at the same time protecting the privacy, confidentiality and intellectual property rights of those involved”.
Part of why we're interested in provenance is really that last point. One is to help researchers understand where this came from but it is to sort of recognise and acknowledge the intellectual property that's been developed in those resources over time.

Okay, so I'm going to give a brief introduction to the DDI standard and its different flavours. As Nick pointed out, having multiple versions is not always much fun. We're up to version four. We're about 20 years old now. So I think that's not too bad from Nick's point of view. How we've sort of captured what we might think of as different forms of provenance over time. So I've got the website there if you're interested in knowing more. You know you can go and explore the different versions of the standard there.

So what is DDI? It's a structured metadata specification developed for the community and by the community. So particularly in social science data archives that exist in most OECD countries. It's used in about 90 different countries around the world now thanks to work by the World Bank and the World Health Organization and others. There's two major development lines that are basically XML Schemas. One's DDI Codebook and the other DDI Lifecycle which both correspond to version two and version three of the standard. I'll talk a little bit more about those in a moment.

We have some other elements to it as well, additional specifications including some controlled vocabularies often for things like encoding methodology, data types and data capture processes and some RDF vocabularies so that we can sort of start moving into a linked data world. So you can leverage the standard, particularly the Lifecycle standard into a linked data environment.

The current - the version four is in development at the moment, has been over the last couple of years and that's where the work with Nick has sort of come on board as well. It's moving to a model based specification.
So rather than being based in a particular schema we're looking at sort of to focus on the model then its expression into various different formats. The provisional ones at this point are XML and RDF and that includes support for provenance and process models. So we're looking at that point at how do we leverage what we know from PROV to support the provenance model within the new version of the standard. It's managed by the DDI Alliance.

So briefly on the two versions of the standard that are already in place, so it's been around in the codebook format which has its origins in - the print codebooks are produced by organisations like [Georges] going right back to the 1960s and 70s. So we've sort of formalised in the social sciences a fairly structured way of thinking about describing data back 40 years ago really. So the codebook version of the standard really is an after the fact description of what this dataset is about.

It includes four basic sections, the document description which is describing the document that's describing the dataset. A study description, we use the term study to describe sort of the package of datasets that encapsulate a project. So that includes characteristics of the study itself that the DDI is describing. That includes lots of sections on authorship, citation, access, conditions, but particularly, from the point of view of provenance, we have their methodological content, data collection processes, sources.

Then we also include a lot of what we call related materials. They are documents associated with the project. We tell you something about the provenance of where it came from. It includes all the questionnaires, previous codebooks, technical reports, et cetera. So from a human point of view you're starting to get into the area of thinking about provenance even though it's not really a [machine actionable] version of that.

We also describe the files themselves, the characteristics of the physical data files, data formats, et cetera, their size and their structure. Then what we call variable descriptions.
Descriptions - the variables that are included in the data file. The simplest way of thinking about this is the columns of a tabulated dataset. What does that column mean because in a lot of the social sciences, a column - a number does not represent actually a number. It represents a characteristic of some sort. For example, a five point agree/disagree scale in a survey, how do you interpret a lot of those becomes important. George is going to talk to a specific project looking at how we do a lot more with the variable description and the [unclear] of variables in a moment.

So codebook was really describing - developed to describe things after the fact. The DDI Lifecycle Model takes a more data lifecycle approach to thinking about capturing metadata and provenance. Underlying it is the model we have on the screen here. I think this is just a working model of describing the different processes in the DDI framework that a dataset can go through. Everything from conceptualising the study in the first place through collection and processing and distribution, as a side point archiving that data and storing it around for future use and then rediscovery and analysis and repurposing into the future.

So it was built with the intent of re-usability and particularly machine-actionability as well. So the metadata that's developed in a dataset can be re-used in the future [for sport] for the same purpose, a similar purpose or something entirely new. In order to do that you need to be able to understand where did it come from. So embedded in that is generating metadata going forward to be able to look backwards through the lifecycle as well. So as - it's focused on metadata re-use. That re-use of metadata really implies a provenance in expectation.

So why DDI Lifecycle, the things it can do, it's machine actionable. It's more complex. There are 27 different schemas. It's probably overly complex if we're being fair. It's structured and identifiable. So every metadata item is actually able to be permanently identified and managed and repurposed if that's required.
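The point made earlier about variable descriptions, that a number in a column often stands for a characteristic rather than a quantity, can be illustrated with a small sketch. The Python below is not DDI itself: the `VariableDescription` class, the variable name `Q12`, the question wording, the five labels and the missing-value code `9` are all invented for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class VariableDescription:
    """Hypothetical stand-in for one variable's entry in a codebook."""
    name: str
    question_text: str
    value_labels: dict                       # stored code -> meaning
    missing_codes: set = field(default_factory=set)

    def decode(self, code):
        """Translate a stored code into its label; missing codes become None."""
        if code in self.missing_codes:
            return None
        return self.value_labels[code]


trust = VariableDescription(
    name="Q12",
    question_text="Most people can be trusted.",
    value_labels={
        1: "Strongly disagree",
        2: "Disagree",
        3: "Neither agree nor disagree",
        4: "Agree",
        5: "Strongly agree",
    },
    missing_codes={9},
)

column = [4, 5, 9, 2, 4]                      # raw codes as stored in the data file
decoded = [trust.decode(code) for code in column]
frequencies = Counter(label for label in decoded if label is not None)
```

Without the description, the column `[4, 5, 9, 2, 4]` could be averaged as if the codes were measurements, with the `9` silently treated as a very strong answer; with it, the same column can be tabulated by label.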
It supports related standards and it supports reuse across different projects and again, that's sort of something that George is going to touch on as well. [I'm going to pass this] because I think there are some particular features for it that I can refer back to in the future. But I want to talk very briefly about how do we think about provenance within the different versions and then pass to George, who just wants to talk specifically about one of the projects there.

So if we think about how provenance is being supported here, I mean Nick's approach to the PROV model is really a machine actionable model, fundamentally, DDI Codebook is not really designed for that. But it is designed at least to be able to describe to a human reading a catalogue entry, what the provenance of this dataset was. So it includes attribution, methodology, data processing, collection and all the documentation we can find on what happened to the data. But it doesn't really do that as a sort of automated way, it's really focused on a human response to prior research to be able to come back and have a look. Similarly, with variables, question texts, [variable] name, what the value labels mean are all there.

DDI Lifecycle is really trying to - it was our first attempt really to look at sort of the machine actionable provenance. So can we capture this along the way or represents again the information from the studies, attribution, methodology and so forth. But particularly with variables it's really trying to look at the reusable elements of how we might reuse questions, reuse columns of data and understand and reuse the basic conceptual ideas that are embedded within that. So, for example, if you've got a variable measuring employment, can I reuse that employment maybe the categorisation that was used, the numbers that were used in the survey and so forth.
Then where we're going with DDI 4; our tagline for that is what we're calling DDI Views, is can I - to what extent can I actually embed a provenance model inside that framework. So now we're moving towards really recognising the importance of provenance both conceptually and even sort of the physical and digital formats of data as well. Measuring codes and categories across the lifecycle. For example, managing what provenance of missing values. If your value of a datum changes, how do I understand that? So we've got really - we're able to generate this out automatically, what happened at the level of an individual datum of a variable or of a dataset. So we're moving progressively towards the sort of framework that Nick described but [unclear] but that requires the management of the metadata that we have to be moved forward. That's kind of it from me.

George Alter: Hi everyone. Thanks very much to ANDS and to ADA for inviting me to be here. What I'm going to talk about today is a project that started in October with funding from the US National Science Foundation about capturing metadata during the process of data creation. So I don't think for this audience I have to go into this - justify metadata but the big problem that we face is how do we actually get the metadata. That's often more difficult than we - it's a lot easier to describe it than it is to actually get it most of the time.

So to give you some background I'm going to put this in the context of my home institution which is the Inter-University Consortium for Political and Social Research located at the University of Michigan. We've been in the business of archiving social science data since 1962 and we're an international consortium of more than 760 institutions. We were also one of the founding members of the Data Documentation Initiative Alliance which Steve just talked about and we actually provide the home office for the DDI Alliance.

ICPSR has been using DDI for many years but we're now getting to the point where we're able to build all kinds of tools that take advantage of DDI.
One of the first things that we were doing which we've been doing for at least 10 years is that when you download data from ICPSR you get with it a codebook and pdf. What the pdf is [that] - we created from the DDI, not the other way around. So for us the DDI is the native version of the metadata.

So what we started to do is take advantage of DDI to build more kinds of tools. One of the first ones we created was what's called our variable search page where you can put in a search term and look for questions that have been used in datasets that are like that search term. So this is an example of the results that come out of a variable search and we are now searching over more than 4.5 million variables in about 5000 studies or data collections.

One of the things that DDI makes possible is that we can go from this search to other characteristics of the data. So you can see here in the blue that there are a number of things that are hyperlinked. If you click on the place I've got circled, it takes you to an online codebook. The online codebook has a number of features. It tells you the question that was asked. It tells you how it was coded. If the data are available online you can go to a cross tab tool and it also can link to an online graphing tool. The other thing that you see on the left side of the screen is a list of the other variables in the dataset. So you can move around in the dataset and clicking on any of those variables will bring up a display a bit similar to this.

Another thing you can do from our variable search screen is if you click on these check boxes on the left, you can pick out a certain number of variables that you want to look at more closely and clicking on this compare button at the top there brings you to this screen which is a side by side comparison of these different variables which come from different studies and so you can see whether they're asking the same question, whether they're coded the same or differently. As before, this screen is also hyperlinked to the online codebook so you can go back and forth.
One of our more recent tools which I think is one of the most powerful is that you can now search for datasets that include more than one variable that you're interested in. So this is a search in using what we call our variable relevant search that's actually in the study search rather than the variable search, where we're looking for three - variables about three different things. Does the respondent read newspapers? Do they volunteer in schools? What's their race? You can see here that the results come out in three different columns within each study so you can see which variables are present in each study. As before everything is hyperlinked to both the online codebook and the variable comparison. So you can check on any combination of these variables and compare them side by side.

Another thing that we did as another previous NSF project, working with the American National Election Study and the General Social Survey, we made a crosswalk of the variables that are available in those two studies. Now the American National Election Study started in 1948 and is done every four years. The General Social Survey started in 1972 and is done every two years. So we're actually going to be looking over 70 different datasets. What we've done is created this crosswalk where we've grouped the variables according to certain tags. We've got eight lists of tags and then 134 tags in total. The columns here, each column represents a dataset and there are 70 datasets. All of the variables are linked here and I can't actually show it here but if you hover over one of those variables it shows you the question text for the variable. Again you can use the checkboxes to pick out things that you want to compare and go the variable comparison screen.

So this is - a crosswalk like this is a tool that's actually very common. You've probably seen these before. There are two things that are different about this though. One is that this is all keyed into the online codebook so you can go transparently back and forth.
The other thing is that we can use this tool to crosswalk any of the 4.5 million variables in the ICPSR collection because this is drawing
  16. 16. AUDIO_-_Provenance_and_Social_Science_data Page 16 of 25 directly from our store of DDI metadata and we don't have to build a separate tool for each one. This one tool works over all of these datasets. Another thing that we did in this project was to think about how we could extend the online codebook. So here's our online codebook that you saw before which has the question text and how it was coded, but this version has something new in this location here. It shows how you got to this question. In big surveys every respondent doesn't answer every question. There are what are often called skip patterns. So you get asked what your marital status is and if you're single you go to one question, if you're married you go to another question, divorced people go to a third pattern. So there are different pathways through the questionnaire. What we've done here is try to show, here's how you go to that question which explains why some people didn't answer the question. We also represented it in words down here. So we built this and we were quite proud of ourselves for building it because this does answer the question about who answered this question in the survey. But then we ran into a problem so how do we know who answered the question in the survey. The answer is that we get that information from the data providers in a pdf. The only way we could build this demo prototype was to have one of our staff members enter this program flow information by manually into XML for one of datasets so we could show how this works. So we showed a tool that we think is really useful, but we reached a roadblock because we don't actually get machine actionable metadata about this kind of information. The problem is that when the data arrived at the archive, they don't have the question text. That's something that we at ICPSR and ADA have to type in. They don't have the interview flow. 
They don't have any information about variable provenance and variables that are created out of other variables are not documented.
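To make the skip-pattern idea above concrete, here is a minimal sketch of interview flow held as machine-actionable routing rules. It is not the ICPSR tool or a DDI structure: the question names (`Q_SINGLE`, `Q_MARRIED`, `Q_DIVORCED`, `Q_END`), the question texts and the three respondents are all hypothetical, and the routing rules are plain Python functions assumed for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Answers = Dict[str, str]


@dataclass
class Question:
    name: str
    text: str
    # Routing rule: given the answers so far, name the next question,
    # or return None when the interview ends.
    next_question: Callable[[Answers], Optional[str]]


# Hypothetical three-way skip pattern keyed on marital status.
MARITAL_ROUTES = {"single": "Q_SINGLE", "married": "Q_MARRIED", "divorced": "Q_DIVORCED"}

QUESTIONS: Dict[str, Question] = {
    q.name: q
    for q in [
        Question("MARITAL", "What is your marital status?",
                 lambda a: MARITAL_ROUTES.get(a.get("MARITAL", ""))),
        Question("Q_SINGLE", "Do you live alone?", lambda a: "Q_END"),
        Question("Q_MARRIED", "Is your spouse employed?", lambda a: "Q_END"),
        Question("Q_DIVORCED", "In what year were you divorced?", lambda a: "Q_END"),
        Question("Q_END", "Any other comments?", lambda a: None),
    ]
}


def route(answers: Answers, start: str = "MARITAL") -> List[str]:
    """Return the ordered list of questions this respondent was asked."""
    path: List[str] = []
    current: Optional[str] = start
    while current is not None:
        path.append(current)
        current = QUESTIONS[current].next_question(answers)
    return path


def asked(question: str, respondents: Dict[str, Answers]) -> List[str]:
    """Return the ids of respondents whose pathway included the question."""
    return [rid for rid, answers in respondents.items() if question in route(answers)]


# Hypothetical respondents, one per branch of the skip pattern.
RESPONDENTS = {
    "r1": {"MARITAL": "single"},
    "r2": {"MARITAL": "married"},
    "r3": {"MARITAL": "divorced"},
}

print(route(RESPONDENTS["r2"]))
print(asked("Q_MARRIED", RESPONDENTS))
```

With the flow stored this way, "who was asked this question" becomes a lookup over the routing rules rather than something a staff member has to read out of a PDF.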
So the project we're working on now, which is called C2 metadata, for continuous capture of metadata, is about how you get that. To understand how we get it, you have to think about how the data are created and what happens. So first of all the data themselves are actually born digital. People do not go around with a paper questionnaire these days. They use these computer assisted interview programs. They're on the telephone or they go around with a laptop or a tablet to answer them. There's no paper questionnaire. There is instead a program, and it's the program that's the metadata. So technically at the beginning you start with this computer assisted interviewing system and what you get out of it is the original dataset. But you can also derive from it DDI metadata in XML and there are programs, a couple of different programs, that will take these CAI systems, the code that they run on [unclear] to XML. But what happens next? Well, what happens next is that the project that commissioned the data is going to modify the data. There are a number of reasons for doing that. There are some things that are in the data that are just purely there for administrative purposes. There are some variables that have to be changed to reduce the identifiability of individuals. Some variables need to be combined into scales or indexes. So what they do is they write a script that's going to run in one of the major statistical packages. They take that script, and the script and the data go through that software and what comes out is a new dataset. Well, what happens to the metadata? Well, at this point the metadata don't match the dataset anymore and you would need to update the XML to fix it and nobody likes updating XML. So the metadata get trashed and thrown away. What happens then is this: after the data are revised, the metadata are recreated.
What happens is that we at the archive take the revised data and extract as much metadata from it as we can. So we get an
extracted XML file and what about the things that went on in the script here? Well, we actually have to sit down and extract them by hand. So a person has to read the script and write down what happened. Well what we're working on in - well, so what are we missing? Well, what we get from the statistics packages are just names, labels for variables, labels for values and virtually no provenance information. So what we're working on is a way that we can automate the capture of this variable transformation metadata. So what I did is this: we're going to write software where you could take the script that was used to modify the data, take the very same script and run it through what we're calling a script parser and pull from that the information about variable transformations. Put that into a standard format which we're calling a standard data transformation language. Then you take that information and incorporate it into the original DDI. You update the original DDI and then you've got a new version of the XML that is in sync with the revised data. So this process then requires two different software tools, one that will read the script and turn it into a standard format and a second one that will update the XML, and that's what we're building. So we are building tools that will work with the different software packages and update XML. We're actually writing these parsers for scripts in four different languages: SPSS, SAS, Stata and R. The reason we're doing four languages is that if you look at the column over there on the right, which is based on downloads at ICPSR in cases where the dataset had all four formats, you can see that there's not a single dominant format. SPSS and Stata are the most downloaded formats from ICPSR and they both have about 24 per cent. SAS and R both have about 12 per cent.
If we did one package we wouldn't please any - we'd be pleasing only a few people and we couldn't have an impact. So we're actually writing parsers for four languages.
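As a rough illustration of the script parser idea described above, the sketch below reads a tiny, Stata-like subset of commands (`generate` and `replace ... if`) and emits one plain record per variable transformation. It is a toy under stated assumptions, not the project's parser: the function name `parse_script`, the record fields, and the dictionary format standing in for the standard data transformation language are all hypothetical.

```python
import re
from typing import Dict, List, Optional

# Toy grammar (an assumption for this sketch): one command per line, either
#   generate <target> = <expression> [if <condition>]
#   replace  <target> = <expression> [if <condition>]
COMMAND = re.compile(r"^(generate|replace)\s+(\w+)\s*=\s*(.+?)(?:\s+if\s+(.+))?$")
# Identifiers that start with a letter or underscore. Function names such as
# log(...) would also be picked up, which a real parser would need to filter.
IDENTIFIER = re.compile(r"\b[A-Za-z_]\w*")


def parse_script(script: str) -> List[Dict[str, object]]:
    """Turn each recognised command into a transformation record."""
    records: List[Dict[str, object]] = []
    for number, raw in enumerate(script.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("*"):  # skip blank lines and comments
            continue
        match = COMMAND.match(line)
        if match is None:
            # Fail loudly so that no transformation is silently dropped.
            raise ValueError(f"line {number}: unrecognised command: {line!r}")
        command, target, expression, condition = match.groups()
        names = IDENTIFIER.findall(expression) + IDENTIFIER.findall(condition or "")
        records.append({
            "command": command,
            "target": target,
            "expression": expression,
            "condition": condition,
            "sources": sorted(set(names)),  # variables the result depends on
        })
    return records


EXAMPLE_SCRIPT = """
* derive total pay, then blank out a sentinel code
generate total = wage + bonus
replace income = . if income == -1
"""

for record in parse_script(EXAMPLE_SCRIPT):
    print(record)
```

The `sources` field is the variable provenance: it records which existing variables each new or modified variable was built from, which is the information a second tool would merge back into the XML.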
Here's something that's come out of our work that I thought you might find interesting. This is about why we need to have a special language for expressing these data transformations. So here are three brief programs in SPSS, Stata and SAS that all are designed to operate on the same data. I tried very hard to make the programs, the scripts, identical and I think that I succeeded. But if you run these three programs you get three different results. The key thing here is to look at the last row, the row in which we set the minus one to be missing. In SPSS you get two missing values. In Stata and SAS one of the variables is set to a number, but it's a different one in each one. Why does this happen? Well, the reason is that in logical expressions SPSS treats a missing value as missed [unclear], makes the result of a logical expression that includes a missing value, missing. Which in most cases is treated as false. Stata treats a missing value as a number which is equal to infinity. SAS treats the missing value as a number which is equal to minus infinity. So both Stata and SAS actually do return a number when you have one of these comparisons. So it's actually more accurate to represent the data in this way, which you wouldn't see if you just looked at the datasets.

So what we're doing is creating our own language. Well, we're actually using a language that's been created by another community, the SDMX community, called Validation and Transformation Language, so that we can put all three of these languages into a common core.

So what are we doing and why are we doing it? So the goal of the project is to capture this metadata and automate it. If we can capture more metadata from the data creation process, we'll be able to provide much better information to researchers about what's in the dataset. Automating this process we hope will make it cheaper for everyone and make it easier.
That has been one of the principles that we've tried to follow here: if we can't make it easier for the researchers, they're not going to do it.
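The three-package comparison described earlier can be imitated in a few lines. The sketch below is not the speaker's SPSS, Stata and SAS scripts, which are not reproduced in this transcript; it is a simplified model, assuming a missing value is represented as `None` and that the script sets `low` to 1 when `x < 0` and `high` to 1 when `x > 0`. The function and variable names are hypothetical, and the real packages have more nuance than this.

```python
import math
import operator
from typing import Callable, Dict, Optional

CONVENTIONS = ("spss", "stata", "sas")


def compare(x: Optional[float], op: Callable[[float, float], bool],
            y: float, convention: str) -> Optional[bool]:
    """Evaluate `x op y` under one package's missing-value convention."""
    if convention not in CONVENTIONS:
        raise ValueError(f"unknown convention: {convention!r}")
    if x is None:
        if convention == "spss":
            return None  # the comparison itself is missing
        # Stata-like: missing sorts above every number; SAS-like: below.
        x = math.inf if convention == "stata" else -math.inf
    return op(x, y)


def derive_flags(x: Optional[float], convention: str) -> Dict[str, Optional[int]]:
    """Set a flag to 1 only when its condition is true; otherwise leave it missing."""
    low = 1 if compare(x, operator.lt, 0, convention) else None
    high = 1 if compare(x, operator.gt, 0, convention) else None
    return {"low": low, "high": high}


for convention in CONVENTIONS:
    # x is missing here, as in the row where minus one was recoded to missing.
    print(convention, derive_flags(None, convention))
```

Under these assumptions the SPSS-like rule leaves both flags missing, the Stata-like rule sets only `high`, and the SAS-like rule sets only `low`, which is the kind of divergence a common transformation language has to spell out explicitly.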
So the hope here is that the software we get will make their lives easier. Here's just a - to acknowledge some of my partners in this, we've got partners from a couple of software firms, Colectica and Metadata Technology North America, the Norwegian Centre for Research Data, and the two projects I mentioned, the General Social Survey and the American National Election Study, are part of the project too. So that's my talk.

Kate LeMay: Fabulous, thank you very much George. We had a question that came through earlier from your talk when you were speaking about people putting variables into ICPSR and searching for them and [Ming 43:40] has asked, when a user searches for a variable or variables, do they need to come up with the exact variable name as in the variable index?

George Alter: So right now what we're doing is really a text search. When you search for variables, you're searching over the variable name and the variable label. It also can bring up items that are in the values for the variables. But one of the problems in the social sciences is that people don't reuse questions very often. So we don't have a tradition of reusing questions. It's very hard to find the same question in multiple datasets. The kind of search we're doing now in our question bank is frankly kind of clunky and it often misses things. That's an issue that I'm trying to address in some other projects, where we're trying to improve the way we can search over variables.

Kate LeMay: Thank you very much. We've got a question for Nick as well.

Nick Car: [Unclear].

Kate LeMay: Yes. So Nick we've got a question, how widely is PROV used and what have you found to be the main challenges working with PROV? Noting that a V2 is not on the horizon, is it easy to update a PROV model if a change is required?

Nick Car: Okay, so first part first, how widely is it used? So I have a direct interest in things provenance but aside from that I have an interest in things geospatial and I guess physical sciences data. In that community there's only one game in town, that's through PROV, but it's early days. So most of the spatial, geophysical, blah, blah, blah sorts of places, on the hard physical sciences side, they either are using their own systems or they're intending to use PROV. There's not many that are actually already using PROV. But there are certainly not many that are intending to use something other than PROV. Outside of my own Geoscience Australia area, other communities I know of including DDI and so on - because PROV's only been around for a few years, if people can characterise their problem in a provenance way, like they actually understand this as a provenance question as opposed to some other kind of question like an ownership or an attribution question, they fairly quickly end up at PROV. So I think it's about as - it's certainly more widely used than any other provenance standard has ever been and it's showing signs of being much more widely used than that, and that's because the other initiatives in the space have been sort of swallowed up by PROV. Now the second part of the question was, what are the problems, and I've identified one already, which is people have to know that they're asking a provenance question. So we get a lot of questions which are synonyms for provenance questions, probably much like variable naming, where people say I'm interested in the lineage of my data or the transparency or the process flow or the ownership or attribution, and those could all be provenance questions.
The hardest thing to work out is specifically what questions are being asked and then if there is an existing metadata model or something in that space already, what's it doing and what's it not doing and therefore do we need provenance - a specific provenance initiative. So for instance many metadata models have authorship, ownership, creator information indicated in them. So if your provenance question
is, I want to know datasets created by Nick, that kind of provenance question you can usually answer in other metadata systems. You'd have to have something a bit more complicated than that and determine provenance to then think about using the provenance system. The other thing is the move away from what I call point metadata where you've got a single thing with a bunch of properties that come from it. So a study or a document or a chunk of data with a bunch of properties. That's one way to do things, but what PROV and what other models are interested in is whole networks. Things that relate to other things. It's more complex, but it's much, much more powerful to do that.

Kate LeMay: Great, thank you very much. So question for George, how is sensitive data, variables or values, controlled for during the C2 automatic capture? ICPSR has a confidentialised service on ingest, is this process carried over to the C2 metadata project? Is this activity captured in PROV-like metadata?

George Alter: So the C2 metadata model is to operate solely on the metadata, not on the data. So that's really a - so it doesn't really play into the issue of confidentiality. If you're interested, in two weeks we're going to have another webinar where I am going to talk about how we manage confidential data, but in general it's rarely the case that we have to mask the metadata of a dataset for confidentiality reasons. Obviously controlling the data is something else.

Kate LeMay: So we've got another question here for George. Your script parser that reads from SAS script, do researchers need to - would they need to install that in their SAS package?

George Alter: We haven't gotten to that point yet but probably not. Probably what we'll do, at least as a starting point, is offer it as a web service.
What you'll do is simply export your SAS program into a text file and upload the text file to the web service and it will download a new XML file.

Kate LeMay: So we've got another question here. Does PROV support the workflow of creation and approval of provenance data, e.g. the PROV entry is proposed and has been submitted to the data custodian for approval?

Nicholas Car: Well it's got two kinds of answers to it. One is a generic PROV answer and the other one seems to be more in line with a particular repository or a particular set of steps. So this isn't exactly what you asked, but I'm going to answer it in a slightly different way. You can talk about the provenance of provenance, which is a bit tricky. But say you had information about the lineage or the history of a dataset and you wanted to control that chunk of stuff, you could talk about that thing being a dataset itself, even though it's about something else, and manage that. You could certainly work out how to link your dataset to the dataset that contains its provenance information. So you can do that. But the second part of the question, or I think the general sense of the question, is more to do with how does a specific repository do things. Does that make sense? Does PROV support the workflow of creation and approval? Okay, in general you can represent anything in PROV because it's really high level and it's got those three generic classes of entity, activity and agent. There's almost nothing in the world that I've come across that you can't decompose down into one of those three things. Is it a thing? Is it a causative agent or is it a temporal occurrence? So in general yes.

Kate LeMay: Okay fabulous.

Nicholas Car: So Natasha asks, philosophical question for the whole panel, how do you think provenance relates to trust? So I'm going to jump in very quickly and say, provenance models before PROV often had the word trust in them somewhere.
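As a small illustration of the three generic PROV classes mentioned in the answer above, the sketch below writes a proposal-and-approval workflow as a network of entities, activities and agents. It uses plain Python tuples, not a PROV library. The relation names `wasGeneratedBy`, `used`, `wasAssociatedWith` and `wasDerivedFrom` come from PROV, but every identifier (`dataset`, `prov_record_proposed`, `approve`, `custodian` and so on) and both helper functions are hypothetical.

```python
from typing import List, Set, Tuple

# Hypothetical things, temporal occurrences and causative agents.
ENTITIES = {"dataset", "prov_record_proposed", "prov_record_approved"}
ACTIVITIES = {"propose", "approve"}
AGENTS = {"researcher", "custodian"}

# (subject, PROV relation, object) statements forming a small network.
RELATIONS: List[Tuple[str, str, str]] = [
    ("prov_record_proposed", "wasGeneratedBy", "propose"),
    ("propose", "used", "dataset"),
    ("propose", "wasAssociatedWith", "researcher"),
    ("prov_record_approved", "wasGeneratedBy", "approve"),
    ("approve", "used", "prov_record_proposed"),
    ("approve", "wasAssociatedWith", "custodian"),
    ("prov_record_approved", "wasDerivedFrom", "prov_record_proposed"),
]


def objects(subject: str, relation: str) -> List[str]:
    """Return everything the subject points to through the given relation."""
    return [o for s, r, o in RELATIONS if s == subject and r == relation]


def responsible_agents(entity: str) -> Set[str]:
    """Agents associated with the activities that generated the entity."""
    agents: Set[str] = set()
    for activity in objects(entity, "wasGeneratedBy"):
        agents.update(objects(activity, "wasAssociatedWith"))
    return agents


def upstream_entities(entity: str) -> List[str]:
    """Walk back through generating activities to the entities they used."""
    seen: List[str] = []
    stack = [entity]
    while stack:
        current = stack.pop()
        for activity in objects(current, "wasGeneratedBy"):
            for used in objects(activity, "used"):
                if used not in seen:
                    seen.append(used)
                    stack.append(used)
    return seen


print(responsible_agents("prov_record_approved"))
print(upstream_entities("prov_record_approved"))
```

Here the provenance record is itself an entity with its own history, so "who approved this" and "what was it based on" are answered by following links rather than by reading properties off a single record.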
Many of the motivations for provenance models were to do with trust. We deal with trust as - the goal of Geoscience Australia is to put out data and make it open and transparent. It's fundamentally a trust issue for users of that data.
They want to know how did this data come to be. So that's really what provenance is about. It's about telling about the history of something so you can generate all sorts of trust. But then there are the specifics of what you put in there, and you can work out, do I trust the people who created this thing? Do I trust the process that was undertaken to deal with it or transform it? Do I trust the particular chunks of code that we used? So that's the generic answer. Then there's the sort of more specific ones like for data [unclear] repository how do I trust that it's - even though you're telling me something about it, [but whether it's] in fact true. There are also very difficult things about how do I actually trust this metadata even if it looks like it's all [unclear]. If this data comes from God delivered to you on a stone tablet, I could write that down, but is it true? You have to work that out. That is now a non-provenance thing. You have to work out some other way of attributing a trust metric to that claim. That might be that it's digitally signed and you trust the agency that delivered it. So that's an appeal to authority. You might trust that there's enough information present for you to understand the process enough to have confidence in it. It might link to well-known sources like Open Code or something like that, that you trust, or maybe there's a mechanism for you to validate certain chunks of data or calculations. So if the total number is five, you can look back through the provenance and see somewhere two plus three; where you see five, you can calculate that and establish that trust directly.

George Alter: So I think - Nick said it very well. But I'll say the same thing in fewer words [laughs].

Nicholas Car: Thanks George, thank you.

George Alter: Provenance is really fundamental to trust and Nick really hit the nail on the head when he talked about transparency.
Provenance is about transparency and in the world we live in now, even appeals to authority don't work very well anymore. I think that for science to have
- to gain legitimacy and gain trust, we have to be transparent and that's what provenance metadata is all about.

Kate LeMay: So we've reached the end of our time. I'd just like to thank our three speakers for coming along to our ANDS Canberra Office today and speaking to us about provenance and introducing lots of new acronyms to us all. Every time I encounter anything new at ANDS there's always more acronyms to learn. So thank you very much for coming. We have two more webinars in the social science series, so hope to see you there again.

END OF TRANSCRIPT