Good morning. My name is Grace Currie and this is my colleague Ann Jebson. We are from the Information Management team at Statistics NZ and we are here to present to you what we like to think of as “A case study of DIY data management”.
All of us are here because we are interested in how to integrate digital preservation requirements into the design of systems. When we think of system design it’s sometimes hard not to think of a magical IT system that manages digital content from the time of creation. But system design is also about processes, understanding your business, communicating with your business, influencing workflows, and cultivating the right attitude in your organisation. As New Zealand’s National Statistics Office, our core business revolves around the collection, analysis and publication of data. This data has immense ongoing value. Today Ann and I will tell you how the Information Management team at Statistics NZ approached the task of identifying thousands of datasets and their associated metadata and documentation so they could be preserved for future reuse.
This is a journey that has taken us from the “fire fighting”, “ambulance at the bottom of the cliff” position, which I’m sure many of you will be familiar with……
……to a state where we can now support the data management practices of our business. For us this means being involved over all the phases of the model you see here – our Statistical business process model. This model illustrates the seven stages that most studies follow during the production of official statistics. When we say a “study” we mean an activity where data is collected, for example by a survey or a census, to produce a set of information. Studies you may know of include: Consumers Price Index (CPI), Gross Domestic Product (GDP) and the Census of Population and Dwellings.
The ever increasing volume and diversity of content created in our organisations means that developing innovative methods to identify what content to preserve is more important than ever. The message we want to get across to you today is that although such a task can be daunting, it can be achieved with tools that we all have readily available. Don’t be afraid to DIY.
A few years ago our data management situation was a bit like a teenager’s messy room. We had stores of digital information rapidly growing in multiple locations. The problem was that our statistical analysts were very competent with the confidentiality, privacy and security aspects of data management, but they weren’t so good with documenting the basics, like file names and locations, what the data was used for, and what metadata was associated with what data. This was a bit of a problem considering that in 2005 the Statistics NZ Data Archive was established as a repository to preserve valuable data and ensure its availability to future users, both internal staff and external researchers. We had no shortage of valuable data to preserve in the Data Archive – legacy data was in abundance – but we lacked the basic information required to begin ingesting this data into our archive in large quantities.
To give you an idea of the size of the problem we were tackling, consider this: the move to digital processing took place around thirty years ago; currently, there are over 250 releases of official statistics every year; and multiple datasets are produced for each of these releases over the collection, processing, analysis and dissemination phases of the model I showed you earlier.
In addition to this, data is different from other digital content. Data is not as self-descriptive as other information, such as written documents or images. The numbers you see here mean nothing without context. For data, that context is provided by statistical metadata. Statistical metadata is information that helps us understand data and make information out of numbers. Statistical metadata refers to information about surveys and their publications, questions and questionnaires, variables and methodologies. Therefore, to preserve our valuable data in a way that would mean it would be understandable and usable in the future, we also needed to locate and consolidate a large quantity of statistical metadata for each study.
So, how did we approach this situation? We got off to a good start because we had two things that were instrumental to success. Firstly: That Statistics NZ, more specifically the senior leadership, had a vision for data reuse. Our 2006 statement of intent talked about our data as “an enduring national resource” and placed importance on “ensuring that information is maintained in an accessible format for possible future use”.
Secondly: Since data fell outside of the coverage of Archives NZ’s General Disposal Authorities, Statistics NZ and Archives together developed a specialised Appraisal Report and Disposal Schedule for Statistical Data, Documentation and Metadata. The appraisal report recommended the retention of final, definitive versions of official statistical datasets. It also recommends retention of the core documentation and metadata which summarises the design, development, collection, processing, and analysis of official statistical collections and data. This gave us a yardstick with which to evaluate our data.
However, there was much we didn’t know and we didn’t have a process or system in place to gather this knowledge. Preservation assumes that organisations have knowledge that many do not have - namely the fundamentals: what you have, where it is, how much there is and what it looks like. We needed a process and a vehicle to help us to document our data assets. This is where our DIY system comes in. I’ll now hand over to Ann who will tell you more about this.
So, we developed a process (to tidy the room). We created what we call the Retention, Preservation, and Disposal statements for statistical data, or RPDs. These are documents in which we record what statistical information we have, and what we plan to do with it. We began in 2007 with a basic template in Lotus Notes, but moved on to an Excel template when we realised that we needed more detailed information. The important thing for us was to make a start - and then we refined our process as we learned more about what we needed. This template allows us to standardise the collection of the information that we need. And, we were aware that someone needed to manage the process to ensure that the documented information meets consistent standards and that all studies are covered, which in our case is Information Management. And, most importantly, the RPD process is a collaborative process – a team sport. Statistical business units and Information Management work together to produce and evaluate the RPD statements for managerial approval.
Our process is low tech – our template is an Excel workbook with separate pages where we list the different types of information that we need. We need pages for the scope of the RPD – which provides a description of what the study is at a high level, and includes information about who the main users of the study are and what it is used for. This information helps us to know how much effort to put into archiving the data and the amount of metadata we will need to archive with the study so that it can be properly understood in the future. Other pages list datasets and documentation about the study, and lastly there is a page to record when the RPD statement will be formally reviewed.
The data page lists the datasets that the study produces. The most important information is the filename and location – including file path and server name - for each dataset. Also, record classes and disposal decisions for datasets are recorded here. All datasets produced are listed - those that will be archived and also those that will be destroyed when their operational purpose is complete. Also, any data that should be listed here, but cannot be found, must also be recorded, with a note saying that it cannot be found.
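As a rough illustration only, one row of the data page could be modelled like this. The real RPD is an Excel workbook, and every field name, the "archive"/"destroy" labels and the `validate` helper below are assumptions made for this sketch, not the actual template.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetEntry:
    """One row of a hypothetical RPD data page (field names are assumptions)."""
    description: str
    filename: Optional[str]    # None when the dataset cannot be found
    server: Optional[str]
    file_path: Optional[str]
    record_class: str          # hypothetical label, e.g. "final-dataset"
    disposal_decision: str     # assumed values: "archive" or "destroy"
    note: str = ""


def validate(entry: DatasetEntry) -> List[str]:
    """Return a list of problems; an empty list means the row looks complete."""
    problems = []
    located = all([entry.filename, entry.server, entry.file_path])
    # Data that cannot be located must still be listed, with a note saying so.
    if not located and "cannot be found" not in entry.note.lower():
        problems.append("missing location must carry a 'cannot be found' note")
    if entry.disposal_decision not in {"archive", "destroy"}:
        problems.append("disposal decision must be 'archive' or 'destroy'")
    return problems
```

A row with a filename, server and file path passes; a row with no location passes only once its note says the data cannot be found.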
The documentation about the study falls into these 3 types. Typically, the publications page would list the information releases, publications and articles, but could also include conference papers, or important presentations relating to the study. Statistical metadata is the information that makes the numbers into data – in that it gives context and meaning to the numbers in a dataset – it is, therefore, vital to the data being useful in the future, when everyone who knows about the study has gone. The information we expect to see here includes: sampling methods, questionnaires, classifications used, and processing documents. Lastly, the Corporate Information page will list any contracts, business cases, relevant corporate policies, etc. to do with the study. Any documentation that should be listed in these pages, but cannot be found or was never produced, must also be recorded, with a note to say so – this eliminates ‘time wasting’ - looking for things that cannot be found or never existed. Once again, we want to know: what it is, why it is, what dataset it refers to, and where it is located.
The RPD statement is reviewed on a regular basis. A formal review time is specified in the RPD. At this time a new version is created which is updated with the latest information about the study. We also meet with the data custodian for an annual informal review. This is a quick and casual meeting which is a valuable way of keeping in touch with who is responsible for the data, what is happening in the business unit, and what is being planned – a great way to gather intelligence and to have a feel for what is going on in the organisation. These reviews are crucial: they ensure that the process of documenting data and metadata is embedded in the organisation – it is not about doing it once.
When we first introduced the RPD process to the organisation – the reaction was predictable. Some people thought it was a great idea and could see the benefits immediately. Others were not so keen.
Everyone is already very busy, particularly those who are responsible for a number of studies and who publish their data monthly, quarterly and annually. Their focus very quickly moves from what they have just published to what is about to be published, and no one is keen to take on what they perceive to be extra work. The frequently asked question is how long will it take, or how much effort is required? And all we could say was – it depends…. It depends on how complex the study is. Some are complicated, like the Consumers Price Index, Balance of Payments and National Accounts. Others are relatively straightforward. It also depends on how well the data is already being managed. If data management is not good, the work to locate, appraise and document the data will take a long time. For example, quite some effort went into discovering the final dataset for the Marine Recreational Fishing Survey that was run in 1987 – it was eventually found – named <click> ‘Pisces’.
However, if data management practices are good, then it is just a matter of documenting the fact in the RPD statement. And even better, if standard file names and file structures are used, these will only need to be documented once. The point being that - if standard conventions are followed - it will take less time than if free spirits have been allowed to name and structure files creatively.
This is an example of a sensible naming convention. This is part of the data page for the Household Labour Force Survey RPD. This survey is published quarterly and produces New Zealand’s official employment and unemployment statistics. The survey has been published quarterly, for 25 years, producing more than 17 datasets each quarter, but the data page of the RPD statement is relatively simple. There are 18 lines to the data sheet and in general, they will not need to be changed. The only change will be to add an additional line if a special dataset is created for a particular purpose during a quarter.
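To make the "document it once" point concrete, here is a minimal sketch. It assumes a naming convention can be written as a pattern with placeholders for year and quarter. The patterns below are invented for illustration and are not the actual Household Labour Force Survey file names, which these notes do not reproduce.

```python
from typing import List

# Hypothetical patterns: each would be one line on the RPD data page, written once.
PATTERNS = [
    "survey_{year}q{quarter}_collected.dat",
    "survey_{year}q{quarter}_final.dat",
]


def datasets_for(year: int, quarter: int) -> List[str]:
    """Expand every documented pattern into the file names for one quarter."""
    return [pattern.format(year=year, quarter=quarter) for pattern in PATTERNS]
```

Under this assumption the list of patterns stays the same from quarter to quarter, and a new line is only needed when a special dataset is created.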
How did we get ‘buy in’? As Grace mentioned, we have the active support of senior management - which is the first thing you need if you are going to succeed. We spent a lot of time selling the benefits of having ‘up-to-date’ RPDs.
The benefit of being tidy and organised – of documenting what you have and where you put it – is reducing risks. There is the risk of knowledge being lost: for example, when information is stored in people’s heads, it is not available to the rest of the organisation, and it leaves the organisation when that person leaves. There is the risk of datasets not being stored in correct locations, e.g. if data is stored on personal drives, it needs to be moved to shared drives. And there is the risk of insufficient documentation about a process. We also provided support - mainly through personal help and encouragement, but also by publishing ‘A Guide to completing an RPD statement’, and an exemplar of a completed statement. The Appraisal guidelines and corporate policies and processes that were already established also supported decision making. Another way to get ‘buy in’ is to appreciate the effort that is put into completing the RPD statement. <click> http://www.flickr.com/photos/earthworm/2916565549/
The main currency of appreciation at Statistics NZ is chocolate. But we also provide positive feedback for those that have produced quality RPDs – at the time of manager sign off and also at performance review time.
Similar to appreciating the effort is appreciating the content, in our case - the data. Our statistical analysts love their data and find it infinitely interesting. Showing a genuine interest in understanding and valuing their data, data process, and metadata certainly helps to get the job done.
The disposal decisions recorded in RPDs provide an opportunity to free up space on shared drives. Lack of disc space is a constant issue at Stats. At Statistics NZ data is generally disposed of in one of 2 ways – it is either preserved or destroyed. We preserve data in the Data Archive, but before we agree to do this, an RPD statement must be completed and signed off by managers. When data has been successfully archived, the duplicate copies can be destroyed, which frees up valuable space on drives. The other option is to destroy data. We are terrible hoarders at Stats NZ - RPDs have introduced the idea that data can be destroyed, and in some cases, must be destroyed. But, data may not be destroyed unless it is listed in the RPD statement and the record class assigned to it allows for destruction.
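The destruction rule in the last sentence can be sketched as a small check. This is an illustrative sketch only: the function name, the dictionary layout and the record class labels are all hypothetical, and the real decision is made against the RPD statement and the disposal schedule, not in code.

```python
from typing import Dict, Optional, Set


def may_destroy(
    filename: str,
    rpd_record_classes: Dict[str, str],  # assumed layout: filename -> record class
    destroyable: Set[str],               # assumed: record classes that allow destruction
) -> bool:
    """Data may be destroyed only if it is listed and its record class allows it."""
    record_class: Optional[str] = rpd_record_classes.get(filename)
    if record_class is None:
        # Not listed in the RPD statement, so it may not be destroyed.
        return False
    return record_class in destroyable
```

Both conditions must hold: an unlisted file is never cleared for destruction, whatever its contents.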
Unexpected benefits have also helped with ‘buy in’. This is a case where rhetoric becomes tangible. Statistics New Zealand has a very productive office in Christchurch – our colleagues there produce and release a range of business and population statistics. At the time of the earthquake in February 2011, many of the studies that are compiled in the Christchurch office were being analysed or due to be released. The information stored in the RPD statements about file paths and server names was used to help identify and prioritise the work to recover data from the Christchurch server backup. The process documents that were recorded also enabled analysts in our other offices to help with analysing data.
So, what have we learned?
Do it as you go. Data management needs to be part of the business process – we need to manage data and metadata from the time the study begins, through the entire data cycle - and completing the RPD statement needs to be in the Business Unit’s work plan – as part of the survey documentation – at Stats, if it is in the work plan – it will happen! Retrospective metadata gathering is very difficult if not impossible. When a study becomes obsolete, people move on very quickly - this can make preserving the obsolete data - as meaningful data - very difficult. The metadata that makes the numbers meaningful must be documented as you go.
What makes it easier…. Choosing the right person for the job. The person coordinating the programme of work needs to be an influencer – someone who can get other people to do things for them, someone who understands the data cycle of the business unit and its pressure points – knows when to push, when to leave them alone, and when to help – someone who can work around ‘road blocks’ and gets things done. You also need the right person to complete the RPD statement: someone who understands the data through all parts of the data cycle, and who knows and understands the importance of metadata and the other documentation that supports the data. Completing RPDs is not a good way for a new person to learn about a study – that is a recipe for frustration all round!
Take your opportunities. The RPD process has provided an opportunity to influence and educate on best practice data management. During the process we see what actually happens – and when what actually happens is not what should happen – we have an opportunity to educate and change processes and habits to best practice, or at the very least, to alert the organisation to risks. We have experienced a change in culture regarding data management at Statistics NZ. The attitude towards the RPD process has changed. The importance of documenting information about data and metadata so that it is current and available is now embedded in process and is accepted as part of what we do.
RPDs were instrumental in moving us to our current state where we can now support the data management practices of our business over all stages. The Data Archive now provides datasets for researchers to use in the Statistics NZ Data Laboratory, and for internal staff to reuse in statistical production. Like we did with the RPD process, we are still refining. There is currently work underway to refine our archiving process through automation. And last, but definitely not least, as a result of what we have learnt with the RPDs, we now have our own “magical IT system” that will manage statistical metadata over the whole data cycle. This metadata repository is called Colectica and is about to be rolled out to multiple business units. So, our system and process has evolved into something bigger and better, which we think isn’t too bad for a bit of DIY.
First Things First: Figuring Out What to Preserve and Why. A case study of DIY data management. Grace Currie & Ann Jebson, March 2012
Systems? We should take a wide view of systems: • Processes • Understanding your business • Communicating with your business • Influencing workflows • Cultivating attitudes