The Right Combination:
Using DDI and PREMIS for data preservation
Parul Sharma & Sally Vermaaten

  • Much of the important information about the world we live in today is recorded as structured data rather than unstructured documentation. Structured data is diverse in content and expression, ranging from commercial databases containing client information to geospatial and scientific research datasets. As structured data, such as statistical data, contains important information that scientists, businesses, and researchers may want to reuse in the future, there is an increasingly urgent push for its preservation. Preservation and re-use of data requires that data be described with appropriate metadata that will allow future users and machines to discover and interpret it. Organisations that want to preserve data must make a series of choices about how to describe it using the right combination of standards. In this presentation, we will use the Statistics New Zealand Data Archive as a case study for examining the point of connection between a statistical metadata standard that supports active data management (DDI) and a metadata standard that supports preservation (PREMIS). We will share our experience in using DDI and PREMIS to describe statistical data and will highlight how data-specific metadata can be used to support long-term preservation.
  • We live in a data-driven society today. We’ve got vast quantities of geospatial data driving systems like Google Maps. We’ve got data-intensive sciences like astronomy that work with petabytes (1000 terabytes) of data. National statistical organisations like Stats NZ regularly collect data from individuals and businesses across the country to enable a better understanding of our society. There are swathes of online data collected every day by companies like Amazon and Facebook to help drive marketing decisions… and all this data is extremely useful and valuable.
    Image credits:
    http://rifm.org/default.htm
    http://www.stats.govt.nz/browse_for_stats/snapshots-of-nz/nz-in-profile-2012/~/media/Statistics/browse-categories/snapshots-of-nz/nz-in-profile/2012/nzip-2012-food-prices.PNG
  • For statistical organisations like Stats NZ, the primary driver for preservation of data is re-use of expensive data collections to answer questions that demand longitudinal data.
    Image credit: http://www.stats.govt.nz/Publications/MacroEconomic/productivity-stats-sources-methods.aspx
  • Common problems:
    - I can’t find it
    - I can’t identify the object
    - I can’t open it because I don’t have the right software/hardware or the object or media is damaged
    - I’m not sure it is the right thing (i.e. is it the authoritative version? has it been changed along the way?)
    Image credit: http://www.envelop.eu/shop/patterns/details/p/red-green-and-blue-apples
  • Researchers have lots of iterations of datasets during processing. The data uses codes I don’t have any documentation on. Other open questions: what are the variables measuring; what events during the collection phase could have affected the quality of the data; who was surveyed; what do the cryptic variable names mean; what weighting was applied; what sources were used.
  • Some of the solutions are more about statistical information, while others are about those common preservation or re-use problems. We could have a cumbersome organisation-specific standard, but it is better to have a combination of international standards.
  • Use a combination of international standards. There are a few great benefits of this:
    - This helps us and could help you by saving you from re-inventing the wheel.
    - International experts and a great community are at your disposal.
    - Data is interoperable, which makes it easier to create shared access points and to search and mine across data repositories.
    To describe data, particularly statistical data, we use DDI, which is a fairly complex standard for managing and describing data. DDI includes Dublin Core information like titles and creators or authors that helps users find data. PREMIS contains information that will help the archive preserve the data via checksums, file formats, and provenance information.
  • Significant characteristics to preserve (e.g. fonts, colors, content only). How do you bring these all together? And what happens if the same information is included in more than one standard? We’ve done some thinking about this and can share our experience and strategies to consider when deciding what to record where!
  • Looking back at our activities, some are more content-specific, i.e. just about data, and others are more general/common preservation activities.
  • If you haven’t started managing your data - you can go back to your desk tomorrow and think about what metadata you could start capturing to support long-term re-use – whether you’re the one with the preservation archive or you’re planning/hoping to hand off your data to someone else. If you have already started managing your data – you can check whether your current practices consider the following things
  • PREMIS: administrative metadata? DDI: descriptive metadata? Other overlap includes the DDI Archive module’s lifecycle events, which could contain the same information as PREMIS events, but this overlap is probably not useful.
  • At Statistics NZ, we’re implementing a tool that will allow our statisticians to capture the statistical information as DDI.
  • Don’t ignore data – it’s probably a key part of your core business
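
The notes above mention that PREMIS helps the archive preserve data via checksums. As a rough illustration only, the sketch below computes a checksum for a file and later re-checks it. The function names and the record keys are assumptions made for this example (the keys are loosely modelled on PREMIS fixity terms); this is not the Statistics NZ implementation and not a schema-valid PREMIS record.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_record(path: Path) -> dict:
    """Build a small fixity record for one file.

    The keys are hypothetical and only loosely follow PREMIS wording;
    a real archive would write these values into its own PREMIS XML.
    """
    return {
        "objectIdentifier": path.name,
        "messageDigestAlgorithm": "SHA-256",
        "messageDigest": sha256_of(path),
        "size": path.stat().st_size,
        "recordedAt": datetime.now(timezone.utc).isoformat(),
    }


def fixity_check(path: Path, record: dict) -> bool:
    """Return True when the file still matches its recorded checksum."""
    return sha256_of(path) == record["messageDigest"]
```

An archivist could call `fixity_record` when a file is ingested and run `fixity_check` on a schedule; a `False` result suggests the file has changed or been damaged since the record was made.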

Presentation Transcript

  • The Right Combination: Using DDI and PREMIS for data preservation. Parul Sharma & Sally Vermaaten, March 2012
  • Outline
    1. The context – drivers for preservation
    2. The problem – challenges faced when trying to re-use data
    3. Our solution – metadata for data management & preservation
    4. Our recommendations – strategies for making the right metadata choices
  • 1. THE CONTEXT: DRIVERS FOR PRESERVATION
  • Data is a cross-domain concern
    - Geospatial data
    - Scientific data
    - Statistical data
    - Financial and commercial data
  • There are many drivers for data preservation
    - Legal mandates
    - Cost of data collection
    - Verification
    - Uniqueness of data
    - Data re-use
  • An example of data re-use at Statistics New Zealand
  • 2. THE PROBLEM: CHALLENGES FACED WHEN TRYING TO RE-USE DATA
  • Common challenges to re-use/preservation of any type of digital object
    - I can’t find it
    - I can’t open it (wrong hardware/software)
    - I’m not sure it is the right thing
  • Unique challenges to re-use/preservation of structured data
    - I’m not sure it is the authoritative data
    - I don’t understand the meaning of the data - data is not self-descriptive
    - I can’t use the data because I can’t harmonize it with other data
  • 3. OUR SOLUTION: METADATA FOR DATA MANAGEMENT & PRESERVATION
  • Our solutions
    - I can’t find the data (common): have subject experts record locations; archivists put it in a safe place
    - I can’t open the data (common): archivists monitor file formats
    - I’m not sure it’s the right thing / it’s the authoritative data (particularly hard with data): have subject experts identify key datasets; subject experts & archivists capture what has happened to the data
    - I don’t understand the meaning of the data (particularly hard with data): have subject experts capture important data; archivists capture or QA metadata
    - I can’t reuse the data because it’s not harmonised (unique to data): archivists and subject experts capture detailed metadata; tools to create more standardised data
  • To support these processes… Metadata is key. We could invent our own standard for recording metadata but there is a better way …
  • How? Dublin Core + Data Documentation Initiative (DDI) + PREservation Metadata: Implementation Strategies (PREMIS)
    - Dublin Core: Discover!
    - DDI: Describe!
    - PREMIS: Preserve!
  • Comparison of standards coverage
    - Dublin Core: discovery information about a resource (e.g. Title, Creator, Publication date)
    - DDI: surveys and outputs (Series and Studies); methodology & quality information; classifications used; dataset descriptions; variables used; links to documentation
    - PREMIS: Objects (significant characteristics, checksums, basic identifying information); Events (preservation actions); Agents; Rights
  • Metadata to support re-use: DDI and PREMIS
    - I can’t find the data: have subject experts record locations; archivists put it in a safe place
    - I’m not sure it’s the authoritative data: have subject experts identify key datasets; subject experts & archivists capture what has happened to the data
    - I don’t understand the meaning of the data: have subject experts capture important metadata; archivists capture or QA metadata
    - I can’t open the data: archivists monitor file formats
    - I can’t reuse the data because it’s not harmonised: archivists and subject experts capture detailed metadata; tools to create more standardised data
  • 4. OUR RECOMMENDATIONS: STRATEGIES FOR MAKING THE RIGHT METADATA CHOICES
  • Metadata Top Tips
    1. Create structures that will allow you to re-use metadata tools
    2. Use standards that are fit for your content so users can re-use
    3. Consider overlap between standards so you’re using the right standard for the right job
    4. Provide standard based tools and capture at point of creation to improve quality and efficiency
  • 1. Create structures that will allow you to re-use metadata tools. Set yourself up to be able to use the same tools to harvest and mine your metadata (e.g. handy reports, searching across content types) by:
    – developing a standard structure that can support all your content types
    – and recording generic information in generic metadata standards
  • Data_1500
      DublinCore.xml (non-format specific metadata)
      PREMIS.xml (non-format specific metadata)
      Original
        data.sas7bdat
        questionnaire.doc
      ArchiveMaster (format specific structure & metadata)
        Data
          data.csv
        Documentation
          questionnaire.pdf
        Metadata
          DDI.xml
    Database_0120
      DublinCore.xml (non-format specific metadata)
      PREMIS.xml (non-format specific metadata)
      Original
        database.mdb
      ArchiveMaster (format specific structure & metadata)
        Header
          metadata.xsd
          metadata.xml
        Content
          Schema1
            Table1
              table.xsd
              table.xml
  • 2. Use standards that are fit for your content so users can re-use. Enable future re-use and understanding by recording format or content-specific metadata in fit-for-purpose standards, e.g.
    - DDI for statistical data
    - SIARD for databases
    - MIX for images
  • 3. Consider overlap between standards so you’re using the right standard for the right job
    - Basic identifying information
        DDI: Title, Creator, PublicationDate, ID
        Dublin Core: Title, Creator, Date, Identifier
        Useful to duplicate? Yes
    - Access information
        DDI: Access Conditions
        PREMIS: Rights entity
        Dublin Core: Rights
        Useful to duplicate? No – PREMIS is most expressive and generic location
  • 4. Provide standard based tools and capture at point of creation to improve quality and efficiency
    - At first, you may need to capture or collate all metadata about data yourself
    - Think ahead about tools you might be able to provide to data experts to allow them to record the information directly in the standard if possible
  • Takeaways
    1. Organisations have many reasons to re-use data over time
    2. There are unique challenges to preserving data
    3. Where possible, save yourself some work and make your metadata more harvestable and data more understandable by using international standards like DDI and PREMIS
    4. When you use metadata standards like DDI and PREMIS together:
       • create generic structures
       • use fit-for-purpose standards for specific content
       • consider information overlap
       • ‘delegate’ metadata capture where possible
  • Thanks! Sally Vermaaten (sally.vermaaten@stats.govt.nz), Parul Sharma (parul.sharma@stats.govt.nz)
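
Tip 1 above recommends a standard package structure with generic information recorded in generic metadata standards. One benefit is that a single small tool can then check any package, whatever its content type. The sketch below is an illustration under assumptions: it takes the file and folder names from the example packages in the slides (DublinCore.xml, PREMIS.xml, Original, ArchiveMaster) and treats them as required, which the slides do not state; the function name is hypothetical.

```python
from pathlib import Path
from typing import List

# Names taken from the example packages in the slides; treating them as
# mandatory for every package is an assumption made for this sketch.
GENERIC_FILES = ("DublinCore.xml", "PREMIS.xml")
GENERIC_FOLDERS = ("Original", "ArchiveMaster")


def missing_generic_parts(package_dir: Path) -> List[str]:
    """List the generic files and folders that a package lacks.

    Folders are reported with a trailing slash so that they are easy to
    tell apart from files in a report.
    """
    missing = [name for name in GENERIC_FILES
               if not (package_dir / name).is_file()]
    missing += [name + "/" for name in GENERIC_FOLDERS
                if not (package_dir / name).is_dir()]
    return missing
```

Because the check only looks at the generic layer, the same function would apply to a statistical dataset package and a database package alike; format-specific checks (for example on DDI.xml) would be separate.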