Archiving Of Electronic Publications


    Archiving Of Electronic Publications - Presentation Transcript

Archiving Internet published statistics

Annegrete Wulff, Statistics Denmark, awu@dst.dk
7 January 2009

International Marketing and Output Database Conference, Cork, Ireland, 24-28 September 2007

Introduction

Slogans like "Time for Numbers - Numbers on Time" (Statistics Denmark) or "Wissen.Nutzen" (DSTATIS, Germany) refer to the fact that statistics should be timely and are the basis for good planning. Access to the latest figures as soon as they are published is increasingly important, and the Internet makes this possible. Yesterday’s figures and historic data are, however, also part of the description of our societies.

Most statistical offices, among them Statistics Denmark, used to have simple, objective and easy-to-follow rules for archiving their data:

- All printed publications were kept in stock in a few copies, and the National Archive received a copy.
- The statistical files behind the books were delivered to the National Archive.
- The documentation belonging to the statistical files was archived there as well.

Over the last 10 years the Internet has become an ever more important publishing medium, and hard copy publications have diminished in importance.

What is archived, and what is not?

Our legacy of statistics is still accessible and readable in the archives and in the libraries as printed books. It does not include all data that passed through our production; it represents everything that was published to be read by a range of users.

In this paper I exclude the readiness archiving we do in order to secure continued production. That means backup procedures for servers with data, programs and systems are not taken into account here. I shall focus on the archiving of data and information disseminated to the public in electronic form.

Today’s practice

Pdf documents

Currently all pdf publications are archived. If major errors are noted, a new release of the pdf is published and both versions are archived. In the case of minor errors the existing pdf is overwritten; thus the original is not archived. The electronic archive is accessible on the Internet and from our internal server. The archiving of Statistical Abstracts (Daily News) dates back to 1999.

www.dst.dk

An in-house developed crawler was put into operation in 2005. It is used to discover invalid links on the site as well as to archive the full site. Crawling and archiving are carried out according to the following schedule:

Yearly:
- Snapshot of all pages and sub-pages of the website, including all pdf files.

Monthly:
- All pages of the Danish version www.dst.dk as well as the English version www.dst.dk/uk are archived. (Pdf documents are excluded as they are archived separately.) This is a time-consuming process taking approximately 20 hours.
- Snapshot of the StatBank interface in the Danish as well as the English version. As the StatBank is an interactive databank, figures are retrieved through the user’s selection. Dynamic pages that contain forms, JavaScript or other elements that require “human interaction” cannot be archived. So neither the functionality of the StatBank nor the data resulting from a selection are archived this way.
- www.Alexa.com (the web archive) has recorded examples of our web site since 1997. In 1997 only three downloads of the site were made. In 2006 there were around 100. They are accessible on the Internet.

Weekly:
- Economic key indicators on the web.

Daily:
- www.dst.dk, front page of the web site, Danish version.
- www.dst.dk/uk, front page of the web site, English version.

Three times a day:
- Figures in the IMF DSBB agreement, www.dst.dk/imf.

www.statbank.dk

The user interface and layout of the StatBank are archived according to the procedure described in the previous section. The data in the StatBank, however, is not saved in that connection.
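The crawl schedule described above can be illustrated with a minimal sketch. It is not the in-house crawler, whose implementation the paper does not describe. The job descriptions follow the text, but the trigger times (a nightly run at 02:00, Mondays for weekly jobs, the first day of the month or year, and runs at 08:00, 12:00 and 16:00 for the IMF figures) are assumptions made only for illustration.

```python
from datetime import datetime

# (frequency, job) pairs taken from the schedule in the text.
JOBS = [
    ("yearly", "snapshot of all pages and sub-pages, including pdf files"),
    ("monthly", "all pages of www.dst.dk and www.dst.dk/uk, pdf files excluded"),
    ("monthly", "snapshot of the StatBank interface, Danish and English"),
    ("weekly", "economic key indicators"),
    ("daily", "front pages of www.dst.dk and www.dst.dk/uk"),
    ("three_times_daily", "IMF DSBB figures at www.dst.dk/imf"),
]

# Assumed trigger times; the paper gives only the frequencies.
NIGHTLY_HOUR = 2
THRICE_DAILY_HOURS = (8, 12, 16)


def is_due(frequency: str, now: datetime) -> bool:
    """Return True if a job with this frequency should run in the hour of `now`."""
    if frequency == "three_times_daily":
        return now.hour in THRICE_DAILY_HOURS
    if now.hour != NIGHTLY_HOUR:
        return False  # all other jobs run only in the nightly slot
    if frequency == "daily":
        return True
    if frequency == "weekly":
        return now.weekday() == 0  # Monday
    if frequency == "monthly":
        return now.day == 1
    if frequency == "yearly":
        return now.month == 1 and now.day == 1
    raise ValueError(f"unknown frequency: {frequency}")


def due_jobs(now: datetime) -> list[str]:
    """List the archiving jobs that are due in the hour of `now`."""
    return [job for frequency, job in JOBS if is_due(frequency, now)]
```

A scheduler would call `due_jobs(datetime.now())` once an hour and hand each returned job to the crawler. Keeping the schedule as data makes it easy to see at a glance which parts of the site are archived how often.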
The StatBank is the primary source for all our published statistics, so it would seem logical to secure the archiving of this primary source first. However, this is not the case. Statistics Denmark is preparing a set of rules regarding archiving, and in doing so we will balance the costs against the usefulness.

An error in a table may turn up after data has been published. As a result the data needs to be corrected. There are two ways of handling this:

1. A new file with corrected data is loaded and the original file is “unpublished” but still kept.
2. A new file with corrected data is loaded and overwrites the original file.

Both methods are used, although the one where all loads are kept is the more common. Every load is stored, including loads that were never published, but each stored file holds only the period or part of the table that was actually updated. The file also contains some metadata, but only as codes. As the archived files are not stored in the macro database environment, reading these files may be misleading if the metadatabase has been changed over time.
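The difference between the two correction methods can be made concrete with a small sketch. It is an illustration only: the class and method names (`Load`, `TableArchive`, `load_keeping_history`, `load_overwriting`, `as_of`) are hypothetical and do not describe the StatBank's actual storage. The point is that only the first method still holds the figures as they stood at an earlier date.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class Load:
    loaded_at: datetime
    data: dict
    published: bool = True


@dataclass
class TableArchive:
    """Hypothetical store of the loads made to one table, oldest first."""
    loads: list = field(default_factory=list)

    def load_keeping_history(self, when: datetime, data: dict) -> None:
        """Method 1: unpublish earlier loads but keep them in the archive."""
        for load in self.loads:
            load.published = False
        self.loads.append(Load(when, data))

    def load_overwriting(self, when: datetime, data: dict) -> None:
        """Method 2: the new load replaces the latest one, which is lost."""
        if self.loads:
            self.loads[-1] = Load(when, data)
        else:
            self.loads.append(Load(when, data))

    def current(self) -> Optional[dict]:
        """The data the public sees now: the latest published load."""
        published = [load for load in self.loads if load.published]
        return published[-1].data if published else None

    def as_of(self, when: datetime) -> Optional[dict]:
        """The latest load made at or before `when`, if it is still archived."""
        earlier = [load for load in self.loads if load.loaded_at <= when]
        return earlier[-1].data if earlier else None
```

With the first method, `as_of` can return the original figures for any date after the first load. With the second, the same lookup for a date before the correction finds nothing, because the original load no longer exists.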
The fact that not all erroneously published figures are archived and available has not been regarded as a major problem so far. Statistics Denmark considers the corrected figures to be the ones of interest to the majority of users. Moreover, it might disturb the majority of users if the erroneous data were also available, just to satisfy a very small minority. Nevertheless, when resources in our unit permit, we should look for a way of archiving these files that makes them easier to access without interfering with the corrected data. It should be mentioned that series and time periods holding correct data are never deleted from the StatBank.

If everything that we publish is to be available to the public in the future, we need to take the following “products” into account:

- All databank tables, with every single update and revision
- All versions of pdf documents
- Every single page of the web site
- Electronic, interactive publications, with all updates

Should we choose an ideal or a pragmatic solution? Will the archiving activity be enormous? Can we archive in a way that allows us to still retrieve and access the information? These are some of the challenges we need to solve.

Why do we archive?

There may be a range of reasons for an organisation to archive its products and activities. Some are “need to have” while others may be classified as “nice to have”.

Legal obligations

Pdf publications follow the same rules as printed publications. We are obliged to deliver a copy to the Royal Library. We also keep a copy in our own archive. From our archive the pdf is accessible to the public, while this is not the case in the Royal Library. There are no legal obligations for any of the other electronically published products.

Historical interest for data

Timeliness is an important quality indicator. However, not only the latest updated statistics are of interest.
Historians and others with an interest in the past often need to complement their research with statistical data. For this reason our output database StatBank grows larger and larger, as we do not delete even statistical series and subjects which are no longer updated. To keep the databank manageable for the public we have to review the presented structure of subjects and tables now and then.

One may also mention the interest in being able to check scientific research or analysis done by another researcher in the past. Then you need to be able to find the data as it was when this researcher did the analysis.

The website of Statistics Denmark is automatically loaded with selected data from the StatBank. These are Economic Key figures, Data in Focus etc., presented in HTML tables. These summary tables are used frequently by laymen as well as professionals. They are visited far more often than our pdf documents. They are archived only as part of the homepage. We should consider archiving these tables selectively.

Historical interest for set up and layout

The website of Statistics Denmark changes according to the content, products, values in focus etc., but also according to “state of the art” and “best practice” (sometimes called fashion). There is no obligation to archive these changes of style; the obligation lies with the Royal Library, which archives the domain .dk in total, although only once or twice a year. Their archive is not available to the public because of the protection of personal integrity.

Nevertheless, it is of interest to document the preferences Statistics Denmark had in the past. What was seen as important news? What did we focus on? Did we prioritize some users over others? How has the presentation of the organisation changed over time? This kind of archiving is “nice to have”.

Conclusion

Statistics Denmark today archives in accordance with all legal obligations. We even do more. Still, we need to consider what the needs of the future will be. If we are to fulfil such needs, we have to start archiving today.

Instead of archiving everything that is made available on the web, another possibility would be to archive exactly what the users have been looking at. Today all retrievals from the StatBank are kept in temporary files. These files are deleted the following night. If we moved these files to a permanent archive, we would have an exact picture of what was actually displayed to and looked at by the public. This is an opportunity that paper media do not give: we know how many books leave our shop, but we do not know whether every page is read at all or whether it is read several times by several users. As a bonus, such an archive could draw a picture of the interest different themes have had over the years.
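The idea of moving the nightly temporary files to a permanent archive could be sketched as follows. This is a minimal illustration under stated assumptions, not a description of the StatBank: the paper does not say how the temporary files are named or laid out, so the dated directory layout and the convention that each file name starts with a table id followed by an underscore are hypothetical.

```python
import shutil
from collections import Counter
from datetime import date
from pathlib import Path


def archive_retrievals(temp_dir: Path, archive_root: Path, day: date) -> int:
    """Move the day's retrieval files to archive_root/YYYY/MM/DD instead of deleting them.

    Returns the number of files moved.
    """
    target = archive_root / f"{day:%Y/%m/%d}"
    target.mkdir(parents=True, exist_ok=True)
    moved = 0
    # sorted() lists the directory first, so moving files does not disturb the loop.
    for path in sorted(temp_dir.iterdir()):
        if path.is_file():
            shutil.move(str(path), str(target / path.name))
            moved += 1
    return moved


def retrievals_per_table(archive_root: Path) -> Counter:
    """Count archived retrievals per table.

    Assumes (hypothetically) that file names look like '<table id>_<anything>'.
    """
    counts = Counter()
    for path in archive_root.rglob("*"):
        if path.is_file():
            counts[path.name.split("_", 1)[0]] += 1
    return counts
```

Run in place of the nightly deletion, `archive_retrievals` keeps an exact record of what was displayed, and `retrievals_per_table` gives the "bonus" picture of which tables attracted interest over time.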