Archiving Of Electronic Publications - Presentation Transcript
Annegrete Wulff
Statistics Denmark
awu@dst.dk 7 January 2009
Archiving Internet published statistics
International Marketing and Output Database Conference
Cork, Ireland 24-28 September 2007
Introduction
Slogans like “Time for Numbers - Numbers on Time” [1] or “Wissen.Nutzen” [2] refer
to the fact that statistics should be timely and are the basis for good planning.
Access to the latest figures as soon as they are published is increasingly
important. The use of the Internet makes this possible.
Yesterday’s figures and historic data are, however, also part of the
description of our societies.
Most statistical offices – among them Statistics Denmark – used to have
simple, objective and easy-to-follow rules concerning the archiving of their data:
− A few copies of all printed publications were kept in stock and the
National Archive received a copy.
− The statistical files behind the books were delivered to the National
Archive.
− The documentation belonging to the statistical files was archived there
as well.
Over the last 10 years the Internet has become an ever more important
publishing medium and hard copy publications have diminished in
importance.
What is archived – and what is not?
Our legacy of statistics is still accessible and readable in the archives and in
the libraries as printed books. It does not include all data that passed
through our production. It represents everything that was published to
be read by a range of users.
In this paper I exclude the readiness archiving we do in order to secure
continued production. That means backup procedures for servers with data,
programs and systems will not be taken into account here. I shall focus on
the archiving of data and information disseminated to the public in
electronic form.
Today’s practice
Pdf documents
Currently all pdf publications are archived. If major errors are noted, a new
release of the pdf is published and both versions are archived. In the case of
minor errors the existing pdf is overwritten; thus the original is not
archived. The electronic archive is accessible on the Internet and from our
internal server. The archiving of Statistical Abstracts (Daily News) dates
back to 1999.

[1] Statistics Denmark
[2] DESTATIS, Germany
www.dst.dk
An in-house developed crawler was put into operation in 2005. It is used to
discover invalid links on the site as well as to archive the full site.
Crawling and archiving are carried out according to the schedule below:
Yearly:
− Snapshot of all pages and sub-pages of the website, including all pdf
files.
Monthly:
− All pages on the site of the Danish version www.dst.dk as well as the
English version www.dst.dk/uk are archived. (Pdf documents are
excluded as they are archived separately.) This is a time-consuming
process taking approximately 20 hours.
− Snapshot of the StatBank interface in the Danish as well as the English
version. As the StatBank is an interactive databank, figures are retrieved
through user selections. Dynamic pages that contain forms, JavaScript
or other elements that require “human interaction” cannot be archived.
So neither the functionality of the StatBank nor the data resulting from
a selection is archived this way.
− www.Alexa.com (The web archive) has since 1997 recorded examples of
our web site. In 1997 only three downloads of the site were made. In
2006 there were around 100. They are accessible on the Internet.
Weekly:
− Economic key indicators on the web
Daily:
− www.dst.dk front page of the web site, Danish version
− www.dst.dk/uk front page of the web site, English version
Three times a day:
− Figures in the IMF DSBB agreement www.dst.dk/imf
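The schedule above can be written down as a small job table. The following is only a sketch under stated assumptions, not the crawler Statistics Denmark actually runs: the run hours (6, 12 and 18), the choice of Monday for weekly jobs and of the first day of the month or year for monthly and yearly jobs, and the names `SCHEDULE`, `RUN_HOURS` and `due_frequencies` are all hypothetical.

```python
from datetime import datetime

# Hypothetical job table mirroring the schedule described in the text.
SCHEDULE = {
    "yearly": ["snapshot of all pages and sub-pages, including pdf files"],
    "monthly": [
        "all pages of www.dst.dk and www.dst.dk/uk (pdf documents excluded)",
        "snapshot of the StatBank interface, Danish and English",
    ],
    "weekly": ["economic key indicators"],
    "daily": ["www.dst.dk front page", "www.dst.dk/uk front page"],
    "three_times_a_day": ["www.dst.dk/imf"],
}

# Assumed run hours; the source does not say when the crawler runs.
RUN_HOURS = (6, 12, 18)


def due_frequencies(now: datetime) -> list:
    """Return the schedule keys whose jobs should start at the hour of `now`."""
    due = []
    if now.hour in RUN_HOURS:
        due.append("three_times_a_day")
    if now.hour == RUN_HOURS[0]:
        # Everything less frequent than "three times a day" starts with
        # the first run of the day.
        due.append("daily")
        if now.weekday() == 0:  # Monday (assumption)
            due.append("weekly")
        if now.day == 1:  # first day of the month (assumption)
            due.append("monthly")
            if now.month == 1:  # first day of the year (assumption)
                due.append("yearly")
    return due


if __name__ == "__main__":
    for key in due_frequencies(datetime(2007, 1, 1, 6)):
        for job in SCHEDULE[key]:
            print(key, "->", job)
```

Keeping the schedule as data rather than as code makes it easy to document alongside the archive itself, which matters when someone later needs to know how often a given page was captured.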
www.statbank.dk
The user interface and layout of the StatBank are archived according to the
procedure described in the previous section. The data in the StatBank,
however, is not saved in that process. The StatBank is the primary
source for all our published statistics, so it would seem logical to secure
the archiving of this primary source first. However, this is not the
case. Statistics Denmark is preparing a set of rules regarding archiving.
In doing so we will balance the costs against the usefulness.
An error in a table may turn up after data has been published. As a result
the data needs to be corrected. There are two ways of handling this:
1. A new file with corrected data is loaded and the original file is
“unpublished” but still kept.
2. A new file with corrected data is loaded and overwrites the original file.
Both methods are used, although the one where all loads are kept is the
more common. Each load is stored (even loaded data which was never
published), but it holds only the period or part of the file that was actually
updated. The file also contains some metadata, but only codes. As the archived
files are not stored in the macro database environment, reading these files
may be misleading if the metadatabase has been changed over time.
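The difference between the two correction methods can be illustrated with a small sketch. It is not the StatBank's actual load mechanism; the names `Load`, `TableArchive`, `correct` and `history`, and the use of plain dictionaries keyed by period, are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Load:
    """One loaded file: only the periods touched by this load."""
    values: Dict[str, float]
    published: bool = True


@dataclass
class TableArchive:
    loads: List[Load] = field(default_factory=list)

    def load(self, values: Dict[str, float]) -> int:
        """Store a new load and return its position in the archive."""
        self.loads.append(Load(dict(values)))
        return len(self.loads) - 1

    def correct(self, index: int, corrected: Dict[str, float],
                keep_original: bool = True) -> None:
        if keep_original:
            # Method 1: the original is "unpublished" but still kept.
            self.loads[index].published = False
            self.loads.append(Load(dict(corrected)))
        else:
            # Method 2: the corrected file overwrites the original.
            self.loads[index] = Load(dict(corrected))

    def current(self) -> Dict[str, float]:
        """Figures as the public sees them: published loads, latest wins."""
        merged: Dict[str, float] = {}
        for ld in self.loads:
            if ld.published:
                merged.update(ld.values)
        return merged

    def history(self, period: str) -> List[float]:
        """Every value ever stored for a period, including unpublished ones."""
        return [ld.values[period] for ld in self.loads if period in ld.values]


if __name__ == "__main__":
    kept = TableArchive()
    first = kept.load({"2006": 100.0, "2007": 110.0})
    kept.correct(first, {"2007": 115.0}, keep_original=True)
    print(kept.current(), kept.history("2007"))
```

With method 1 the erroneous figure remains retrievable through `history` without appearing in `current`; with method 2 it is gone. The sketch stores values only, so it shares the weakness noted above: codes in an old load can only be interpreted against the metadata that was valid when the load was made.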
The fact that not all erroneous published figures are archived and available
has not been regarded as a major problem so far. Statistics Denmark considers
the corrected figures to be the ones of interest for the majority of the users.
Moreover, it might disturb the majority of users if the erroneous data
were also available, just to satisfy a very small minority. Nevertheless,
when resources in our unit permit, we should pay attention to an archiving
method for these files that makes it possible to access them in a better way
without interfering with the corrected data.
It should be mentioned that series and time periods holding correct data
are never deleted from the StatBank.
Should everything that we publish be available to the public in the future,
we need to take the following “products” into account:
− All databank tables – every single update and revision
− All versions of pdf documents
− Every single page of the web site
− Electronic, interactive publications – all updates.
Should we choose an ideal or a pragmatic solution? Will the archiving
activity be enormous? Can we archive in a way that allows us to still retrieve
and access the information?
These are some of the challenges we need to solve.
Why do we archive?
There may be a range of reasons for an organisation to archive its products
and activities. Some are “need to have” while others may be classified as
“nice to have”.
Legal obligations
Pdf publications follow the same rules as printed publications. We are
obliged to deliver a copy to the Royal Library. We also keep a copy in
our own archive. From our archive the pdf is accessible to the public,
while this is not the case in the Royal Library.
There are no legal obligations on any of the other electronically published
products.
Historical interest for data
Timeliness is an important quality indicator. However, not only the latest
updated statistics are of interest. Historians and others with an interest in
the past often need to complement their research with statistical data. For
this reason our output database StatBank grows larger and larger, as we do not
delete even statistical series and subjects which are no longer updated. To
keep the databank manageable for the public we now and then have to
review the presented structure of subjects and tables.
One may also mention the interest in being able to check scientific research
or analysis done by another researcher in the past. For that, one needs to be
able to find the data as it was when that researcher did the analysis.
The website of Statistics Denmark is automatically loaded with selected
data from the StatBank. These are Economic Key figures, Data in Focus etc.,
presented in HTML tables. These summary tables are used frequently by
laymen as well as professionals. They are visited far more often than our pdf
documents. They are archived only as part of the homepage. We should
consider archiving these tables selectively.
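One way to archive such summary tables selectively would be to pull only the tables out of a saved page and store them on their own. The sketch below uses Python's standard `html.parser`; the names `TableExtractor` and `extract_tables` are hypothetical, nested tables are not handled, and nothing here reflects how the Statistics Denmark site is actually built.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collects the cell text of every <table> in an HTML page."""

    def __init__(self):
        super().__init__()
        self.tables = []   # each table is a list of rows, each row a list of cells
        self._row = None   # row currently being read, or None
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None


def extract_tables(html: str) -> list:
    """Return all tables found in `html` as nested lists of cell strings."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables


if __name__ == "__main__":
    page = ("<html><body><p>News</p><table>"
            "<tr><th>Indicator</th><th>Value</th></tr>"
            "<tr><td>Example A</td><td> 1.5 </td></tr>"
            "</table></body></html>")
    print(extract_tables(page))
```

Storing the extracted rows together with the capture date would keep the figures readable even after the surrounding page layout has changed.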
Historical interest for set up and layout
The website of Statistics Denmark changes according to the content,
products, values in focus etc., but also according to “state of the art” and “best
practice” (sometimes called fashion). There is no obligation for us to archive these
changes of style; the obligation lies with the Royal Library, which archives
the entire .dk domain, though only once or twice a year. Their archive is
not available to the public, in order to protect personal privacy.
Nevertheless it is of interest to document the preferences Statistics
Denmark had in the past. What was seen as important news? What did we
focus on? Did we prioritize some users over others? How has the presentation
of the organisation changed over time? This kind of archiving is
“nice to have”.
Conclusion
Statistics Denmark is today archiving in accordance with all legal obligations.
We are doing even more. Still, we need to consider what the needs will be in
the future. If we are to fulfil such needs, we have to start archiving
today.
Instead of archiving everything that is made available on the web, another
possibility would be to archive exactly what the users have been looking at.
Today all retrievals from the StatBank are kept in temporary files. These
files are deleted the following night. If we moved these files to a
permanent archive we would have an exact picture of what was actually
displayed and looked at by the public. This is an opportunity that paper
media do not give: we know how many books leave our shop but we do
not know whether every page is read at all or whether it is read several times
by several users.
As a bonus, such an archive could draw a picture of the interest that different
themes have attracted over the years.
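The idea of moving the temporary retrieval files to a permanent archive instead of deleting them could look roughly like this. It is a sketch only: the directory layout (`<archive>/<year>/<month-day>/`), the assumption that each file name starts with a table identifier followed by an underscore, and the function names `archive_retrievals` and `retrievals_per_table` are all hypothetical.

```python
import shutil
from collections import Counter
from datetime import date
from pathlib import Path


def archive_retrievals(temp_dir: Path, archive_dir: Path, day: date) -> int:
    """Move every file in temp_dir to archive_dir/<year>/<month-day>/.

    Intended to replace the nightly deletion. Returns the number of files moved.
    """
    target = archive_dir / f"{day:%Y}" / f"{day:%m-%d}"
    target.mkdir(parents=True, exist_ok=True)
    moved = 0
    for f in sorted(temp_dir.iterdir()):
        if f.is_file():
            shutil.move(str(f), str(target / f.name))
            moved += 1
    return moved


def retrievals_per_table(archive_dir: Path) -> Counter:
    """Count archived retrievals per (year, table id).

    Assumes file names of the form '<tableid>_<anything>'.
    """
    counts = Counter()
    for f in archive_dir.rglob("*"):
        if f.is_file():
            year = f.relative_to(archive_dir).parts[0]
            table_id = f.name.split("_", 1)[0]
            counts[(year, table_id)] += 1
    return counts
```

Counting per table and year is what would give the "bonus" picture of how interest in different themes develops, provided table identifiers can be mapped to themes.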