2. Selection Policy
Selection policy is determined by a number of different factors:
- Remit and mission of the collecting organization
- Intellectual property rights issues
- Institutional resources available
3. Usage of Web Archiving Program
A web archiving program can be implemented by a variety of
institutions:
- Libraries
- Archives
- Research organizations
- Learned societies
- Commercial organizations
5. Objectives of this Topic
• A model process for making selection decisions
• The context in which the decisions take place
• Possible approaches for selection
• Selection criteria
• Elements required to create the selection and
collection list
6. The Selection Process
The selection process can be broken down into smaller components:
POLICY DEFINITION
SELECTION POLICY
MAINTENANCE
SELECTION
COLLECTION LIST
QUALITY ASSURANCE
COLLECTION
8. Selection Policy
A well-defined selection policy is an essential foundation for any
web archiving program. The nature of the policy depends upon
the requirements of the individual organization, but its formulation
will typically require the following steps:
- Context
- Selection Methods
- Selection Criteria
9. Selection Context
An understanding of the broader context in which the selection
policy is going to work is central to the formulation of the
policy itself.
10. SELECTION METHODS
A number of different approaches to selection are possible,
which may be categorized according to their scope:
- Unselective Approach
- Selective Approach
- Thematic Approach
11. Unselective Approach
In this approach the decision is taken not to select, but rather
to collect everything possible. It is based on three main arguments:
1- Complete Contents of the Resource (Interconnectedness)
2- Selection is Expensive and Time Consuming
3- Technical Feasibility (Deep Web / Surface Web)
12. Thematic Approach
It is also called the semi-selective approach. In this approach four
points are considered:
1- Subject: Selection according to the subject area (subject domain).
2- Creator: Selection according to the creator of the
web resource. This may be a government agency or a publisher.
3- Genre: The scope of selection may be defined by a
specific genre of resource, such as publications, blogs,
web art or government records.
4- Domain: The scope of selection is defined in terms of specific
web domains such as “.uk” or “.edu”.
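A domain-based scope can be sketched as a simple test on the host name of each candidate URL. This is a minimal illustration only: the function name in_domain_scope and the example hosts are hypothetical and do not come from any particular crawler.

```python
from typing import Sequence
from urllib.parse import urlsplit


def in_domain_scope(url: str, allowed_suffixes: Sequence[str]) -> bool:
    """Return True if the URL's host ends with one of the allowed suffixes.

    Suffixes are written with a leading dot, e.g. ".uk" or ".edu".
    """
    host = (urlsplit(url).hostname or "").lower()
    return any(host.endswith(suffix.lower()) for suffix in allowed_suffixes)


# Hypothetical candidate URLs, checked against a ".uk" / ".edu" scope.
scope = [".uk", ".edu"]
print(in_domain_scope("http://www.example.gov.uk/reports/", scope))
print(in_domain_scope("http://www.example.com/", scope))
```

A real policy would usually combine a test like this with the subject, creator and genre criteria rather than rely on the domain alone.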
13. Selective Approach
The most narrowly defined selection method is to identify specific
web resources for collection, such as a single web publication or
website.
14. SELECTION CRITERIA
Once the selection method is identified, a set of specific
selection criteria can be finalized. The criteria allow individual
selection decisions to be made, which in turn are translated into a
list of web resources to be collected. The criteria are based on three
issues:
- Content
- Extent
- Timing and Frequency
15. Content
Criteria must be established to define the nature of the web resources eligible
for selection in terms of their intellectual content.
Extent
Criteria must also be established for determining the extent of selected resources.
Example
It may be stated that no external links from websites will be collected.
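An extent rule of this kind can be sketched as a filter over the links found on a page. The sketch below assumes that "external" means "hosted on a different host name from the page itself"; the function name internal_links and the example URLs are hypothetical.

```python
from urllib.parse import urljoin, urlsplit


def internal_links(page_url, links):
    """Keep only links that resolve to the same host as the page they appear on."""
    page_host = urlsplit(page_url).hostname
    kept = []
    for link in links:
        absolute = urljoin(page_url, link)  # resolve relative links against the page
        if urlsplit(absolute).hostname == page_host:
            kept.append(absolute)
    return kept


found = ["b.htm", "/c/d.htm", "http://other.example.com/x.htm"]
print(internal_links("http://www.example.org/a/index.htm", found))
```

A policy could equally define the extent by directory path or by link depth; the same-host test is only one possible reading of the rule.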
Timings and Frequency
The timing and frequency of collection of each selected web resource should
be clearly defined in the collection list. These may be influenced by a number
of factors:
1- Lifecycle
2- Topicality / Significance
3- Risk Assessment
4- Rate of Content Change
16. Lifecycle
The nature of a web resource may be defined in terms of its
active lifecycle, which may be open-ended or of limited duration.
Example: Many websites exist and evolve over an indefinite
period of time, whereas some event-based websites
have a planned completion point, after which the
content becomes fixed and the website may even cease
to be maintained.
17. Rate of Change
A web resource's content may be dynamic or fixed. Some websites
or individual pages may remain static for months, whereas others
may change many times a day.
Example: A typical web page may be updated on a regular
basis, whereas a journal article may be published to the
web in a finished form.
The rate of change is therefore an important factor in
determining the frequency with which a resource should be
collected.
18. Risk Assessment
The selection policy should identify the types of risk to be monitored
that can affect a specific resource.
Example: The name and version of a web server can be
identified through analysis of the HTTP headers
generated by the site.
Automated tools can be used to monitor the availability of a
website and track the frequency and duration of any downtime.
The use of an outdated web server and the occurrence of frequent
periods of downtime could be indicators of poor management
practices and therefore signify a high degree of risk.
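Both indicators can be sketched in a few lines. The sketch assumes that the site sends the standard HTTP Server header in the common "name/version" form (many servers omit the version, in which case None is returned) and that a monitoring tool has already recorded availability checks as True or False values. The function names are hypothetical.

```python
import re


def parse_server_header(value):
    """Split a Server header such as 'Apache/2.2.3 (CentOS)' into (name, version)."""
    match = re.match(r"\s*([^/\s]+)(?:/(\S+))?", value)
    if not match:
        return None, None
    return match.group(1), match.group(2)


def downtime_ratio(checks):
    """Fraction of availability checks (True = site was up) that failed."""
    if not checks:
        return 0.0
    failed = sum(1 for up in checks if not up)
    return failed / len(checks)


print(parse_server_header("Apache/2.2.3 (CentOS)"))
print(downtime_ratio([True, False, True, True]))
```

What counts as an "outdated" version or a "frequent" outage is a policy decision and is deliberately left out of the sketch.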
19. Topicality and Significance
A major factor in determining the frequency of collection may
be a subjective assessment of the topicality or underlying
significance of a given resource.
Example: The National Archives collects the majority of UK
government websites on a biannual basis. However, for the
duration of the Iraq conflict in 2004, it prioritized
websites related to defense and foreign policy for high-intensity
collection at weekly intervals.
20. Defining the Boundaries
Once the selection policy is implemented, it will generate a list of web
resources to be collected. This list may be contained in the selection policy
itself, if it is static, or exist as a freestanding document, if it is dynamic. In
the selection policy, the boundaries of each selected web resource should be
defined to allow it to be collected.
Web resources are defined in terms of a uniform resource locator (URL),
which provides a unique address for that resource within the World Wide Web.
A URL comprises the following elements:
1- Scheme
2- Domain
3- Path
21. Scheme
The scheme defines the format of the URL and usually names a
communication protocol, such as the Hypertext Transfer Protocol
(HTTP) or the File Transfer Protocol (FTP).
Scheme: http://
22. Domain
The domain defines the host for the web resource. It comprises two or more labels
separated by dots ‘.’ and is read from right to left.
Domain name: www.nationalarchives.gov.uk
The rightmost label is the top-level domain, which specifies either a country code
(such as uk for the United Kingdom) or a generic category (such as com for commercial organizations).
The label to its left is the second-level domain, which generally describes the
name or type of the hosting organization (as in microsoft.com or gov.uk).
Labels further to the left may be used to define further domain and subdomain
levels.
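The right-to-left reading order can be illustrated by splitting a host name into its labels and reversing them. A minimal sketch; the function name domain_levels is hypothetical.

```python
def domain_levels(hostname):
    """Return the labels of a host name ordered from the top-level domain downwards."""
    labels = hostname.lower().rstrip(".").split(".")
    return list(reversed(labels))


# Index 0 is the top-level domain, index 1 the second-level domain, and so on.
print(domain_levels("www.nationalarchives.gov.uk"))
```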
23. Domain Name
The domain name must be translated into an Internet Protocol
(IP) address, which uniquely identifies each host computer on the
internet. This translation is performed by a Domain Name
System (DNS) server, which maintains a record of domain names
and their IP addresses.
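In Python, the translation can be requested from the operating system's resolver with the standard socket module. A minimal sketch: the wrapper name resolve_host is hypothetical, and resolving a real domain name requires network access, so no specific address is shown.

```python
import socket


def resolve_host(hostname):
    """Return the IPv4 address that the system resolver reports for a host name."""
    return socket.gethostbyname(hostname)


# A value that is already an IPv4 address is returned unchanged, without a DNS lookup.
print(resolve_host("127.0.0.1"))
```

Calling resolve_host("www.nationalarchives.gov.uk") would ask the configured DNS server for the address of that host.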
24. Path
The path specifies the location of the web resource within the
directory structure of the host web server and is read from left
to right.
Path: preservation/webarchive/default.htm
In the above example the URL points to a file called default.htm,
located within the directory path preservation/webarchive/
on the host web server www within the domain
nationalarchives.gov.uk.
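The three elements can be separated with the standard urllib.parse module. The sketch below assembles the scheme, domain and path from the preceding slides into one URL; the function name url_elements is hypothetical.

```python
from urllib.parse import urlsplit


def url_elements(url):
    """Split a URL into the scheme, domain and path elements described above."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "domain": parts.hostname,
        "path": parts.path.lstrip("/"),  # drop the leading slash to match the slide
    }


print(url_elements("http://www.nationalarchives.gov.uk/preservation/webarchive/default.htm"))
```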
25. Timings and Frequency of Collection
The collection list must define the timing and frequency with
which each selected web resource is to be collected. Four basic
scenarios are possible:
- Repeated Collection
- Ad-Hoc Collection
- One Off Collection
- Comprehensive Collection
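One way to picture a collection list entry is as a small record holding the URL, the scenario and, for repeated collection, the interval. This is a sketch under assumptions: the class names, field names and the interval_days unit are hypothetical and are not prescribed by the selection policy.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Scenario(Enum):
    REPEATED = "repeated"
    AD_HOC = "ad-hoc"
    ONE_OFF = "one-off"
    COMPREHENSIVE = "comprehensive"


@dataclass
class CollectionEntry:
    url: str
    scenario: Scenario
    interval_days: Optional[int] = None  # only meaningful for repeated collection

    def __post_init__(self):
        # A repeated collection without an interval cannot be scheduled.
        if self.scenario is Scenario.REPEATED and not self.interval_days:
            raise ValueError("repeated collection needs an interval")


# Hypothetical entries: a weekly crawl and a single capture.
weekly = CollectionEntry("http://www.example.org/", Scenario.REPEATED, interval_days=7)
single = CollectionEntry("http://www.example.org/report.htm", Scenario.ONE_OFF)
print(weekly.scenario.value, weekly.interval_days)
print(single.scenario.value, single.interval_days)
```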
26. Repeated Collection
In this approach the web resource is collected at repeated
intervals. This approach is suitable for dynamic resources
with open-ended lifecycles. Changes are captured by
collecting static “snapshots” of the resource at the intervals
the collection policy defines. Normal practice is to collect weekly,
monthly or annually.
A decision has to be made about which collection technique the
selection policy will follow for dynamic
resources. There are two kinds of collection technique:
- Incremental
- Complete
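The difference between the two techniques can be sketched as follows. The slides do not define them, so the sketch assumes the usual meaning: a complete collection stores every resource at each snapshot, while an incremental collection stores only resources that are new or whose content differs from the previous snapshot. Content is compared by SHA-256 digest; the function names and the sample data are hypothetical.

```python
import hashlib


def digest(content: bytes) -> str:
    """Fingerprint a resource's content so that two snapshots can be compared."""
    return hashlib.sha256(content).hexdigest()


def plan_collection(previous, current, incremental=True):
    """Decide which URLs to store for this snapshot.

    previous: {url: digest} recorded at the last snapshot
    current:  {url: bytes} as fetched now
    """
    if not incremental:
        return sorted(current)  # complete: store everything again
    return sorted(
        url for url, body in current.items()
        if previous.get(url) != digest(body)  # new or changed since last time
    )


previous = {"a.htm": digest(b"one"), "b.htm": digest(b"two")}
current = {"a.htm": b"one", "b.htm": b"changed", "c.htm": b"new"}
print(plan_collection(previous, current))
print(plan_collection(previous, current, incremental=False))
```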
27. Ad-Hoc Collection
Web resources may change at unpredictable rates. Where this is
the case, repeated collection at a fixed frequency may prove
inefficient, resulting in the repeated collection of the same
content. An alternative approach is to collect in response to a
trigger event, such as some form of automated or manual
monitoring of the resources or an alert from some external
source.
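One possible automated trigger is a change in the validators that many web servers send with a resource, namely the standard HTTP ETag and Last-Modified response headers. The sketch below assumes the headers from the last collection and from the latest check are already available as dictionaries with those exact key names; the function name change_triggered is hypothetical, and not every server sends these headers.

```python
def change_triggered(last_seen, latest):
    """Return True if a validator present in both header sets has changed."""
    for name in ("ETag", "Last-Modified"):
        if name in last_seen and name in latest and last_seen[name] != latest[name]:
            return True
    return False


stored = {"ETag": '"abc"', "Last-Modified": "Mon, 01 Jan 2007 10:00:00 GMT"}
fresh = {"ETag": '"abd"', "Last-Modified": "Tue, 02 Jan 2007 09:30:00 GMT"}
print(change_triggered(stored, fresh))
```

When neither header is available, the monitor would have to fall back on another signal, such as comparing the fetched content itself.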
28. One Off Collection
In some cases a specific web resource may be selected for collection on a
one-off basis. This will typically apply to resources which have fixed content, such
as online publications. In addition, certain types of material may change over a
set period of time, and then stabilize in a fixed form.
Example:
The website for a government public inquiry may change rapidly while
the inquiry is in progress and new content is being added, but will then
become fixed once the inquiry has published its findings. In such cases it may be
considered appropriate to collect the site only once it has become fixed. If the
changes are significant or material is being removed as well as added, repeated
collection may also be required during the dynamic stage of the resource's
lifecycle.
29. Comprehensive Collection
It may be necessary in some cases to capture the complete lifecycle of
a dynamic and open-ended web resource. This is mostly required
where online transactions need to be preserved for evidential
purposes. In such a case collection for archival purposes will
need to be integrated within the website management workflow.
This is the least commonly applied approach.
30. Maintenance
The selection policy must be kept up to date; it should not be static. It should reflect
changes in internal and external factors, such as new organizational
priorities and developments in the World Wide Web. Equally, the collection list,
whether part of the selection policy or not, will clearly be dynamic. Feedback
from the quality assurance of web resources should be used to refine
the selection process.
Example:
1- When resources are collected, new resources may be identified
that need to be considered for selection.
2- There may be lessons to be learned about the availability of organizational
resources and infrastructure, including specific strengths or limitations of
the available collection technologies.
31. Maintenance
The regularity with which maintenance needs to be undertaken
will depend upon the selection method adopted and the
frequency of collection.
A clear and well-maintained selection policy is the most
important part of any web archiving program and an essential
prerequisite for building a coherent and meaningful collection.