SELECTION for Web Archiving Programme


  1. 1. SELECTION for Web Archiving Programme
  2. 2. Selection Policy
     Selection policy is determined by a number of different factors:
     - Remit and mission of the collecting organization
     - Intellectual property rights issues
     - Institutional resources available
  3. 3. Usage of Web Archiving Program
     A web archiving program can be implemented by a variety of institutions:
     - Libraries
     - Archives
     - Research organizations
     - Learned societies
     - Commercial organizations
  4. 4. Qualities of the Web
     - Interconnectedness
     - Dynamic nature
  5. 5. Objectives of this Topic
     • A model process for making selection decisions
     • The context in which the decision is going to take place
     • Possible approaches for selection
     • Selection criteria
     • Elements required to create the selection and collection list
  6. 6. The Selection Process
     The selection process can be broken down into smaller components.
     [Diagram: Policy Definition, Selection Policy, Maintenance, Selection, Collection List, Quality Assurance, Collection]
  7. 7. STAGES IN SELECTION PROCESS
     • Selection Policy
     • Selection
     • Maintenance
  8. 8. Selection Policy
     A well-defined selection policy is an essential foundation for any web archiving program. The nature of the policy depends upon the individual organizational requirements, but its formulation will typically require the following steps:
     - Context
     - Selection Methods
     - Selection Criteria
  9. 9. Selection Context
     An understanding of the broader context in which the selection policy is going to work is central to the formulation of the policy itself.
  10. 10. SELECTION METHODS
      A number of different approaches to selection are possible, which may be categorized according to their scope:
      - Unselective Approach
      - Selective Approach
      - Thematic Approach
  11. 11. Unselective Approach
      In this approach the decision is taken not to select, but rather to collect everything possible. It is based on the following main arguments:
      1- Complete contents of the resource (interconnectedness)
      2- Selection is expensive and time consuming
      3- Technical feasibility (deep web / surface web)
  12. 12. Thematic Approach
      This is also called the Semi-Selective Approach. In this approach four criteria may be considered:
      1- Subject: Selection according to the subject domain.
      2- Creator: Selection according to the name of the creator of the web resource, such as a government agency or a publisher.
      3- Genre: The scope of selection may be defined by the specific genre of resource, such as publications, blogs, web art or government records.
      4- Domain: The scope of selection is defined in terms of specific web domains, such as “.uk” or “.edu”.
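
A domain-based scope such as “.uk” or “.edu” can be checked mechanically. The sketch below is only an illustration of that idea: the function name `in_domain_scope` and the sample URLs are assumptions made for this example, not part of the slides.

```python
from typing import Tuple
from urllib.parse import urlsplit


def in_domain_scope(url: str, suffixes: Tuple[str, ...]) -> bool:
    """Return True if the URL's host name ends with one of the given suffixes."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host.endswith(suffix) for suffix in suffixes)


# Hypothetical candidate URLs, checked against the ".uk" and ".edu" scope.
scope = (".uk", ".edu")
candidates = [
    "http://www.example.co.uk/reports/",
    "http://library.example.edu/catalogue",
    "http://www.example.com/shop",
]
selected = [url for url in candidates if in_domain_scope(url, scope)]
print(selected)
```

A plain suffix test is the simplest possible rule; a real selection policy would likely combine it with the subject, creator and genre criteria listed above.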
  13. 13. Selective Approach
      The most narrowly defined selection method is to identify specific web resources for collection, such as a single web publication or website.
  14. 14. SELECTION CRITERIA
      Once the selection method is identified, a set of specific selection criteria is finalized. The criteria allow individual selection decisions to be made, which in turn are translated into a list of web resources to be collected. The criteria are based on three issues:
      - Content
      - Extent
      - Timing and Frequency
  15. 15. Content
      Criteria must be established to define the nature of the web resources eligible for selection in terms of their intellectual context.
      Extent
      Criteria must be established for determining the extent of selected resources.
      Example: It may be stated that no external links from websites will be collected.
      Timing and Frequency
      The timing and frequency of collection of each selected web resource should be clearly defined in the collection list. These may be influenced by a number of factors:
      1- Lifecycle
      2- Topicality / Significance
      3- Risk Assessment
      4- Rate of Content Change
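
The extent example (no external links are collected) could be applied to the links found on a page roughly as follows. This is a minimal sketch: the function name `internal_links`, the seed URL and the link list are all assumptions made for illustration, and "external" is taken here to mean "on a different host than the seed URL".

```python
from typing import List
from urllib.parse import urljoin, urlsplit


def internal_links(seed_url: str, links: List[str]) -> List[str]:
    """Keep only the links that stay on the same host as the seed URL."""
    seed_host = urlsplit(seed_url).hostname
    kept = []
    for link in links:
        absolute = urljoin(seed_url, link)  # resolve relative links first
        if urlsplit(absolute).hostname == seed_host:
            kept.append(absolute)
    return kept


# Hypothetical seed page and the links found on it.
seed = "http://www.example.org/index.html"
found = ["about.html", "http://www.example.org/news/", "http://other.example.com/page"]
in_scope = internal_links(seed, found)
print(in_scope)
```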
  16. 16. Lifecycle
      The nature of a web resource may be defined in terms of its active lifecycle, which may be open ended or of limited duration.
      Example: Many websites exist and evolve over an indefinite period of time, whereas some event-based websites have a planned completion point, after which the content becomes fixed and the website may even cease to be maintained.
  17. 17. Rate of Change
      The content of a web resource may be dynamic or fixed. Some websites or individual pages may remain static for months, whereas others may change many times per day.
      Example: A typical web page may be updated on a regular basis, whereas a journal article may be published to the web in a finished form. The rate of change will therefore be an important factor in determining the frequency with which a resource should be collected.
  18. 18. Risk Assessment
      The selection policy should identify the types of risk to be monitored that can affect specific resources.
      Example: The name and version of a web server can be identified through analysis of the HTTP header generated by the site. Automated tools can be used to monitor the availability of a website and track the frequency and duration of any downtime. The use of an outdated web server and the occurrence of frequent periods of downtime could be indicators of poor management practices and therefore signify a high degree of risk.
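
The two risk indicators mentioned in the example can be sketched in a few lines. The code below is an illustration only: the helper names `parse_server_header` and `downtime_ratio`, the sample header value and the sample availability log are assumptions, and the parsing assumes the common `name/version` form of the HTTP `Server` header.

```python
from typing import List, Mapping, Optional, Tuple


def parse_server_header(headers: Mapping[str, str]) -> Tuple[Optional[str], Optional[str]]:
    """Extract (name, version) from an HTTP Server header, if one is present."""
    for key, value in headers.items():
        if key.lower() == "server":
            tokens = value.split()
            product = tokens[0] if tokens else ""
            name, _, version = product.partition("/")
            return (name or None, version or None)
    return (None, None)


def downtime_ratio(checks: List[bool]) -> float:
    """Fraction of availability checks that failed (False means the site was down)."""
    if not checks:
        return 0.0
    return checks.count(False) / len(checks)


# Hypothetical response headers and availability log for one website.
sample_headers = {"Content-Type": "text/html", "Server": "Apache/2.4.57 (Unix)"}
server_name, server_version = parse_server_header(sample_headers)
ratio = downtime_ratio([True, False, True, False])
print(server_name, server_version, ratio)
```

Which server versions count as outdated, and how much downtime signals high risk, are policy decisions that the slides leave to the collecting organization.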
  19. 19. Topicality and Significance
      A major factor in determining the frequency of collection may be a subjective assessment of the topicality or underlying significance of a given resource.
      Example: The “National Archives” collects the majority of UK government websites on a biannual basis. However, for the duration of the Iraq conflict in 2004, it prioritized websites related to defense and foreign policy for high-intensity collection at weekly intervals.
  20. 20. Defining the Boundaries
      Once the selection policy is implemented, it will generate a list of web resources to be collected. This list may be contained in the selection policy itself, if it is static, or exist as a freestanding document, if it is dynamic. The boundaries of each selected web resource should be defined in the selection policy so that the resource can be collected.
      Web resources are defined in terms of a uniform resource locator (URL), which provides a unique address for that resource within the World Wide Web. A URL comprises the following elements:
      1- Scheme
      2- Domain
      3- Path
  21. 21. Scheme
      The scheme defines the format of the URL, which usually uses a communication protocol such as the Hypertext Transfer Protocol (HTTP) or the File Transfer Protocol (FTP).
      Scheme : http://
  22. 22. Domain
      The domain defines the host for the web resource. It comprises two or more labels separated by dots ‘.’ and is read from right to left.
      Domain Name: WWW.NATIONALARCHIVE.GOV.UK/
      The rightmost label is the top-level domain, which specifies either a country code (such as uk for United Kingdom) or a type of organization (such as .com for commercial organizations). The label to its left is the second-level domain, which generally describes the name of the hosting organization. Labels to the left of this may be used to define further domain and subdomain levels.
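
Reading a domain name from right to left can be illustrated by reversing its labels. This is a small sketch using the domain name shown on the slide; the function name `domain_levels` is an assumption made for this example.

```python
from typing import List


def domain_levels(domain: str) -> List[str]:
    """Return the labels of a domain name, top-level domain first."""
    labels = domain.strip("./").lower().split(".")
    return list(reversed(labels))


levels = domain_levels("WWW.NATIONALARCHIVE.GOV.UK/")
top_level, second_level = levels[0], levels[1]
print(levels)
```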
  23. 23. Domain Name
      The domain name must be translated into an Internet Protocol (IP) address, which uniquely identifies each host computer on the internet. This translation is performed by a Domain Name System (DNS) server, which maintains a record of domain names and IP addresses.
  24. 24. Path
      The path specifies the location of the web resource within the directory structure of the host web server and is read from left to right.
      PATH : preservation/webarchive/default.htm
      In the above example the URL points to a file called default.htm located within the directory path preservation/webarchive/, hosted in the “National” with the host web server WWW.
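
The scheme, domain and path elements described on the preceding slides can be pulled apart with `urlsplit` from Python's standard `urllib.parse` module. In the sketch below, the helper name `split_url` is an assumption, and the full URL is assembled from the fragments shown on the slides purely for illustration.

```python
from urllib.parse import urlsplit


def split_url(url: str) -> dict:
    """Break a URL into its scheme, domain and path elements."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "domain": parts.hostname or "",  # hostname is returned in lower case
        "path": parts.path.lstrip("/"),
    }


# URL assembled from the scheme, domain and path fragments on the slides (an assumption).
example = split_url(
    "http://WWW.NATIONALARCHIVE.GOV.UK/preservation/webarchive/default.htm"
)
print(example)
```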
  25. 25. Timings and Frequency of Collection
      The collection list must define the timing and the frequency with which each selected web resource is to be collected. Four basic scenarios are possible:
      - Repeated Collection
      - Ad-Hoc Collection
      - One-Off Collection
      - Comprehensive Collection
  26. 26. Repeated Collection
      In this approach the web resource is collected at repeated intervals. This approach is suitable for dynamic resources with open-ended lifecycles. Changes are captured by collecting static “snapshots” of the resource at the intervals the collection policy defines. Typical intervals are weekly, monthly or annually.
      A decision has to be made about which collection technique the selection policy will follow for dynamic resources. There are two kinds of collection technique:
      - Incremental
      - Complete
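
The slides name the incremental and complete techniques without defining them. The sketch below assumes the usual reading: a complete collection takes every resource at each interval, while an incremental collection takes only what is new or changed since the previous snapshot. The function names and the sample fingerprints are hypothetical.

```python
from typing import Dict, List


def complete_collection(current: Dict[str, str]) -> List[str]:
    """Complete technique (assumed meaning): collect every resource each time."""
    return sorted(current)


def incremental_collection(previous: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """Incremental technique (assumed meaning): collect only new or changed resources."""
    return sorted(
        path for path, fingerprint in current.items()
        if previous.get(path) != fingerprint
    )


# Hypothetical snapshots: path -> content fingerprint at the time of collection.
last_month = {"index.htm": "a1", "news.htm": "b1", "about.htm": "c1"}
this_month = {"index.htm": "a2", "news.htm": "b1", "about.htm": "c1", "report.htm": "d1"}

print(complete_collection(this_month))
print(incremental_collection(last_month, this_month))
```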
  27. 27. Ad-Hoc Collection
      Web resources may change at unpredictable rates. Where this is the case, repeated collection at a fixed frequency may prove inefficient, resulting in the repeated collection of the same content. An alternative approach is to collect in response to a trigger event, such as some form of automated or manual monitoring of the resource or an alert from some external source.
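
One possible form of automated monitoring is to compare a fingerprint of the current content with the fingerprint recorded at the last collection, and to trigger a new collection only when they differ. This is a sketch of that idea under stated assumptions: the function names and sample page contents are hypothetical, and SHA-256 is simply one convenient choice of fingerprint.

```python
import hashlib
from typing import Optional


def content_fingerprint(content: bytes) -> str:
    """Return a SHA-256 fingerprint of the resource content."""
    return hashlib.sha256(content).hexdigest()


def should_collect(previous: Optional[str], content: bytes) -> bool:
    """Trigger collection if nothing was collected before or the content changed."""
    return previous is None or previous != content_fingerprint(content)


# Hypothetical page contents observed on successive monitoring runs.
first_version = b"<html>inquiry opens</html>"
second_version = b"<html>inquiry publishes findings</html>"
stored = content_fingerprint(first_version)

print(should_collect(None, first_version))
print(should_collect(stored, first_version))
print(should_collect(stored, second_version))
```

A byte-level fingerprint also fires on trivial changes such as a rotating banner or timestamp, so a real monitor would probably need a more tolerant comparison.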
  28. 28. One-Off Collection
      In some cases a specific web resource may be selected for collection on a one-off basis. This will typically apply to resources which have fixed content, such as online publications. In addition, certain types of material may change over a set period of time and then stabilize in a fixed form.
      Example: The website for a government public inquiry may change rapidly while the inquiry is in progress and new content is being added, but will then become fixed once the inquiry has published its findings. In such cases it may be considered appropriate to collect the site only once it has become fixed. If the changes are significant, or material is being removed as well as added, repeated collection may also be required during the dynamic stage of the resource's lifecycle.
  29. 29. Comprehensive Collection
      It may be necessary in some cases to capture the complete lifecycle of a dynamic and open-ended web resource. This is mostly required where online transactions need to be preserved for evidential purposes. In such cases, collection for archival purposes will need to be integrated within the website management workflow. This is the least commonly applied selection approach.
  30. 30. Maintenance
      The selection policy should not be static; it must be kept up to date. It should reflect changes in internal and external factors, such as new organizational priorities and developments in the World Wide Web. Equally, the collection list, whether part of the selection policy or not, will clearly be dynamic. Feedback from the quality assurance of web resources should be used to refine the selection process.
      Example:
      1- When resources are collected, new resources may be identified that need to be considered for selection.
      2- There may be lessons to be learned about the availability of organizational resources and infrastructure, including specific strengths or limitations of the available collection technologies.
  31. 31. Maintenance
      The regularity with which maintenance needs to be undertaken will depend upon the selection method adopted and the frequency of collection.
      A clear and well-maintained selection policy is the most important part of any web archiving program and an essential prerequisite for building a coherent and meaningful collection.