COLLECTION METHODS
WWW = Interplay between Web Client + Web Server
A web server stores content in HTML pages or images, which it delivers/serves to a web browser in response to that browser's request. A web browser requests content from the web server and then presents the received content to the user.
Mechanism of Interaction
A protocol defines the standard format for communication between the server and the browser.
Example: The most commonly used protocol on the web is HTTP (Hypertext Transfer Protocol). When a browser sends a request to the web server, that request takes the form of an HTTP message, and the server's reply takes the same form.
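The shape of the HTTP message described above can be sketched in a few lines; the host and path are hypothetical examples, not a real request:

```python
# Compose the text of a minimal HTTP GET request, as a browser would
# send it to a web server. Host and path are illustrative only.
host = "www.example.com"
path = "/index.html"

request = (
    f"GET {path} HTTP/1.1\r\n"   # request line: method, path, version
    f"Host: {host}\r\n"          # mandatory Host header in HTTP/1.1
    "Connection: close\r\n"      # ask the server to close after replying
    "\r\n"                       # blank line ends the header section
)
print(request)
```

The server's reply follows the same message format, beginning with a status line such as `HTTP/1.1 200 OK`.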
URL (Uniform Resource Locator)
All content on a web server is identified by a uniform resource locator (URL): a reference which describes where the content is located on the web.
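A URL can be decomposed into the parts that locate content on a web server; this sketch uses Python's standard library and a hypothetical address:

```python
from urllib.parse import urlparse

# Break a URL into its component parts (the URL is an invented example).
parts = urlparse("http://www.example.com/products/new.html")

print(parts.scheme)  # the protocol: http
print(parts.netloc)  # the server: www.example.com
print(parts.path)    # the location within the server: /products/new.html
```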
Fundamental Categories of Collection
There are two types of collection techniques:
1- Content-driven collection methods
2- Event-driven collection methods
Content-Driven Collection Methods: seek to archive the underlying content of the website.
Event-Driven Collection Methods: collect the actual transactions that occur.
Further distinctions can be made based on the source from which the content is collected. It can be archived from the:
1- Web server (server-side collection)
2- Web browser (client-side collection)
Applicability of Approach
Depends upon the type of website:
- Dynamic websites
- Static websites
Static Websites
A static website consists of a series of pre-existing web pages, each of which is linked to from at least one other page. Each web page is typically composed of one or more individual elements. The structure is contained within the HTML document, which contains hyperlinks to other elements such as images and other pages. All elements of the website can be stored in a hierarchical folder structure on the web server, and the URL describes the location of each element within that structure.
Form of URL
The target of a hyperlink is normally specified in the "HREF" attribute of an HTML element and defines the URL of the target resource. The form of the URL may be absolute or relative, as the following examples illustrate.
Absolute and Relative
In an absolute URL, a fully qualified domain and path name are given:
<A HREF="http://www.mysite.com/products/new.html">New Products</A>
In a relative URL, only the path name relative to the source object is given:
<A HREF="new.html">New Products</A>
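How a browser (or crawler) resolves these two forms can be sketched with the standard library; the base page URL is a hypothetical example:

```python
from urllib.parse import urljoin

# The page in which the hyperlinks appear (an invented example).
base = "http://www.mysite.com/products/index.html"

# An absolute href is used as-is, regardless of the source page.
absolute = urljoin(base, "http://www.mysite.com/products/new.html")

# A relative href is resolved against the folder of the source page.
relative = urljoin(base, "new.html")

print(absolute)  # http://www.mysite.com/products/new.html
print(relative)  # http://www.mysite.com/products/new.html
```

Both forms resolve to the same target here, which is why a copied website works unchanged only if its links are relative.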
Dynamic Websites
In a dynamic website the pages are generated from smaller elements of content. When a request is received, the required elements are assembled into a web page and delivered. Types of dynamic content are:
- Databases
- Syndicated content
- Scripts
- Personalization
Databases
The content used to create web pages is often stored in a database, such as a content management system, and dynamically assembled into web pages.
Scripts
Scripts may be used to generate dynamic content, responding differently depending on the values of certain variables, such as the date, the type of browser making the request, or the identity of the user.
Syndicated Content
A website may include content which is drawn from external sources, such as pop-ups or RSS feeds, and then dynamically inserted into its web pages.
Personalization
Many websites make increasing use of personalization to deliver content which is customized to an individual user.
Example: Cookies may be used to store information about a user's computer, and are returned by their browser whenever it makes a request to that website.
Depending on the nature of a dynamic website, these virtual pages may be linked to from other pages, or may only be available through searching. Websites may contain both static and dynamic elements.
Example: The home page and other pages that change only infrequently may be static, whereas pages that are updated on a regular basis, such as a product catalogue, may be dynamic.
The Matrix of Collection Methods
The range of possible methods for collecting web content is dictated by these considerations. Four alternative collection methods are currently available.

Table 4.1 The Matrix of Collection Methods

              Content Driven                        Event Driven
Client Side   Remote Harvesting                     No method available
Server Side   Direct Transfer, Database Archiving   Transactional Archiving
Direct Transfer
The simplest method of collecting web resources is to acquire a copy of the data directly from the original source. This approach, which requires direct access to the host web server, and therefore the co-operation of the website owner, involves copying the selected resources from the web server and transferring them to the collecting institution, either on removable media such as a CD, or online using email or FTP.
Direct Transfer
Direct transfer is most suited to static websites which comprise only HTML documents and other objects stored in a hierarchical folder structure on the web server. The whole website, or a part of it, can be acquired simply by copying the relevant files and folders to the collecting institution's storage system. The copied website will function in precisely the same way as the original, but with two limitations:
- The hyperlinks must be relative, not absolute.
- Any search functionality in the original website will no longer be operable unless the appropriate search engine is installed in the new environment.
Strengths
The principal advantage of the direct transfer method is that it potentially offers the most authentic rendition of the collected website. By collecting from source, it is possible to ensure that the complete content is captured with its original structure. In effect, the collecting institution re-hosts a complete copy of the original website. The degree of authenticity which it is possible to recreate will depend upon the complexity of the technical dependencies, and the extent to which the collecting institution is capable of reproducing them.
Limitations
The major limitations of this approach are:
- The resources required to effect each transfer, and the sustainability of the supporting technologies.
- This method requires cooperation on the part of the website owner, to provide both the data and the necessary documentation.
Go through the Case Study: Bristol Royal Infirmary Inquiry (see page 48 of the book).
Database Archiving
The increasing use of web databases has made the development of new web archiving tools a priority, and such tools are now beginning to appear. The process of archiving database-driven sites involves three stages:
1- The repository defines a standard data model and format for archived databases.
2- Each source database is converted to that standard format.
3- A standard access interface is provided to the archived databases.
Database Format
The obvious technology to use for defining an archival database format is XML, an open standard specifically designed for representing data structures. Several tools are available which convert proprietary databases to XML format.
Tools for Conversion to XML
- SIARD (Swiss Federal Archives)
- DeepArc (Bibliothèque nationale de France)
Both of these tools allow the structure and content of a relational database to be exported into standard formats.
SIARD
The workflow of SIARD is:
1- It automatically analyses and maps the structure of the source database.
2- It exports the definition of the database structure as a text file containing the data definition, described using SQL.
3- The content is exported as plain text files, together with any large binary objects stored in the database, and the metadata is exported as an XML document.
4- The data can then be reloaded into any relational database management system to provide access.
DeepArc
- It enables a user to map the relational model of the original database to an XML schema, and then export the content of the database into an XML document.
- It is intended to be used by the database owner, since its use in any particular case requires detailed knowledge of the underlying structure of the database being archived.
Flow of Work of the DeepArc Tool
• First, the user creates a view of the database, called a skeleton, which is defined using XML.
• That skeleton describes the desired structure of the XML documents that will be generated from the database.
• The user then builds the associations to map the database to this view.
• This entails mapping both the database structure (i.e. the tables) and the content (i.e. the columns within those tables). Once these associations have been created and configured, the user can then export the content of the database into an XML document which conforms to the defined schema.
• If the collecting institution defines a standard XML data model for its archived databases, it can therefore use a tool such as DeepArc to transform each database to that structure.
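The general database-to-XML export that such tools perform can be sketched as follows. This is a minimal illustration, not DeepArc's or SIARD's own format: the table, columns, and XML element names are all hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# An in-memory stand-in for the source relational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER, name TEXT)")
conn.execute("INSERT INTO products VALUES (1, 'Widget'), (2, 'Gadget')")

# Map each table row to an XML element conforming to an invented schema.
root = ET.Element("products")
for row_id, name in conn.execute("SELECT id, name FROM products"):
    row = ET.SubElement(root, "product", id=str(row_id))
    row.text = name

xml_doc = ET.tostring(root, encoding="unicode")
print(xml_doc)
```

Once every source database is exported to one agreed XML structure, the archive only has to preserve and provide access to that single format.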
Strengths
It offers a generic approach to collecting and preserving database content, which avoids the problems of supporting multiple technologies incurred by the alternative approach of direct transfer. This limits issues of preservation and access to a single format, against which all resources can be brought to bear. For example, archives can use standard access interfaces such as that provided by the XINQ tool.
Limitations
• Web database archiving tools are a recent development and are therefore still technologically immature compared to some other collection methods.
• Support for the underlying technologies is currently limited.
• The nature and timing of collection is constrained.
• The original 'look and feel' is not preserved (the method collects the database content rather than the website as users see it).
• The active cooperation and participation of the website owner is required.
Remote Harvesting Technique
Remote harvesting is the most common and most widely employed method for collecting websites. It involves the use of web crawler software to harvest content from remote web servers. 'Crawlers' are software programs designed to interact with online services as a human user would, principally to gather the required content. Most search engines use these crawlers to collect and index web pages.
Web Crawler
A web crawler shares many similarities with a desktop web browser: it submits HTTP requests to a web server and stores the content that it receives in return. The actions of the web crawler are dictated by a list of URLs (or 'seeds') to visit. The crawler visits the first URL on the list, collects the web page, identifies all the hyperlinks within the page, and adds them to the seed list. In this way, a web crawler that begins on the home page of a website will eventually visit every linked page within that website. This is a recursive process and is normally controlled by certain parameters, such as the number of hyperlinks that should be followed.
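The crawl loop described above can be sketched as follows. To keep the sketch self-contained, an in-memory dictionary of invented pages stands in for the remote web server that a real crawler would fetch over HTTP.

```python
from html.parser import HTMLParser

# A fake "web server": URL -> HTML content (all values are invented).
SITE = {
    "http://example.com/":  '<a href="http://example.com/a">A</a>',
    "http://example.com/a": '<a href="http://example.com/">home</a>',
}

class LinkParser(HTMLParser):
    """Collect the href targets of <a> elements in a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += [v for k, v in attrs if k == "href"]

def crawl(seeds, max_pages=100):
    queue, archived = list(seeds), {}
    while queue and len(archived) < max_pages:   # size limit parameter
        url = queue.pop(0)
        if url in archived or url not in SITE:   # skip visited / out-of-scope
            continue
        html = SITE[url]                         # "fetch" the page
        archived[url] = html                     # store the collected content
        parser = LinkParser()
        parser.feed(html)
        queue.extend(parser.links)               # add discovered links to seeds
    return archived

pages = crawl(["http://example.com/"])
print(sorted(pages))
```

Starting from the home page seed, the loop discovers and archives every linked page, and the `max_pages` parameter plays the role of the limits discussed below.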
Infrastructure
The infrastructure required to operate a web crawler can be minimal; the software simply needs to be installed on a computer system with an available internet connection and sufficient storage space for the collected data. However, in most large-scale archiving programmes, the crawler software is deployed on networked servers with attached disk or tape storage.
Types of Web Crawlers
There is a wide variety of web crawler software available, both proprietary and open source. The three most widely used web crawlers are:
1- HTTrack
2- NEDLIB Harvester
3- Heritrix
These crawlers were already discussed in the first lecture and will not be discussed again here.
Parameters
Web crawlers provide a number of parameters that can be set to specify their exact behaviour. Many crawlers are highly configurable, offering a very wide variety of settings. Most crawlers provide variations on the following parameters:
- Connection settings
- Crawl settings
- Collection settings
- Storage settings
- Scheduling settings
Connection Settings
These settings relate to the manner in which the crawler connects to web servers.
- Transfer rate: the maximum rate at which the crawler will attempt to transfer data. A specific transfer rate is set so that data is captured at a sufficient rate to enable an entire site to be collected in a reasonable timescale.
- Connections: the number of simultaneous connections the web crawler may attempt to make with a host, or the delay between establishing connections.
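The delay-between-connections setting can be sketched as a simple throttle. The delay value and the `fetch` callable are illustrative assumptions, not any particular crawler's API:

```python
import time

DELAY_SECONDS = 0.01  # illustrative delay between requests to one host

def polite_fetch(urls, fetch):
    """Fetch each URL in turn, pausing between requests so the
    crawler does not overload the host web server."""
    results = []
    for url in urls:
        results.append(fetch(url))
        time.sleep(DELAY_SECONDS)  # enforce the connection delay
    return results

# Usage with a stand-in fetch function (no real network access).
pages = polite_fetch(["u1", "u2"], fetch=lambda u: f"content of {u}")
print(pages)
```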
Crawl Settings
These settings allow the user to control the behaviour of the crawler as it traverses a website, such as the direction and depth of the crawl:
- Link depth and limits
- Robot exclusion notices
- Link discovery
Settings will normally be available to control the size and duration of the crawl. For example, it may be desirable to halt a crawl after it has collected a given volume of data, or within a given timeframe.
• Link depth and limits: This determines the number of links that the crawler should follow away from its starting point, and the direction in which it should move. It is possible to limit the crawler in terms of whether it is restricted to following links within the same path, website, or domain, and to what depth.
• Robot exclusion notices: A robot exclusion notice is a method used by websites to control the behaviour of robots such as web crawlers. It uses a standard protocol to define which parts of a website are accessible to the robot. These rules are contained within a 'robots.txt' file in the top-level folder of the website.
• Link discovery: The user may also be able to configure how the crawler analyses hyperlinks: links may be dynamically constructed by scripts, or hidden within content such as Flash files, and are therefore not transparent to the crawler. However, more sophisticated crawlers can be configured to discover many of these hidden links.
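Honouring a robots.txt file can be sketched with Python's standard library. The rules below are a hypothetical example, parsed directly rather than fetched from a real site:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt forbidding robots from the /private/ area.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A well-behaved crawler checks each URL before requesting it.
print(rp.can_fetch("MyCrawler", "http://example.com/index.html"))  # True
print(rp.can_fetch("MyCrawler", "http://example.com/private/x"))   # False
```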
Collection Settings
These settings allow the user to fine-tune the behaviour of the crawler, and particularly to determine the content that is collected. Filters can be defined to include or exclude certain paths and file types.
For example: to exclude links to pop-up advertisements, or to collect only links to PDF files. Filters may also be used to avoid 'crawler traps', whereby the crawler becomes locked into an endless loop, by detecting repeating patterns of links. The user may also be able to place limits on the maximum size of files to be collected.
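Such include/exclude filters and size limits can be sketched as follows; the patterns and the size cap are invented examples, not any crawler's defaults:

```python
import re

# Hypothetical exclusion filters: advert paths and pop-up pages.
EXCLUDE = [re.compile(r"/ads/"), re.compile(r"popup")]
MAX_BYTES = 10 * 1024 * 1024  # illustrative 10 MB file-size cap

def should_collect(url, size):
    """Apply the collection settings to one candidate resource."""
    if size > MAX_BYTES:                              # size limit
        return False
    return not any(p.search(url) for p in EXCLUDE)    # path filters

print(should_collect("http://example.com/report.pdf", 5000))     # True
print(should_collect("http://example.com/ads/banner.gif", 100))  # False
```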
Storage Settings
These settings determine how the crawler stores the collected content. By default, most crawlers will mirror the original structure of the website, building a directory structure which corresponds to the original hierarchy. However, it may be possible to dictate other options, such as forcing all images to be stored in a single folder. These options are unlikely to be useful in most web archiving scenarios, where preservation of the original structure will be considered desirable. The crawler can also rewrite hyperlinks, for example to convert absolute links into relative links.
Scheduling Settings
Tools such as PANDAS, which provide workflow capabilities, allow the scheduling of crawls to be controlled. Typical parameters include:
- Frequency: e.g. daily or weekly
- Dates: the start or commencement of the process
- Non-scheduled dates: it may also be possible to define specific dates for crawling in addition to the standard schedule.
Identifying the Crawler
Software agents such as web browsers and crawlers identify themselves to the online services with which they connect through a 'user agent' identifier within the HTTP headers of the requests they send. Thus, Internet Explorer 6.0 identifies itself with the user agent Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1). The user agent string displayed by a web crawler can generally be modified by the user.
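Setting a custom user agent can be sketched with the standard library; the crawler name and institution in the string are hypothetical:

```python
import urllib.request

# Build a request whose User-Agent header identifies the crawler and
# its operating institution (both values are invented examples).
req = urllib.request.Request(
    "http://example.com/",
    headers={"User-Agent": "MyArchiveCrawler/1.0 (Example Institution)"},
)

# Inspect the header that would be sent (no network access occurs here).
print(req.get_header("User-agent"))
```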
Advantages of Identification
There are three advantages to this identification:
1- The crawler identifies the institution on whose behalf it is operating.
2- Web servers may be configured to block certain user agents, including web crawlers and search engine robots. Defining a more specific user agent can prevent such blocking, even if using a crawler that would otherwise be blocked.
3- Some websites are designed to display correctly only in certain browsers, and check the user agent in any HTTP request accordingly. User agents which do not indicate correct browser compatibility will then be redirected to a warning page.
Strengths
The greatest strengths of remote harvesting are:
- Ease of use
- Flexibility
- Widespread applicability
- The availability of a number of mature software tools
- A remote harvesting programme can be established very quickly, and allows a large number of websites to be collected in a relatively short period.
- The infrastructure requirements are relatively simple, and the method requires no active participation from website owners: the process is entirely in the control of the archiving body.
- Most web crawler software is comparatively straightforward to use, and can be operated by non-technical staff with some training.
Limitations
- Careful configuration is required.
- Dynamic content cannot be collected.
- The large volumes of data that can be harvested at high speed must all be stored and managed, which can itself be a drawback.
Transactional Archiving
Transactional archiving is a fundamentally different approach from any of those previously described, being event-driven rather than content-driven.
• It collects the actual transactions which take place between a web server and a web browser. It is primarily used as a means of preserving evidence of the content which was actually viewed on a particular website on a given date. This may be particularly important for organizations which need to comply with legal or regulatory requirements for disclosing and retaining information.
• A transactional archiving system typically operates by intercepting every HTTP request to, and response from, the web server, filtering each response to eliminate duplicate content, and permanently storing the responses as bitstreams. A transactional archiving system requires the installation of software on the web server, and cannot therefore be used to collect content from a remote website.
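The duplicate-filtering step can be sketched as follows. This is a generic illustration of the idea, not any specific product's implementation: each intercepted response body is hashed, and only responses with a previously unseen hash are stored.

```python
import hashlib

# The archive store: content hash -> (URL, response body).
archive = {}

def record_response(url, body: bytes):
    """Store an intercepted HTTP response unless an identical body
    has already been archived. Returns True if the body was novel."""
    digest = hashlib.sha256(body).hexdigest()
    if digest not in archive:
        archive[digest] = (url, body)  # permanently store novel content
        return True
    return False                       # duplicate, already archived

print(record_response("http://example.com/", b"<html>v1</html>"))  # True
print(record_response("http://example.com/", b"<html>v1</html>"))  # False
print(record_response("http://example.com/", b"<html>v2</html>"))  # True
```

Hashing means the archive grows only when the served content actually changes, even if the same page is viewed thousands of times.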
Example of a Transactional Archive: pageVault
• pageVault supports the archiving of all unique responses generated by a web server.
• It allows you to know exactly what information you have published on your website, whether static pages or dynamically generated content, regardless of format (HTML, XML, PDF, zip, Microsoft Office formats, images, sound) and regardless of rate of change.
• Although every unique HTTP response can be archived and indexed, you can define non-material content (such as the current date/time and trivial site personalisation) on a per-URL, directory, or regular-expression basis, which pageVault will exclude when calculating the novelty of a response.
StrengthsThe great strength of transactional archiving is that itcollects what is actually viewed. It offers the best optionfor collecting evidence of how a website was used, andwhat content was actually available at any givenmoment. It can be a good solution for archiving certainkinds of dynamic website.
Limitations
- Transactional collection does not collect content which has never been viewed by a user.
- Because transactional collection takes place on the web server, it cannot capture variations in the user experience which are introduced by the web browser.
- Transactional archiving must take place server-side, and therefore requires the active cooperation of the website owner.
- The time taken for the server to process and respond to each request will be longer.