The Development of Web Archiving
Dr. Essam Obaid

Definition of Web Archiving
“Web archiving is the process of collecting portions of the World Wide Web and ensuring the collection is preserved in an archive,” such as an archive site, for future researchers, historians, and the public.
• Due to the massive size of the Web, web archivists typically employ web crawlers for automated collection.
• The largest web archiving organization based on a crawling approach is the Internet Archive, which strives to maintain an archive of the entire Web.
• National libraries, national archives, and various consortia of organizations are also involved in archiving culturally important Web content.
• Commercial web archiving software and services are also available to organizations that need to archive their own web content for corporate heritage, regulatory, or legal purposes.

Web Crawlers
A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner. Other terms for Web crawlers are ants, automatic indexers, bots, Web spiders, Web robots, and Web scutters.
• This process is called Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, which indexes the downloaded pages to provide fast searches.
• Crawlers can also be used to gather specific types of information from Web pages, such as harvesting e-mail addresses.
• A Web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.
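The seed-and-frontier loop described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular crawler is implemented: the `TOY_WEB` dictionary and the `example` URLs are hypothetical stand-ins for real pages, and a queue (breadth-first order) is assumed in place of a full set of crawl policies.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the hyperlinks found on that page.
# A real crawler would download the page and extract its links instead.
TOY_WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/", "http://d.example/"],
    "http://c.example/": ["http://a.example/"],
    "http://d.example/": [],
}

def crawl(seeds, get_links, max_pages=100):
    """Visit pages starting from the seeds and return them in visit order."""
    frontier = deque(seeds)   # the crawl frontier: URLs still to visit
    seen = set(seeds)         # URLs already added to the frontier at some point
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:      # never queue the same URL twice
                seen.add(link)
                frontier.append(link)
    return visited

order = crawl(["http://a.example/"], lambda url: TOY_WEB.get(url, []))
print(order)
```

The `seen` set is what stops the crawl from looping forever when pages link back to each other, as `c.example` does here.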

WHAT IS A SEARCH ENGINE
“Search engines, such as Google and HotBot, consist of a software package that crawls the Web and extracts and organizes the data in a database. People can then submit a search query using a Web browser. The search engine locates the appropriate data in the database and displays it via the browser.”
Search engines have three major elements:
• The spider, also called the crawler, harvester, robot, or gatherer. The spider visits a Web page, reads it, and then follows links to other pages within the site. The spider returns to the site on a regular basis, such as every month or two, to look for changes.
• The index. Everything the spider finds goes into the index. The index is like a giant book containing a copy of every web page that the spider finds. If a web page changes, this book is updated with the new information.
• Search engine software. This is the program that sifts through the millions of pages recorded in the index to find matches to a search and ranks them in order of what it believes is most relevant.
Search engine software is also available to run on a local Web site. The software has the same basic components, but the spider visits only the local site or a limited number of sites in a community.
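To make the index and the search software more concrete, here is a small Python sketch. It assumes an inverted index (word to pages) rather than the "giant book" of full page copies described above, and it ranks by simple word counts; the page texts and `example` URLs are invented for illustration, and real search engines use far more elaborate ranking.

```python
from collections import defaultdict

def build_index(pages):
    """Map each word to the pages containing it, with a per-page occurrence count."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word][url] = index[word].get(url, 0) + 1
    return index

def search(index, query):
    """Return pages containing every query word, most occurrences first."""
    terms = query.lower().split()
    if not terms:
        return []
    # Start with pages matching the first term, then narrow down term by term.
    candidates = set(index.get(terms[0], {}))
    for term in terms[1:]:
        candidates &= set(index.get(term, {}))

    def score(url):
        return sum(index[term][url] for term in terms)

    # Sort by descending score; ties are broken alphabetically by URL.
    return sorted(candidates, key=lambda url: (-score(url), url))

# Hypothetical pages, as if collected by the spider.
pages = {
    "http://a.example/": "web archive web crawler",
    "http://b.example/": "web archive",
    "http://c.example/": "library catalog",
}
index = build_index(pages)
results = search(index, "web archive")
print(results)
```

When a page changes, re-running `build_index` on the new text plays the role of "updating the book".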

Web Crawler Behavior
The behavior of a Web crawler is the outcome of a combination of policies:
• A selection policy that states which pages to download,
• A re-visit policy that states when to check for changes to the pages,
• A politeness policy that states how to avoid overloading Web sites, and
• A parallelization policy that states how to coordinate distributed Web crawlers.
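Two of these policies can be illustrated with a short Python sketch. The class below is hypothetical: it assumes a selection policy that only accepts a fixed set of hosts and a politeness policy that enforces a minimum delay between requests to the same host. The re-visit and parallelization policies are left out, and time is passed in explicitly so the example stays deterministic.

```python
from urllib.parse import urlsplit

class PoliteScheduler:
    """Toy selection + politeness policy (names and rules are illustrative)."""

    def __init__(self, allowed_hosts, min_delay_seconds):
        self.allowed_hosts = set(allowed_hosts)
        self.min_delay = min_delay_seconds
        self.next_allowed = {}   # host -> earliest time of the next request

    def selected(self, url):
        """Selection policy: download only pages on the allowed hosts."""
        return urlsplit(url).hostname in self.allowed_hosts

    def wait_time(self, url, now):
        """Politeness policy: seconds to wait before this URL may be fetched."""
        host = urlsplit(url).hostname
        return max(0.0, self.next_allowed.get(host, 0.0) - now)

    def record_fetch(self, url, now):
        """Note that the host was contacted at time `now`."""
        host = urlsplit(url).hostname
        self.next_allowed[host] = now + self.min_delay

scheduler = PoliteScheduler({"a.example"}, min_delay_seconds=2.0)
scheduler.record_fetch("http://a.example/index.html", now=10.0)
print(scheduler.wait_time("http://a.example/about.html", now=10.5))
```

A crawler loop would skip URLs for which `selected` is false and sleep for `wait_time` seconds before each request.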

High Level Architecture of a Web Crawler
Web crawlers are a central part of search engines, and details on their algorithms and architecture are kept as business secrets.

Internet Archive
“The Internet Archive is a non-profit digital library with the stated mission of "universal access to all knowledge." It offers permanent storage and access to collections of digitized materials, including websites, music, moving images, and books. The Internet Archive was founded by Brewster Kahle in 1996.”
• The Archive has offices in San Francisco, California, USA, and data centers in San Francisco, Redwood City, and Mountain View, California, USA. Its largest collection is its web archive, "snapshots of the World Wide Web."
• The Archive allows the public to both upload and download digital material to its data cluster, and provides unrestricted online access to that material at no cost. The Archive also oversees one of the world's largest book digitization projects. It is a member of the American Library Association and is officially recognized by the State of California as a library.

Brewster Kahle founded the Archive in 1996, at the same time that he began the for-profit web crawling company Alexa Internet. The Archive began archiving the World Wide Web in 1996, but it did not make this collection available until 2001, when it developed the Wayback Machine. The Internet Archive now includes texts, audio, moving images, and software. It hosts a number of other projects: the NASA Images Archive, the contract crawling service Archive-It, and the wiki-editable library catalog and book information site Open Library.
According to its website: "Most societies place importance on preserving artifacts of their culture and heritage. Without such artifacts, civilization has no memory and no mechanism to learn from its successes and failures. Our culture now produces more and more artifacts in digital form. The Archive's mission is to help preserve those artifacts and create an Internet library for researchers, historians, and scholars."

Wayback Machine
The Internet Archive uses the name "Wayback Machine" for its service that allows archives of the World Wide Web to be searched and accessed. This service allows users to see archived versions of web pages from the past. Millions of websites and their associated data (images, source code, documents, etc.) are saved in a gigantic database. The service can be used to see what previous versions of websites looked like, to grab original source code from websites that may no longer be directly available, or to visit websites that no longer exist. Not all websites are available, however, because many website owners choose to exclude their sites.

Web Archiving Techniques
The most common web archiving technique uses web crawlers to automate the process of collecting web pages. Web crawlers typically view web pages in the same manner that users with a browser see the Web, and therefore provide a comparatively simple method of remotely harvesting web content. Examples of web crawlers frequently used for web archiving include:
• Automated Internet Sessions in biterScripting
• Heritrix
• HTTrack
• Wget

Heritrix
• Heritrix is the Internet Archive’s web crawler, which was specially designed for web archiving. It is open-source and written in Java. The main interface is accessible using a web browser, and there is a command-line tool that can optionally be used to initiate crawls.
• Heritrix was developed jointly by the Internet Archive and the Nordic national libraries on specifications written in early 2003. The first official release was in January 2004, and it has been continually improved by employees of the Internet Archive and other interested parties.

Organizations Using Heritrix
A number of organizations and national libraries are using Heritrix, among them:
- Bibliothèque nationale de France
- British Library
- National Library of Finland
- National Library of New Zealand

Bibliothèque Nationale de France
The Bibliothèque nationale de France (BnF) is the National Library of France, located in Paris. It is intended to be the repository of all that is published in France. The current president of the library is Bruno Racine.

British Library
The British Library is the national library of the United Kingdom, and one of the world's largest libraries in terms of total number of items. The library is a major research library, holding over 150 million items from every country in the world, in virtually all known languages and in many formats, both print and digital: books, manuscripts, journals, newspapers, magazines, sound and music recordings, videos, play-scripts, patents, databases, maps, stamps, prints, and drawings. The Library's collections include around 14 million books.

ARC FILE
• Heritrix by default stores the web resources it crawls in an ARC file. This format has been used by the Internet Archive since 1996 to store its web archives. The WARC file format, similar to ARC but more precisely specified and flexible, can also be used. Heritrix can also be configured to store files in a directory format similar to that of the Wget crawler, which uses the URL to name the directory and filename of each resource.
• An ARC file stores multiple archived resources in a single file in order to avoid managing a large number of small files. The file consists of a sequence of URL records, each with a header containing metadata about how the resource was requested, followed by the HTTP header and the response.
Example:
    filedesc://IA-2006062.arc 0.0.0.0 20060622190110 text/plain 76
    1 1 InternetArchive
    URL IP-address Archive-date Content-type Archive-length

    http://foo.edu:80/hello.html 127.10.100.2 19961104142103 text/html 187
    HTTP/1.1 200 OK
    Date: Thu, 22 Jun 2006 19:01:15 GMT
    Server: Apache
    Last-Modified: Sat, 10 Jun 2006 22:33:11 GMT
    Content-Length: 30
    Content-Type: text/html

    <html>
    Hello World!!!
    </html>
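As a rough illustration of the URL record header in the example above, the following Python sketch splits one header line into the five fields named in the file description (URL, IP-address, Archive-date, Content-type, Archive-length). The function and field names are made up for this example; it is not a full ARC reader, and it assumes the fields are separated by single spaces as shown.

```python
from collections import namedtuple
from datetime import datetime

# Field names follow the "URL IP-address Archive-date Content-type Archive-length"
# line of the example; the Python identifiers themselves are illustrative.
ArcRecordHeader = namedtuple(
    "ArcRecordHeader",
    "url ip_address archive_date content_type archive_length",
)

def parse_arc_header(line):
    """Parse one space-separated URL record header line."""
    fields = line.strip().split(" ")
    if len(fields) != 5:
        raise ValueError("expected 5 fields, got %d" % len(fields))
    url, ip_address, date, content_type, length = fields
    return ArcRecordHeader(
        url=url,
        ip_address=ip_address,
        # Archive-date is a 14-digit timestamp: YYYYMMDDhhmmss.
        archive_date=datetime.strptime(date, "%Y%m%d%H%M%S"),
        content_type=content_type,
        # Archive-length is the size in bytes of the data that follows the header.
        archive_length=int(length),
    )

header = parse_arc_header(
    "http://foo.edu:80/hello.html 127.10.100.2 19961104142103 text/html 187"
)
print(header.url, header.archive_date.isoformat(), header.archive_length)
```

A reader would use `archive_length` to know how many bytes to read before the next record header begins.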

Screenshot of Heritrix Admin Console
Stable release: 3.0.0 / December 5, 2009 (14 months ago)
Written in: Java
Operating system: Linux/Unix-like/Windows (unsupported)
Type: Web crawler
License: GNU Lesser General Public License
Website: http://crawler.archive.

Database Archive
Database archiving refers to methods for archiving the underlying content of database-driven websites. It typically requires the extraction of the database content into a standard schema, often using XML. Once stored in that standard format, the archived content of multiple databases can then be made available using a single access system.

Transactional Archiving
Transactional archiving is an event-driven approach, which collects the actual transactions that take place between a web server and a web browser. It is primarily used as a means of preserving evidence of the content that was actually viewed on a particular website on a given date. This may be particularly important for organizations that need to comply with legal or regulatory requirements for disclosing and retaining information.
A transactional archiving system typically operates by intercepting every HTTP request to, and response from, the web server, filtering each response to eliminate duplicate content, and permanently storing the responses as bit streams.
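The filter-and-store step of transactional archiving can be sketched as follows. This is a simplified, hypothetical model: it assumes duplicates are detected by hashing the response body with SHA-256, keeps everything in memory rather than in permanent storage, and leaves out the actual interception of HTTP traffic.

```python
import hashlib

class TransactionalArchive:
    """Toy store that keeps each distinct response body only once."""

    def __init__(self):
        self.blobs = {}   # digest -> response bytes (the stored bit stream)
        self.log = []     # (timestamp, url, digest) for every transaction seen

    def record(self, timestamp, url, body):
        """Log one response; return True if its content was not stored before."""
        digest = hashlib.sha256(body).hexdigest()
        is_new = digest not in self.blobs
        if is_new:
            self.blobs[digest] = body   # store new content exactly once
        # Every transaction is logged, so the archive can still show
        # what was served at a given URL on a given date.
        self.log.append((timestamp, url, digest))
        return is_new

archive = TransactionalArchive()
archive.record("2011-02-01T10:00:00", "http://site.example/", b"<html>v1</html>")
archive.record("2011-02-01T10:05:00", "http://site.example/", b"<html>v1</html>")
archive.record("2011-02-02T09:00:00", "http://site.example/", b"<html>v2</html>")
print(len(archive.log), "transactions,", len(archive.blobs), "stored responses")
```

Separating the log from the stored content is what lets duplicate responses be eliminated without losing the record of when they were served.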

IIPC
The International Internet Preservation Consortium (IIPC) is an international organization of libraries that coordinates efforts to preserve internet content for the future. Membership is open to archives, museums, libraries, and cultural heritage institutions.
Its membership includes:
• Austrian National Library
• Biblioteka Narodowa
• Bibliothèque et Archives nationales du Québec
• Bibliothèque nationale de France
• British Library
• California Digital Library
• Clementinum
• German National Library
• Institut national de l'audiovisuel
• Internet Archive
• Koninklijke Bibliotheek (National Library of the Netherlands)
• Library and Archives Canada
• National and University Library in Zagreb
• National and University Library of Iceland
• National and University Library of Slovenia
• National Diet Library
• National Library Board
• National Library of Australia
• National Library of Catalonia
• National Library of China
• National Library of Finland
• National Library of Israel
• National Library of Korea
• National Library of New Zealand
• National Library of Norway
• National Library of Poland
• National Library of Scotland
• National Library of Sweden
• Royal Netherlands Academy of Arts and Sciences
• Swiss National Library
• The National Archives
• United States Government Printing Office
• WebCite

Pandora Archive
• PANDORA, Australia's Web Archive, is the national web archive for the preservation of Australia's online publications. It was established by the National Library of Australia in 1996, and is now built in collaboration with a number of other Australian state libraries and cultural collecting organizations, including the Australian Institute of Aboriginal and Torres Strait Islander Studies, the Australian War Memorial, and the National Film and Sound Archive.
• The PANDORA Archive collects selected Australian web resources, preserves them, and makes them available for viewing. Access to the archive is made available to the public via the Pandora web site. Web sites are selected based on their cultural significance and long-term research value.

Difficulties and Limitations: Crawlers
Web archives that rely on web crawling as their primary means of collecting the Web are subject to the difficulties of web crawling:
• The Web is so large that crawling a significant portion of it takes a large amount of technical resources.
• The Web is changing so fast that portions of a website may change before a crawler has even finished crawling it.
However, it is important to note that a native-format web archive, i.e. a fully browsable web archive with working links, media, etc., is only really possible using crawler technology.

Difficulties and Limitations: General Limitations
Not only must web archivists deal with the technical challenges of web archiving, they must also contend with intellectual property laws. Peter Lyman states that "although the Web is popularly regarded as a public domain resource, it is copyrighted; thus, archivists have no legal right to copy the Web". However, national libraries in many countries do have a legal right to copy portions of the web under an extension of legal deposit.
Some private non-profit web archives that are made publicly accessible, such as WebCite or the Internet Archive, allow content owners to hide or remove archived content that they do not want the public to have access to. Other web archives are only accessible from certain locations or have regulated usage.