1. Indexing your web server(s)
Helen Varley Sargan
University of Cambridge Computing Service
Institutional Webmasters Workshop, 7-9 September 1999
2. Why create an index?
• Helps users (and webmasters) to find things
• …but isn’t a substitute for good navigation
• Gives cohesion to a group of unrelated servers
• Observation of logs gives information on what
people are looking for - and what they are
having trouble finding (a sketch of this follows below)
• You are already being part-indexed by many
search engines, unless you have taken specific
action against it
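That log observation can be put into practice with a one-line pipeline. A rough sketch, assuming the search script receives queries as GET /search?words=... and is logged at the path shown (both the script name and log path are hypothetical):

  # count the most frequent search terms in the web server's access log
  grep 'GET /search' /var/log/httpd/access_log \
    | sed 's/.*words=\([^& "]*\).*/\1/' \
    | sort | uniq -c | sort -rn | head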
3. Current situation
Name        Total
ht://Dig       25
Excite         19
Microsoft      12
Harvest         8
Ultraseek       7
SWISH           5
Webinator       4
Netscape        3
wwwwais         3
FreeFind        2
Other          13
None           59
Based on UKOLN survey of search engines used in 160 UK HEIs carried out in July/Aug 1999. Report to be published in Ariadne issue 21. See <http://www.ariadne.ac.uk/>.
4. Current situation questions
• Is the version of Muscat used by Surrey the
free version that was available for a time (but
no longer is)?
• Are the users of Excite quite happy with its
security and with the fact that development
seems to have ceased?
• Are users of local search engines that don't
use robots.txt happy with what other search
engines can index on their sites (you have got
a robots.txt file, haven't you?)
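Checking is easy: the file lives at the root of each server, so you can fetch it directly with any command-line HTTP client (the hostname here is illustrative):

  curl http://www.example.ac.uk/robots.txt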
5. Types of tool
• External services are robots
• Tools you install yourself fall into two main
categories (some will work both ways)
– direct indexes of local and/or networked file
structure
– robot- or spider-based, following instructions
from the robots.txt file on each web server
indexed
• The programs come either in a form you have
to compile yourself or precompiled for your
OS, or they are written in Perl or Java and so
need a Perl interpreter or Java runtime to function.
6. Controlling robot access 1
• All of our web servers are being part-indexed
by external robots
• Control of external robots and a local robot-
mediated indexer is by the same route
– a robots.txt file to give access
information
– meta tags for robots in each HTML file,
permitting or excluding indexing and
link-following (examples below)
– Meta tags in each HTML file giving
description and keywords
• The first two controls are observed by all the
major search engines. Some search engines do
not observe description and keyword meta
tags.
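A minimal sketch of these controls, with illustrative paths and content (the exclusions and keywords are examples, not recommendations). The robots.txt file is served from the server root:

  # robots.txt - applies to all robots
  User-agent: *
  Disallow: /cgi-bin/
  Disallow: /private/

and the meta tags go in the <head> of each individual HTML file:

  <meta name="robots" content="noindex,nofollow">
  <meta name="description" content="A one-sentence summary of the page">
  <meta name="keywords" content="indexing, robots, web servers">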
7. Controlling robot access 2
• Some patchy support for Dublin Core metadata
• Access to branches of the server can be
limited by the server software - by combining
access control with metadata you can give
limited information to some users and more to
others.
• If you don’t want people to read files, either
password-protect that section of the server or
remove the files. Limiting robot access to a
directory can make nosey users flock to see
what’s inside.
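As a sketch of the password-protection option, assuming an Apache server (the paths and realm name are illustrative):

  # httpd.conf - restrict one branch of the server to known users
  <Directory /usr/local/www/private>
      AuthType Basic
      AuthName "Internal pages"
      AuthUserFile /usr/local/etc/httpd/passwd
      require valid-user
  </Directory>

Users are then added to the password file with the htpasswd utility supplied with Apache.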
8. Security
• There has been a security problem with
indexing software (the free version of Excite, in 1998)
• Remember the security of the OS the indexing
software is running under - keep all machines
up-to-date with security patches whether they
are causing trouble or not.
• Seek help with security if you are not an expert
in the OS, particularly with Unix or Windows
NT
9. What tool to use? 1
• Find out if any money, hardware and/or staff
are available for the project first
• Make a shopping list of your requirements and
conditions
– hosting the index (where)?
– platform (available and desirable)?
– how many servers (and/or pages) will I
index?
– is the indexed data very dynamic?
– what types of files do I want indexed?
– what kind of search (keyword, phrase,
natural language, constrained)?
• Are you concerned about how you are indexed
by others?
10. What tool to use? 2
• Equipped with the answers to the previous
questions, you will be able to select a suitable
category of tool
• If you are concerned about how others index
your site, install a local robot- or spider-based
indexer and look at indexer control measures
• Free externally hosted services suit very small
needs
• Free tools (mainly Unix-based) suit the
technically literate; some server software also
has indexing built in
• Commercial tools cover a range of platforms
and pocket-depths but vary enormously in
features
11. Free externally hosted services
• Services will be limited in the number of pages
indexed, possibly in the number of times the
index is accessed, and the index may be deleted
if not used for a certain number of days (5-7)
• Very useful for small sites and/or those with
little technical experience or resources
• Access is prey to Internet traffic (most services
are in the US) and to server availability, and
for UK users incoming transatlantic traffic will
be charged for
• You may have to have advertising on your
search page as a condition of use
12. Free tools - built in
• Microsoft, Netscape, WebStar, WebTen and
WebSite Pro all come with built-in indexers
(others may too)
• With any or all of these there may be problems
indexing some other servers, since they all use
vendor-specific APIs (they may receive
responses from other servers that they can’t
interpret). Problems become more likely as the
number and variety of server types being
indexed grows
13. Free tools - installed
• Most active current development is on SWISH
(both SWISH-E and SWISH++), Webglimpse,
ht://Dig and Alkaline
• Alkaline is a new product; all the others have
been through long periods of inactivity, and all
depend on volunteer effort
• All of these are now robot-based but may have
other means of looking at directories as well
• Alkaline is available on Windows NT, but all
the others are Unix-based. Some need to be
compiled.
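To give a flavour of what configuring one of these involves, here is a minimal ht://Dig configuration sketch (the server name, paths and limits are illustrative, not recommended values):

  # htdig.conf - minimal robot-based index of one site
  database_dir:  /opt/htdig/db
  start_url:     http://www.example.ac.uk/
  limit_urls_to: ${start_url}
  exclude_urls:  /cgi-bin/ .cgi
  max_hop_count: 9

The index is then built with the supplied rundig script, and the htsearch CGI program answers queries against it.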
14. Commercial tools
• Most have specialisms - sort out your
requirements very carefully before you select a
shortlist
• Real-money prices vary from US$250 to
£10,000+ (possibly with additional yearly
maintenance), depending on the product
• The cost of most will be on a sliding scale
depending on the size of the index being used
• Bear in mind that Java-based tools will require
the user to be running a Java-enabled browser
15. Case Study 1 - Essex
Platform: Windows NT
Number of servers searched: 16
Number of entries: approx 11,500
File types indexed: Office files, html and txt. Filters available for other
formats
Index updating: Configured with the Windows task scheduler. Incremental
updates possible.
Constrained searches possible: Yes
Configuration: follows robots.txt but can take a 'back door' route as well.
Obeys robots meta tag
Logs and reports: Creates reports on crawling progress. Log analysis not
included but can be written as add-ons (asp scripts)
Pros: Free of charge with Windows NT.
Cons: Needs a high level of Windows NT expertise to set up and run effectively.
May run into problems indexing servers running diverse server software. Not
compatible with Microsoft Index Server (a single-server product). Creates
several catalog files, which may cause network problems when indexing many
servers.
16. Case Study 2 - Oxford
Platform: Unix
Number of servers searched: 131
Number of entries: approx 43,500 (indexing at most 9 levels down on any
server)
File types indexed: Office files, html and txt. Filters available for other
formats
Index updating: Configured to reindex after a set time period. Incremental
updates possible.
Constrained searches possible: Yes but need to be configured on the
ht://Dig server
Configuration: follows robots.txt but can take a 'back door' route as
well.
Logs and reports: none generated in an obvious manner, but probably
available somehow.
Pros: Free of charge. Wide range of configuration options available.
Cons: Needs a high level of Unix expertise to set up and run effectively.
Index files are very large.
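Reindexing after a set time period, as above, is typically driven from cron; a sketch assuming ht://Dig's supplied rundig script and an illustrative install path:

  # crontab entry: rebuild the whole index at 02:00 every Sunday
  0 2 * * 0  /opt/htdig/bin/rundig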
17. Case Study 3 - Cambridge
Platform: Unix
Number of servers searched: 232
Number of entries: approx 188,000
File types indexed: Many formats, including PDF, html and txt.
Index updating: Intelligent incremental reindexing dependent on the
frequency of file updates - can be given permitted schedule. Manual
incremental updates easily done.
Constrained searches possible: Yes, easily configured by users, and can
also be added to the configuration as a known constrained search.
Configuration: follows robots.txt and meta tags. Configurable weighting
given to terms in title and meta tags. Thesaurus add-on available to give
user-controlled alternatives
Logs and reports: Logs and reports available for every aspect of use -
search terms, number of terms, servers searched, etc.
Pros: Very easy to install and maintain. Gives extremely good results in a
problematic environment. Technical support excellent.
Cons: Relatively expensive.
18. Recommendations
• Choosing an appropriate search engine is
wholly dependent on your particular needs and
circumstances
• Sort out all your robot-based indexing controls
when you install your local indexer
• Do review your indexing software regularly - if
it’s trouble-free it still needs maintaining