Institutional Webmasters Workshop 7-9 September 1999
University of
Cambridge
Computing Service
Indexing your web server(s)
Helen Varley Sargan
Why create an index?
• Helps users (and webmasters) to find things
• …but isn’t a substitute for good navigation
• Gives cohesion to a group of unrelated servers
• Observation of logs gives information on what
people are looking for - and what they are
having trouble finding
• You are already being part-indexed by many
search engines, unless you have taken specific
action against it
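The point about observing logs can be sketched in a few lines. This is a minimal illustration only, assuming the search script receives its query as a `q` parameter in the URL. The parameter name, the sample request paths and the `count_search_terms` helper are all assumptions made for the example, not features of any particular indexer.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qs

def count_search_terms(request_paths, param="q"):
    """Count individual search terms found in logged request paths."""
    counts = Counter()
    for path in request_paths:
        query = parse_qs(urlsplit(path).query)
        for value in query.get(param, []):
            # Split each query into lower-cased words and tally them.
            counts.update(value.lower().split())
    return counts

# Hypothetical request paths as they might appear in a web server log.
sample_paths = [
    "/search?q=exam+timetable",
    "/search?q=Timetable",
    "/search?q=library+opening+hours",
    "/index.html",
]

term_counts = count_search_terms(sample_paths)
for term, n in term_counts.most_common(3):
    print(term, n)
```

Terms that are searched for often but rarely lead to a page visit are a hint that navigation to that material needs attention.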
Current situation
Name        Total
ht://Dig       25
Excite         19
Microsoft      12
Harvest         8
Ultraseek       7
SWISH           5
Webinator       4
Netscape        3
wwwwais         3
FreeFind        2
Other          13
None           59

Based on UKOLN survey of search engines used in
160 UK HEIs carried out in July/Aug 1999.
Report to be published in Ariadne issue 21.
See <http://www.ariadne.ac.uk/>.
Current situation questions
• Is the version of Muscat used by Surrey the
free version available for a time (but not any
more)?
• Are the users of Excite quite happy with its
security, and with the fact that development
seems to have ceased?
• Are users of local search engines that don't
use robots.txt happy with what other search
engines can index on their sites? (You have got
a robots.txt file, haven't you?)
Types of tool
• External services are robots
• Tools you install yourself fall into two main
categories (some will work both ways)
– direct indexes of local and/or networked file
structure
– robot- or spider-based following instructions
from the robots.txt file on each web server
indexed
• The programs are either in a form you have to
compile yourself or are precompiled for your
OS, or they are written in Perl or Java, so will
need a Perl or Java runtime to function.
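To make the first category concrete, here is a minimal sketch of a direct indexer that walks a local file structure and builds a word-to-file index. The `index_directory` helper, the file names and the crude tag stripping are assumptions for illustration only. Note that a tool working this way reads files straight from disk, so it never requests robots.txt through the web server.

```python
import os
import re
import tempfile
from collections import defaultdict

def index_directory(root):
    """Build a word -> set of relative paths index by walking the file tree."""
    index = defaultdict(set)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith((".html", ".txt")):
                continue
            full = os.path.join(dirpath, name)
            with open(full, encoding="utf-8", errors="ignore") as fh:
                text = fh.read()
            # Crude tag stripping; a real indexer would parse the HTML properly.
            text = re.sub(r"<[^>]+>", " ", text)
            for word in re.findall(r"[a-z0-9]+", text.lower()):
                index[word].add(os.path.relpath(full, root))
    return index

# Hypothetical document tree, created in a temporary directory for the demo.
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "docs"))
    with open(os.path.join(root, "index.html"), "w", encoding="utf-8") as fh:
        fh.write("<html><title>Computing Service</title></html>")
    with open(os.path.join(root, "docs", "notes.txt"), "w", encoding="utf-8") as fh:
        fh.write("Notes on the computing facilities")
    file_index = index_directory(root)

print(sorted(file_index["computing"]))
```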
Controlling robot access 1
• All of our web servers are being part-indexed
by external robots
• Control of external robots and a local
robot-mediated indexer is by the same route
– a robots.txt file to give access
information
– Meta tags for robots in each HTML file
giving indexing and link-following entry or
exclusion
– Meta tags in each HTML file giving
description and keywords
• The first two controls are observed by all the
major search engines. Some search engines do
not observe description and keyword meta
tags.
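The three controls listed above can be illustrated with a short sketch. It uses Python's standard `urllib.robotparser` and `html.parser` modules to read a robots.txt file and the meta tags of one page. The robots.txt content, the page, the `ExampleBot` user agent, the `www.example.org` host and the `MetaTagReader` class are all invented for the example.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the paths are made up for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

class MetaTagReader(HTMLParser):
    """Collect the content of <meta name="..."> tags, keyed by lower-cased name."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr = dict(attrs)
            name = attr.get("name")
            if name:
                self.meta[name.lower()] = attr.get("content") or ""

# Hypothetical page carrying robots, description and keywords meta tags.
PAGE = """<html><head>
<title>Staff handbook</title>
<meta name="robots" content="noindex, follow">
<meta name="description" content="Handbook for computing staff">
<meta name="keywords" content="handbook, staff, computing">
</head><body><p>Text</p></body></html>"""

reader = MetaTagReader()
reader.feed(PAGE)
directives = {d.strip().lower() for d in reader.meta.get("robots", "").split(",")}

may_fetch_public = robots.can_fetch("ExampleBot", "http://www.example.org/index.html")
may_fetch_private = robots.can_fetch("ExampleBot", "http://www.example.org/private/a.html")
may_index = "noindex" not in directives   # should this page be indexed?
may_follow = "nofollow" not in directives  # should its links be followed?
print(may_fetch_public, may_fetch_private, may_index, may_follow)
```

robots.txt governs which URLs a robot may request at all, while the robots meta tag is only seen once a page has been fetched, so the two work at different stages of a crawl.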
Controlling robot access 2
• Some patchy support for Dublin Core metadata
• Access to branches of the server can be
limited by the server software - by combining
access control with metadata you can give
limited information to some users and more to
others.
• If you don’t want people to read files, either
password-protect that section of the server or
remove them. Limiting robot access to a
directory can make nosey users flock to look at
what’s inside.
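The warning in the last bullet follows from the fact that robots.txt is itself a public file. A minimal sketch, with invented directory names and a hypothetical `disallowed_paths` helper, shows how easily anyone can list the paths a site has asked robots to stay away from:

```python
def disallowed_paths(robots_txt):
    """Return the paths named in Disallow lines, as any visitor could read them."""
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            if path:
                paths.append(path)
    return paths

# Hypothetical robots.txt; the directory names are invented.
ROBOTS_TXT = """\
User-agent: *
Disallow: /exam-papers/draft/   # not ready yet
Disallow: /staff-only/
"""

advertised = disallowed_paths(ROBOTS_TXT)
print(advertised)
```

Excluding robots keeps well-behaved indexers out but does nothing to stop a person typing the path into a browser, which is why access control is the right tool for private material.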
Security
• There has been a security problem with
indexing software (Excite free version in 1998)
• Remember the security of the OS the indexing
software is running under - keep all machines
up-to-date with security patches whether they
are causing trouble or not.
• Seek help with security if you are not an expert
in the OS, particularly with Unix or Windows
NT
What tool to use? 1
• Find out if any money, hardware and/or staff
are available for the project first
• Make a shopping list of your requirements and
conditions
– hosting the index (where)?
– platform (available and desirable)?
– how many servers (and/or pages) will I
index?
– is the indexed data very dynamic?
– what types of files do I want indexed?
– what kind of search (keyword, phrase,
natural language, constrained)?
• Are you concerned about how you are indexed
by others?
What tool to use? 2
• Equipped with the answers to the previous
questions, you will be able to select a suitable
category of tool
• If you are concerned how others index your
site, install a local robot- or spider-based
indexer and look at indexer control measures
• Free externally hosted services for very small
needs
• Free tools (mainly Unix-based) for the
technically literate or built-in to some server
software
• Commercial tools cover a range of platforms
and pocket-depths but vary enormously in
features
Free externally hosted
services
• Will be limited in the number of pages indexed,
possibly in the number of times the index is
accessed, and the index may be deleted if not
used for a certain number of days (5-7)
• Very useful for small sites and/or those with
little technical experience or resources
• Access is prey to Internet traffic (most services
are in US) and server availability, and for UK
users incoming transatlantic traffic will be
charged for
• You may have to have advertising on your
search page as a condition of use
Free tools - built in
• Microsoft, Netscape, WebStar, WebTen and
WebSite Pro all come with built-in indexers
(others may too)
• With any or all of these there may be problems
indexing some other servers, since they are all
using vendor-specific APIs (they may receive
responses from other servers that they can’t
interpret). Problems are more likely with more
and varied server types being indexed
Free tools - installed
• Most active current development on SWISH
(both E and ++), Webglimpse, ht://Dig and
Alkaline
• Alkaline is a new product; all the others have
been through long periods of inactivity, and all
are dependent on volunteer effort
• All of these are now robot based but may have
other means of looking at directories as well
• Alkaline is available on Windows NT, but all
the others are Unix. Some need to be
compiled.
Commercial tools
• Most have specialisms - sort out your
requirements very carefully before you select a
shortlist
• Real money price may vary from US$250 to
£10,000+ (possibly with additional yearly
maintenance), depending on product
• The cost of most will be on a sliding scale
depending on the size of index being used
• Bear in mind that Java-based tools will require
the user to be running a Java-enabled browser
Case Study 1 - Essex
Platform: Windows NT
Number of servers searched: 16
Number of entries: approx 11,500
File types indexed: Office files, html and txt. Filters available for other
formats
Index updating: Configured with Windows Task Scheduler. Incremental
updates possible.
Constrained searches possible: Yes
Configuration: follows robots.txt but can take a 'back door' route as well.
Obeys robots meta tag
Logs and reports: Creates reports on crawling progress. Log analysis not
included but can be written as add-ons (asp scripts)
Pros: Free of charge with Windows NT.
Cons: Needs high level of Windows NT expertise to set up and run it
effectively. May run into problems indexing servers running diverse server
software. Not compatible with Microsoft Index server (a single server product).
Creates several catalog files, which may create network problems when indexing
many servers.
Case Study 2 - Oxford
Platform: Unix
Number of servers searched: 131
Number of entries: approx 43,500 (specifically 9 levels down as a
maximum on any server)
File types indexed: Office files, html and txt. Filters available for other
formats
Index updating: Configured to reindex after a set time period. Incremental
updates possible.
Constrained searches possible: Yes but need to be configured on the
ht://Dig server
Configuration: follows robots.txt but can take a 'back door' route as
well.
Logs and reports: none generated in an obvious manner, but probably
available somehow.
Pros: Free of charge. Wide number of configuration options available.
Cons: Needs high level of Unix expertise to set up and run it effectively.
Index files are very large.
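The depth limit mentioned under "Number of entries" can be pictured as a breadth-first walk that stops following links beyond a set depth. This is a sketch under assumptions: the slide does not say whether "levels" means link depth or directory depth, so link depth is assumed here, and the `crawl` function and in-memory link graph are illustrative rather than a description of how ht://Dig works.

```python
from collections import deque

def crawl(links, start, max_depth):
    """Breadth-first walk of a link graph, visiting pages at most max_depth links from start."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        depth = seen[page]
        if depth == max_depth:
            continue  # do not follow links from pages at the depth limit
        for target in links.get(page, []):
            if target not in seen:
                seen[target] = depth + 1
                queue.append(target)
    return seen

# Hypothetical site structure: page -> pages it links to.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/a/1"],
    "/a/1": ["/a/1/x"],
    "/b": ["/"],
}

reached = crawl(LINKS, "/", max_depth=2)
print(sorted(reached))
```

A lower limit keeps the index smaller and the crawl shorter, at the cost of missing pages buried deep in a site.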
Case Study 3 - Cambridge
Platform: Unix
Number of servers searched: 232
Number of entries: approx 188,000
File types indexed: Many formats, including PDF, html and txt.
Index updating: Intelligent incremental reindexing dependent on the
frequency of file updates - can be given permitted schedule. Manual
incremental updates easily done.
Constrained searches possible: Yes easily configured by users and can
also be added to configuration as a known constrained search.
Configuration: follows robots.txt and meta tags. Configurable weighting
given to terms in title and meta tags. Thesaurus add-on available to give
user-controlled alternatives
Logs and reports: Logs and reports available for every aspect of use -
search terms, number of terms, servers searched, etc.
Pros: Very easy to install and maintain. Gives extremely good results in a
problematic environment. Technical support excellent.
Cons: Relatively expensive.
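The idea of configurable weighting for terms in the title and meta tags can be sketched as a per-field score. The weights, the documents and the `score` function below are assumptions chosen for illustration. They do not describe how the Cambridge indexer actually ranks results.

```python
def score(document, term, weights):
    """Sum per-field occurrence counts of term, scaled by each field's weight."""
    term = term.lower()
    total = 0.0
    for field, weight in weights.items():
        words = document.get(field, "").lower().split()
        total += weight * words.count(term)
    return total

# Hypothetical weights: a hit in the title counts five times a hit in the body.
WEIGHTS = {"title": 5.0, "keywords": 3.0, "body": 1.0}

# Hypothetical documents with their indexed fields.
DOCS = {
    "/library/": {
        "title": "Library opening hours",
        "keywords": "library hours",
        "body": "The library is open daily",
    },
    "/news/": {
        "title": "News",
        "keywords": "",
        "body": "The library will close early on Friday",
    },
}

ranked = sorted(DOCS, key=lambda url: score(DOCS[url], "library", WEIGHTS), reverse=True)
print(ranked)
```

With weighting like this, pages whose authors have supplied accurate titles and keywords rise above pages that merely mention the term in passing.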
Recommendations
• Choosing an appropriate search engine is
wholly dependent on your particular needs and
circumstances
• Sort out all your robot-based indexing controls
when you install your local indexer
• Do review your indexing software regularly -
even if it’s trouble-free it still needs maintaining
