search

Introduction to Informatics - Fall 02
Accessing digital information: Search engines
I. The problem
• How to find the needle in the haystack
II. How search engines work
• Building the index
IV. Types of search engines
V. Problems with search engines

The Problem
The WWW contains more than 2.5 billion pages with 7.3
million pages added each day
The surface Web contains 19 terabytes (trillions of
bytes)
This is where most of our stuff is
There are 7,500 terabytes hidden in the "deep" Web
This is largely proprietary information, dynamically
generated pages, or pages behind firewalls
http://www.howstuffworks.com/news-item127.htm
Without the URLs of the particular pages you want, you
must rely on search engines to uncover potentially
relevant pages

The web lacks bibliographic control standards that we
take for granted in the print world
There is no equivalent to the ISBN to uniquely identify a
document
There is no standard system of cataloguing or
classification, analogous to those developed by the
Library of Congress
There is no central catalogue of the Web's holdings
Many documents lack the name of the author and the
date of publication
Updating? Version control? Not likely!

The net is not a digital library
It was not designed to support the organized
publication and retrieval of information
It has become a chaotic repository for the output of the
hundreds of thousands of digital publishers
It is filled with an amazing variety of digital artifacts
The ephemeral mixes everywhere with works of lasting
importance
The librarian’s classification and selection skills must be
complemented by the computer scientist’s ability to
automate the task of indexing, storing, and providing
access to information
Lynch (1999)
http://www.sciam.com/0397issue/0397lynch.html

How can we find what we want when we want it?
This is becoming an increasingly important problem for
people in the information professions
Strategies:
Follow links from page to page, hoping that you will
stumble across the pages that will help you answer the
question
Maintain a personal collection of bookmarks
Use search engines
~85% of all web sessions begin with or involve a
search

Search engines are a critical web tool
The Web offers the choice of hundreds of different
search tools
The problem is that each has its own database,
command language, search capabilities, and method of
displaying results
Each covers a different portion of the web with some
overlap
This means that you have to learn a variety of search
tools and develop effective search techniques to take
advantage of Web resources

Search engine coverage relative to the estimated size of the
publicly indexable web has decreased substantially since
December 97
No engine indexes more than about 16% of the web
Combined coverage of eleven major search engines is 42%
of the Web
Overlap between individual search engines is low, with
approximately 40% of a given search engine’s content
unique
The biggest search engines are currently FAST and
Google, with over 600 million pages indexed
http://healthlinks.washington.edu/hsl/liaisons/stanna/navweb/nav2.html
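The coverage, combined coverage, and uniqueness figures above are all ratios over sets of pages. A minimal sketch of how such ratios could be computed, assuming each index and the web itself are represented as sets of page identifiers (the toy numbers and the function names `coverage` and `unique_share` are invented for illustration and are not the survey's data or method):

```python
def coverage(index, web):
    """Fraction of the web's pages that appear in an engine's index."""
    return len(index & web) / len(web)

def unique_share(index, others):
    """Fraction of an engine's index that no other engine contains."""
    other_pages = set().union(*others)
    return len(index - other_pages) / len(index)

# Toy data: a 100-page "web" and two partly overlapping indexes.
web = set(range(100))
engine_a = set(range(0, 16))    # pages 0..15
engine_b = set(range(10, 25))   # pages 10..24

single = coverage(engine_a, web)               # one engine alone
combined = coverage(engine_a | engine_b, web)  # both engines together
unique_a = unique_share(engine_a, [engine_b])  # share of A found nowhere else
```

Because the two indexes overlap only partly, the combined coverage is larger than either engine's own coverage but smaller than their sum.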

Search engines are typically more likely to index sites
that have more links to them (more 'popular' sites)
Google does this
Google interprets a link from page A to page B as a
vote, by page A, for page B
But, Google looks at more than the sheer volume of
votes, or links a page receives
It also analyzes the page that casts the vote
Votes cast by pages that are themselves
“important” weigh more heavily and help to make
other pages “important”
http://www.google.com/technology/index.html
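The voting idea can be sketched in a few lines. This is a simplified illustration of link-based ranking, not Google's actual implementation; the damping value of 0.85, the iteration count, the function name `rank_pages`, and the toy link graph are all assumptions made for the example.

```python
def rank_pages(links, damping=0.85, iterations=50):
    """Score pages iteratively; links maps each page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    scores = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote evenly among the pages it links to.
                share = damping * scores[page] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                share = damping * scores[page] / n
                for p in pages:
                    new[p] += share
        scores = new
    return scores

# Toy web: B receives two votes; D is linked from B, F from the unlinked page E.
toy_web = {"A": ["B"], "C": ["B"], "B": ["D"], "E": ["F"], "D": [], "F": []}
scores = rank_pages(toy_web)
```

D and F each receive exactly one link, but D ends up with the higher score because its single vote comes from B, a page that is itself well linked.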

Other difficulties with search engine coverage
Search engines are more likely to index US sites than non-
US sites (AltaVista is an exception), and more likely to
index .com sites than .edu sites
Indexing of new or modified pages by just one of the major
search engines can take months
The pages that one engine indexes do not have extensive
overlap with other indexes’ databases
Lawrence and Giles (1999)
http://www.wwwmetrics.com/

85% of users use search engines to find information (GVU
survey)
We use search engines to locate and buy goods and
research many decisions
Search engines are currently lacking in timeliness and
comprehensiveness and do not index sites equally
The current state of search engines can be compared to a
phone book which is
Updated irregularly
Biased toward listing more popular information
Missing many pages
Filled with duplicate listings

Search engine indexing and ranking may have economic,
social, political, and scientific effects
Indexing and ranking of online stores can substantially
affect economic viability
Some search engines charge for placement and
companies are willing to pay
Delayed indexing of scientific research can lead to the
duplication of work
Delayed or biased indexing may affect social or political
decisions

What do search engines do?
They attempt to index and provide access to the “relevant
web”
This is defined differently by different engines
It ranges from brute force indexing to the use of
algorithms to gauge relevance and popularity
Search engines/tools have four components
The collection of entries for their databases
The structure of their database
The search process
The interface

Data collection is done by
Humans, who review and index in the employ of the
search engine company (Yahoo!)
Humans, by self submission
Software
Software collection agents include automated robot
wanderers, spiders, harvesters, bots, and crawlers
They roam the internet (mostly www, gopher and ftp
sites), and bring back copies of resources
This actually means systematically downloading pages
and following links
They sort, index and create database entries out of them
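The download-and-follow-links loop can be sketched as a breadth-first traversal. To keep the example runnable without network access, the "web" here is a hypothetical in-memory dictionary and `fetch` is a stand-in for a real HTTP download; a production crawler would also need politeness delays, robots.txt handling, and error recovery.

```python
from collections import deque

# Hypothetical miniature web: URL -> (page text, outgoing links).
TOY_WEB = {
    "http://example.org/": ("home page", ["http://example.org/a", "http://example.org/b"]),
    "http://example.org/a": ("page a", ["http://example.org/b"]),
    "http://example.org/b": ("page b", ["http://example.org/", "http://example.org/missing"]),
}

def fetch(url):
    """Stand-in for an HTTP download; returns None for a dead link."""
    return TOY_WEB.get(url)

def crawl(seed, max_pages=100):
    """Breadth-first crawl from seed; returns {url: text} for pages retrieved."""
    copies = {}
    seen = {seed}
    queue = deque([seed])
    while queue and len(copies) < max_pages:
        url = queue.popleft()
        result = fetch(url)
        if result is None:
            continue  # dead link: nothing to bring back
        text, links = result
        copies[url] = text  # keep a copy for the indexer
        for link in links:
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                queue.append(link)
    return copies

collected = crawl("http://example.org/")
```

The `seen` set is what keeps the crawler from looping forever when pages link back to each other.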

The search component concerns the end user
It involves the interface between the human searcher and
the indexed database of resources
Several factors determine the success of a search engine:
The size of the database
The content and coverage of the database
The currency of the entries and frequency of updating
The elimination of redundancy and dead links
The speed of searching
The availability of advanced search features
The interface design and ease of use

Search engines provide “electronic egalitarianism”
Indexing and cataloguing tools are highly democratic
They categorize information differently than human
indexers do
Machine-based approaches to information gathering,
organization and retrieval provide uniform and equal
access to all the information on the Net
This is a source of one of our problems with search engines
We type in a search request and receive thousands of
URLs in response
These results frequently contain references to irrelevant
Web sites while leaving out others that hold important
material

How search engines work
Many search engines use two interdependent approaches
Browsing through subject trees and hierarchies
Keyword searching of an extensive database
A subject tree provides a structured and organized
hierarchy of categories for browsing for information by
subject
Under each category and/or subcategory, links to
appropriate Web pages are listed
Web pages are assigned categories either by the author
or by subject tree administrators
Many subject trees also have their own keyword
searchable indexes
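A subject tree is essentially a nested hierarchy with links at the leaves. A minimal sketch, assuming a tree stored as nested dictionaries (the categories, URLs, and the function name `browse` are hypothetical):

```python
# Hypothetical subject tree: category -> subcategory -> list of links.
SUBJECT_TREE = {
    "Science": {
        "Physics": ["http://example.org/physics-faq"],
        "Biology": ["http://example.org/cell-atlas"],
    },
    "Arts": {
        "Literature": ["http://example.org/beat-poets"],
    },
}

def browse(tree, path):
    """Walk down the tree one category at a time and return what is found there."""
    node = tree
    for category in path:
        node = node[category]
    return node

physics_links = browse(SUBJECT_TREE, ["Science", "Physics"])
arts_subcategories = sorted(browse(SUBJECT_TREE, ["Arts"]))
```

Browsing stops either at a list of links or at a dictionary of further subcategories, which mirrors clicking through a directory page by page.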

Search tools with elaborate subject trees present links
with brief annotations
Examples include Yahoo, Galaxy, and the WWW Virtual
Library
Search engines allow keyword searching of indexes
These are automatically compiled by robots and spiders,
which are constantly collecting net resources
Searchers enter keywords to query the index
Some allow Boolean operators and other advanced
features
Web pages and other Internet resources that satisfy the
query are identified and listed
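Keyword searching of this kind is commonly built on an inverted index, which maps each word to the pages that contain it. A minimal sketch with Boolean AND and OR, assuming pages are already collected as plain text (the page identifiers and function names are invented for the example):

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map every word to the set of page ids whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index

def search_and(index, *terms):
    """Pages that contain every term."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def search_or(index, *terms):
    """Pages that contain at least one term."""
    return set().union(*(index.get(t.lower(), set()) for t in terms))

pages = {
    "p1": "Beat poetry by Ginsberg",
    "p2": "Metronome beat settings",
    "p3": "Ginsberg biography",
}
index = build_index(pages)
both = search_and(index, "beat", "Ginsberg")
either = search_or(index, "metronome", "biography")
```

The query never touches the page text itself: answering it is just set arithmetic on the precomputed index.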

Search engines compete on the
Size of their indexes
Frequency of updating the index
Range of advanced search options
Speed of returning a result set
Result set presentation
Relevance of the items included in a result set
Design of the interface
Overall ease of use
Range of additional services offered

Claimed size and “obscure search” test results
Engine           Size   Expected score   Actual score   Rank
Google           560    1.0              1.0            1
FAST             340    2.0              1.8            2
Northern Light   265    3.0              2.3            3
HotBot           110    4.0              2.3            3
iWon             110    4.0              2.3            3
AltaVista        350    2.0              2.5            4
Yahoo-Google     560    1.0              3.0            5
Excite           250    3.0              3.0            5
Yahoo-Inktomi    110    4.0              4.3            6
Data from: July 2002, Searchengine Showdown
http://www.searchenginewatch.com/sereport/00/07-sizetest.html

How big are they?
[Chart comparing Google, FAST, AltaVista, Inktomi, and Northern Light]
SearchEngineWatch, 7/02
http://www.searchenginewatch.com/reports/sizes.html

How big are they?
[Chart comparing Google, FAST, AltaVista, Inktomi, and Northern Light]

Recent activity (indexing)
[Chart comparing Google, FAST, AltaVista, Inktomi, and Northern Light]

Search engines are powered by robots, indexing software,
and “ontologists” who classify, sort, and arrange the Web
into a searchable matrix
The most popular search engines are always among
the most visited sites on the Net
Competition is high for the advertising dollars that
keep these search tools free of charge
Despite their similar approaches to scanning the Internet,
search engines don't always turn up the same results
Depending on the type of search being conducted, one
engine might give you more satisfactory results than
another

Three methods for indexing web resources
Full text index
Includes all terms and URLs
Uses filters to remove words not important to searching
Keyword index
Based on the location and frequency of words and
phrases
If a term is mentioned only once or twice, it won’t be
indexed
Human index
Created by individuals who review pages and select the
best words and phrases to describe their content
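The first two methods differ mainly in which terms survive into the index. A rough sketch, assuming a small hand-picked stop-word list and a frequency threshold of three occurrences (both are illustrative choices; the sketch ignores word location, which a real keyword index would also weigh):

```python
import re
from collections import Counter

# Assumed filter list of words not important to searching.
STOP_WORDS = {"a", "an", "and", "in", "is", "it", "of", "the", "to"}

def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def full_text_terms(text):
    """Full text index: keep every term except the filtered stop words."""
    return {w for w in tokenize(text) if w not in STOP_WORDS}

def keyword_terms(text, min_count=3):
    """Keyword index: keep only terms that occur at least min_count times."""
    counts = Counter(w for w in tokenize(text) if w not in STOP_WORDS)
    return {w for w, c in counts.items() if c >= min_count}

page = "The spider visits the page. The spider copies the page and the spider indexes it."
full_terms = full_text_terms(page)
key_terms = keyword_terms(page)
```

Here "page" appears only twice, so it is present in the full text index but dropped from the keyword index.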

Engines use index searches, concept searches, or
browsing
Index searching
Many search engines use this method because it casts a
wider net than a catalog does
Results come from a dynamic index of pages and use an
algorithm to sort documents to determine relevance
For instance, the number of times a keyword appears as
well as its proximity to the top of the document
They don’t recognize context, synonyms, or homonyms
Searching "beat" returns Ginsberg and Burroughs but
also pages on metronomes, raves, and gingersnaps
There are problems of redundancy and dead links
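A toy version of such a relevance algorithm, assuming the score is simply the keyword count plus a bonus for appearing near the top of the document (this weighting is invented for illustration; real engines use their own formulas):

```python
import re

def relevance(text, keyword):
    """Score = number of occurrences + a bonus between 0 and 1 for an early first hit."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    positions = [i for i, w in enumerate(words) if w == keyword.lower()]
    if not positions:
        return 0.0
    frequency = len(positions)
    top_bonus = 1.0 - positions[0] / len(words)
    return frequency + top_bonus

def rank(pages, keyword):
    """Return ids of matching pages, most relevant first."""
    scored = [(relevance(text, keyword), url) for url, text in pages.items()]
    return [url for score, url in sorted(scored, reverse=True) if score > 0]

pages = {
    "poets": "Beat poets Ginsberg and Burroughs defined the beat generation",
    "metronome": "A metronome keeps a steady beat",
    "cookies": "Gingersnap recipe: beat the butter and sugar",
    "library": "Library of Congress catalogue",
}
ordered = rank(pages, "beat")
```

All three pages containing "beat" are returned, including the metronome and gingersnap pages, because the score looks only at the string and not at its meaning.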

Concept searching
With this type of search, your search term is treated as a
concept and not a keyword
If you type a word in the search box, you search for that
word, other forms of the word, and synonyms
The search also includes other words that are highly
statistically related to that word
A concept search looks for ideas related to a literal
query
Excite uses this strategy
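One simple way to approximate a concept search is query expansion: the query term is replaced by a set of related terms before matching. The sketch below assumes a small hand-built table of related terms; how Excite actually derives its statistical relationships is not described here, so the table and function names are purely hypothetical.

```python
# Hypothetical table of word forms and synonyms for each concept.
RELATED_TERMS = {
    "car": {"cars", "automobile", "vehicle"},
}

def expand_query(term):
    """Return the term together with its related forms and synonyms."""
    term = term.lower()
    return {term} | RELATED_TERMS.get(term, set())

def concept_search(pages, term):
    """Return ids of pages that mention the term or any related term."""
    wanted = expand_query(term)
    hits = []
    for url, text in pages.items():
        words = set(text.lower().split())
        if words & wanted:
            hits.append(url)
    return sorted(hits)

pages = {
    "p1": "used automobile prices",
    "p2": "car repair tips",
    "p3": "train timetables",
}
matches = concept_search(pages, "car")
```

A plain keyword search for "car" would miss p1, which only uses the word "automobile".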

Browsing services exist in great numbers on the net
These are systematically grouped hotlists, starting points
or systematic lists of interesting resources
These pages are typically smaller and well-maintained
The browsing structures typically do not use a controlled
system of knowledge-structuring, or an established
classification system
Selection, classification and description of the
resources are made by the list owner using
idiosyncratic criteria
Browsing systems covering rapidly changing areas are
more difficult to maintain because they often don’t have
automatic mechanisms for rapid and continual updating

What are the different types of search engines?
Single, niche, and multiple-threaded search engines
Single search engines
These engines operate alone
Your query is run against a single database and/or index
A directory search tool allows searches by subject matter
It is a hierarchical search that starts with a general
subject heading and follows with more specific
subheadings
The information is reviewed and indexed by humans
However, the number of reviews is limited
Yahoo is an example

Niche search engines
These engines are like single search engines, except
that they cover a restricted subset of resources
Examples might include engines for business,
engineering, physics, or government information
A very restricted version of a niche engine only allows
you to search a single site
Northern Light is a good example

Multiple-threaded search engines
These are also called meta-search engines
These engines submit your query to two or more search
engines simultaneously
They gather and display the results as a single page
These engines compete on the basis of the number and
variety of engines they allow you to search
These engines are becoming more popular
One problem is the amount of redundancy in the returns
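The gather-and-merge step, including the removal of redundant returns, can be sketched as follows. The two engine functions are hypothetical stand-ins that return canned results, and the sketch queries them one after another rather than simultaneously; treating URLs that differ only by a trailing slash as duplicates is also an assumption.

```python
def engine_one(query):
    """Hypothetical engine returning a fixed result list."""
    return ["http://example.org/a", "http://example.org/b/"]

def engine_two(query):
    """Second hypothetical engine with partly overlapping results."""
    return ["http://example.org/b", "http://example.org/c"]

def meta_search(query, engines):
    """Send the query to every engine and merge results, dropping duplicates."""
    merged = []
    seen = set()
    for engine in engines:
        for url in engine(query):
            key = url.rstrip("/")  # crude normalization to catch near-duplicates
            if key not in seen:
                seen.add(key)
                merged.append(url)
    return merged

results = meta_search("search engines", [engine_one, engine_two])
```

Without the `seen` set, the page listed by both engines would appear twice on the combined results page.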

What are the problems with search engines?
There are weaknesses and problems common to all
attempts to index the Internet
These are still more important than the limitations of
single search services
The theoretical problem of indexing virtual hypertext
It is not economical and not even possible to index all
information on the Internet in “full text”
It is necessary to define the limit of documents and
information units in order to allow a target-oriented
access while searching
In comparison with the world of printed information this
involves considerable difficulties

The information units are considerably smaller and less
defined
“Containers” like a book, a series, a journal title or an
issue do not occur often
The information units range in size from a whole
server or service to single text strings or icons
The mix of different types of information on the net makes
uniform and homogenous indexing and searching
impossible
Document types include: directories, lists, menus, full
text of everyday electronic mail, scientific articles and
books, field-structured database records, software,
audio, video, images and numerical information

Considering the great number of authors on the net and
their different abilities, the quality of input into the search
services varies a great deal
Often it is so poor that the search-results are seriously
influenced in a negative way
There is incorrect, uncontrolled HTML coding and
incomplete use of important content-describing metadata
like titles or keywords
Functional markup is applied incorrectly or abused for
layout purposes, and, in reverse, layout markup is
abused to convey functional structure

Other problems include
Terminological weaknesses, incorrect formulation of
titles and headings, and ambiguity
Inability to distinguish between permanent and
temporary documents
There are problems with harvesting methods, indexing
programs, IR-methods, and user interfaces
Performance problems

Dead links on search engines
Search engine    % Dead links   % 400 errors
AltaVista        13.7%          9.3%
Excite           8.7%           5.7%
Northern Light   5.7%           2.0%
Google           4.3%           3.3%
HotBot           2.3%           2.0%
FAST             2.3%           1.8%
MSN Inktomi      1.7%           1.0%
Anzwers          1.3%           0.07%
Data from: Aug. 14, 2001, Searchengine Showdown
http://www.notess.com/search/stats/dead.shtml
