Introduction to Informatics - Fall 02
Accessing digital information: Search engines
I. The problem
• How to find the needle in the haystack
II. How search engines work
• Building the index
IV. Types of search engines
V. Problems with search engines
The Problem
The WWW contains more than 2.5 billion pages with 7.3
million pages added each day
The surface Web contains 19 terabytes (trillions of
bytes)
This is where most of our stuff is
There are 7,500 terabytes hidden in the "deep" Web
This is largely proprietary information, dynamically
generated pages, or pages behind firewalls
http://www.howstuffworks.com/news-item127.htm
Without the URLs of the particular pages you want, you
must rely on search engines to uncover potentially
relevant pages
The web lacks bibliographic control standards that we
take for granted in the print world
There is no equivalent to the ISBN to uniquely identify a
document
There is no standard system of cataloguing or
classification analogous to those developed by the
Library of Congress
There is no central catalogue of the Web's holdings
Many documents lack the name of the author and the
date of publication
Updating? Version control? Not likely!
The net is not a digital library
It was not designed to support the organized
publication and retrieval of information
It has become a chaotic repository for the output of the
hundreds of thousands of digital publishers
It is filled with an amazing variety of digital artifacts
The ephemeral mixes everywhere with works of lasting
importance
The librarian’s classification and selection skills must be
complemented by the computer scientist’s ability to
automate the task of indexing, storing, and providing
access to information
Lynch (1997)
http://www.sciam.com/0397issue/0397lynch.html
How can we find what we want when we want it?
This is becoming an increasingly important problem for
people in the information professions
Strategies:
Follow links from page to page, hoping that you will
stumble across the pages that will help you answer the
question
Maintain a personal collection of bookmarks
Use search engines
~85% of all web sessions begin with or involve a
search
Search engines are a critical web tool
The Web offers the choice of hundreds of different
search tools
The problem is that each has its own database,
command language, search capabilities, and method of
displaying results
Each covers a different portion of the web with some
overlap
This means that you have to learn a variety of search
tools and develop effective search techniques to take
advantage of Web resources
Search engine coverage relative to the estimated size of the
publicly indexable web has decreased substantially since
December 1997
No engine indexes more than about 16% of the web
Combined coverage of eleven major search engines is 42%
of the Web
Overlap between individual search engines is low, with
approximately 40% of a given search engine’s content
unique
The biggest search engines are currently FAST and
Google, with over 600 million pages indexed
http://healthlinks.washington.edu/hsl/liaisons/stanna/navweb/nav2.html
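The coverage and overlap figures above are set arithmetic. As a minimal sketch, assuming a made-up 100-page web and two made-up engine indexes (the numbers are illustrative only, not the survey data), coverage and the unique share of an index can be computed like this:

```python
def coverage(index, web):
    """Fraction of the web that one index contains."""
    return len(index & web) / len(web)


def unique_share(index, other_indexes):
    """Fraction of an index's pages found in none of the other indexes."""
    others = set().union(*other_indexes)
    return len(index - others) / len(index)


# Toy data (hypothetical): page IDs stand in for URLs.
WEB = set(range(100))
ENGINE_A = set(range(0, 16))    # pages 0-15
ENGINE_B = set(range(10, 24))   # pages 10-23

print(coverage(ENGINE_A, WEB))             # coverage of one engine
print(coverage(ENGINE_A | ENGINE_B, WEB))  # combined coverage of both engines
print(unique_share(ENGINE_A, [ENGINE_B]))  # share of A's pages that B lacks
```

Because the two toy indexes overlap, their combined coverage is smaller than the sum of the individual coverages.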
Search engines are typically more likely to index sites
that have more links to them (more 'popular' sites)
Google does this
Google interprets a link from page A to page B as a
vote, by page A, for page B
But Google looks at more than the sheer volume of
votes, or links, a page receives
It also analyzes the page that casts the vote
Votes cast by pages that are themselves
“important” weigh more heavily and help to make
other pages “important”
http://www.google.com/technology/index.html
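The voting idea above can be sketched with a simplified PageRank-style iteration. This is a teaching sketch under stated assumptions, not Google's actual implementation: the toy link graph, the damping factor of 0.85, and the fixed iteration count are all choices made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its vote over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical five-page web. B and E each receive exactly one link,
# but B's link comes from A, which is itself heavily "voted for".
TOY_WEB = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C", "E"],
    "E": ["C"],
}

ranks = pagerank(TOY_WEB)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph B ends up ranked above E even though each has a single incoming link, because B's vote comes from a more "important" page.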
Other difficulties with search engine coverage
Search engines are more likely to index US sites than
non-US sites (AltaVista is an exception), and more likely to
index .com sites than .edu sites
Indexing of new or modified pages by just one of the major
search engines can take months
The pages that one engine indexes do not have extensive
overlap with other indexes’ databases
Lawrence and Giles (1999)
http://www.wwwmetrics.com/
85% of users use search engines to find information (GVU
survey)
We use search engines to locate and buy goods and
research many decisions
Search engines are currently lacking in timeliness and
comprehensiveness and do not index sites equally
The current state of search engines can be compared to a
phone book which is
Updated irregularly
Biased toward listing more popular information
Missing many pages
Filled with duplicate listings
Search engine indexing and ranking may have economic,
social, political, and scientific effects
Indexing and ranking of online stores can substantially
affect economic viability
Some search engines charge for placement and
companies are willing to pay
Delayed indexing of scientific research can lead to the
duplication of work
Delayed or biased indexing may affect social or political
decisions
What do search engines do?
They attempt to index and provide access to the “relevant
web”
This is defined differently by different engines
It ranges from brute force indexing to the use of
algorithms to gauge relevance and popularity
Search engines/tools have four components
The collection of entries for their databases
The structure of their database
The search process
The interface
Data collection is done by
Humans, who review and index in the employ of the
search engine company (Yahoo!)
Humans, by self submission
Software
Software collection agents include automated robot
wanderers, spiders, harvesters, bots, and crawlers
They roam the Internet (mostly WWW, Gopher, and FTP
sites) and bring back copies of resources
This actually means systematically downloading pages
and following links
They sort, index and create database entries out of them
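The collection loop described above (download a page, keep a copy, follow its links) can be sketched as a breadth-first crawl. This is a minimal sketch: `FAKE_WEB`, the `example.test` URLs, and the `fetch` callback are hypothetical stand-ins so the example runs without network access. A real crawler would also need politeness rules, URL normalization, and error handling.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical in-memory "web": URL -> HTML source.
FAKE_WEB = {
    "http://example.test/": '<a href="http://example.test/a">A</a> '
                            '<a href="http://example.test/b">B</a>',
    "http://example.test/a": '<a href="http://example.test/b">B</a>',
    "http://example.test/b": '<a href="http://example.test/">home</a> '
                             '<a href="http://example.test/missing">gone</a>',
}


def crawl(seed, fetch, max_pages=100):
    """Download pages breadth-first from seed, following links once each."""
    queue = deque([seed])
    seen = {seed}
    copies = {}
    while queue and len(copies) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # dead link: nothing to copy or follow
        copies[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return copies


pages = crawl("http://example.test/", FAKE_WEB.get)
print(sorted(pages))
```

The returned copies are what the indexing step would then sort and turn into database entries.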
The search component concerns the end user
It involves the interface between the human searcher and
the indexed database of resources
Several factors determine the success of a search engine:
The size of the database
The content and coverage of the database
The currency of the entries and frequency of updating
The elimination of redundancy and dead links
The speed of searching
The availability of advanced search features
The interface design and ease of use
Search engines provide “electronic egalitarianism”
Indexing and cataloguing tools are highly democratic
They categorize information differently than human
indexers do
Machine-based approaches to information gathering,
organization and retrieval provide uniform and equal
access to all the information on the Net
This is a source of one of our problems with search engines
We type in a search request and receive thousands of
URLs in response
These results frequently contain references to irrelevant
Web sites while leaving out others that hold important
material
How search engines work
Many search engines use two interdependent approaches
Browsing through subject trees and hierarchies
Keyword searching of an extensive database
A subject tree provides a structured and organized
hierarchy of categories for browsing for information by
subject
Under each category and/or subcategory, links to
appropriate Web pages are listed
Web pages are assigned categories either by the author
or by subject tree administrators
Many subject trees also have their own keyword
searchable indexes
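A subject tree of this kind can be modelled as nested categories with link lists at the leaves. The sketch below is illustrative only: the category names and `example.test` URLs are invented, and real directories store annotations and more metadata per link.

```python
# Hypothetical subject tree: category -> subcategory -> list of links.
SUBJECT_TREE = {
    "Science": {
        "Physics": ["http://example.test/physics-faq"],
        "Biology": ["http://example.test/cell-atlas"],
    },
    "Arts": {
        "Music": ["http://example.test/jazz-history"],
    },
}


def browse(tree, path):
    """Walk down the hierarchy one category at a time."""
    node = tree
    for category in path:
        node = node[category]
    return node


def all_links(node):
    """Collect every link listed at or below a node."""
    if isinstance(node, list):
        return list(node)
    links = []
    for child in node.values():
        links.extend(all_links(child))
    return links


print(browse(SUBJECT_TREE, ["Science", "Physics"]))
print(all_links(browse(SUBJECT_TREE, ["Science"])))
```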
Search tools with elaborate subject trees present links
with brief annotations
Examples include Yahoo, Galaxy, and the WWW Virtual
Library
Search engines allow keyword searching of indexes
These are automatically compiled by robots and spiders,
which are constantly collecting net resources
Searchers enter keywords to query the index
Some allow Boolean operators and other advanced
features
Web pages and other Internet resources that satisfy the
query are identified and listed
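Keyword searching with Boolean operators is commonly built on an inverted index, which maps each term to the set of pages that contain it. The following is a minimal sketch with hypothetical documents and helper names. It does not reproduce any particular engine's query syntax.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map every term to the set of pages in which it occurs."""
    index = defaultdict(set)
    for url, text in docs.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


def search_and(index, *terms):
    """Pages containing all of the terms."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()


def search_or(index, *terms):
    """Pages containing at least one of the terms."""
    return set().union(*(index.get(t.lower(), set()) for t in terms))


def search_not(index, include, exclude):
    """Pages containing one term but not the other."""
    return index.get(include.lower(), set()) - index.get(exclude.lower(), set())


# Hypothetical page texts.
DOCS = {
    "page1": "Search engines index the web",
    "page2": "Spiders crawl the web and collect pages",
    "page3": "Library catalogues index books",
}

index = build_index(DOCS)
print(search_and(index, "index", "web"))
print(sorted(search_or(index, "spiders", "books")))
print(search_not(index, "index", "web"))
```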
Search engines compete on the
Size of their indexes
Frequency of updating the index
Range of advanced search options
Speed of returning a result set
Result set presentation
Relevance of the items included in a result set
Design of the interface
Overall ease of use
Range of additional services offered
Claimed size and “obscure search” test results

Engine           Size   Expected score   Actual score   Rank
Google            560        1.0             1.0          1
FAST              340        2.0             1.8          2
Northern Light    265        3.0             2.3          3
HotBot            110        4.0             2.3          3
iWon              110        4.0             2.3          3
AltaVista         350        2.0             2.5          4
Yahoo-Google      560        1.0             3.0          5
Excite            250        3.0             3.0          5
Yahoo-Inktomi     110        4.0             4.3          6

Data from: July 2002, Search Engine Showdown
http://www.searchenginewatch.com/sereport/00/07-sizetest.html
How big are they?
[Chart comparing Google, FAST, AltaVista, Inktomi, and Northern Light]
SearchEngineWatch, 7/02
http://www.searchenginewatch.com/reports/sizes.html
How big are they? (continued)
[Chart comparing Google, FAST, AltaVista, Inktomi, and Northern Light]
SearchEngineWatch, 7/02
http://www.searchenginewatch.com/reports/sizes.html
Recent activity (indexing)
[Chart comparing Google, FAST, AltaVista, Inktomi, and Northern Light]
SearchEngineWatch, 7/02
http://www.searchenginewatch.com/reports/sizes.html
Search engines are powered by robots, indexing software,
and “ontologists” who classify, sort, and arrange the Web
into a searchable matrix
The most popular search engines are always among
the most visited sites on the Net
Competition is high for the advertising dollars that
keep these search tools free of charge
Despite their similar approaches to scanning the Internet,
search engines don't always turn up the same results
Depending on the type of search being conducted, one
engine might give you more satisfactory results than
another
Three methods for indexing web resources
Full text index
Includes all terms and URLs
Uses filters to remove words not important to searching
Keyword index
Based on the location and frequency of words and
phrases
If a term is mentioned only once or twice, it won’t be
indexed
Human index
Created by individuals who review pages and select the
best words and phrases to describe their content
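The difference between the first two methods can be sketched in a few lines. The stop-word list, the sample text, and the threshold of three mentions (following "once or twice, it won't be indexed") are assumptions for illustration. The sketch ignores word location, which a real keyword index would also weigh. A human index is shown only as a hand-written set of terms.

```python
import re
from collections import Counter

# Assumed stop-word filter; real engines use much longer lists.
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def full_text_terms(text):
    """Full text index: every term except the filtered stop words."""
    return {t for t in tokenize(text) if t not in STOP_WORDS}


def keyword_terms(text, min_count=3):
    """Keyword index: only terms mentioned at least min_count times."""
    counts = Counter(t for t in tokenize(text) if t not in STOP_WORDS)
    return {t for t, c in counts.items() if c >= min_count}


# Hypothetical page text.
SAMPLE = ("Jazz history: jazz began in New Orleans. "
          "Early jazz bands toured the South.")

# Human index: terms a reviewer might choose by hand (invented here).
HUMAN_TERMS = {"jazz", "music history"}

print(sorted(full_text_terms(SAMPLE)))
print(sorted(keyword_terms(SAMPLE)))
```

Note that the human indexer can assign a phrase such as "music history" that never appears in the page itself, which neither automatic method can do.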
Engines use index searches, concept searches, or
browsing
Index searching
Many search engines use this method because it casts a
wider net than a catalog does
Results come from a dynamic index of pages, and an
algorithm sorts the documents by relevance
For instance, it may count the number of times a keyword
appears and how close it is to the top of the document
They don’t recognize context, synonyms, or homonyms
Searching "beat" returns Ginsberg and Burroughs but
also pages on metronomes, raves, and gingersnaps
There are problems of redundancy and dead links
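A toy relevance score along these lines counts how often the keyword appears and adds a bonus when its first occurrence is near the top of the document. The scoring formula and the sample pages are assumptions made for this sketch, not any engine's actual algorithm.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def relevance(query, text):
    """Keyword frequency plus a bonus for appearing near the top."""
    tokens = tokenize(text)
    term = query.lower()
    positions = [i for i, t in enumerate(tokens) if t == term]
    if not positions:
        return 0.0
    frequency = len(positions)
    # 1.0 if the keyword is the first word, approaching 0 toward the end.
    top_bonus = 1.0 - positions[0] / len(tokens)
    return frequency + top_bonus


def rank(query, docs):
    """Names of matching documents, most relevant first."""
    scored = [(relevance(query, text), name) for name, text in docs.items()]
    return [name for score, name in sorted(scored, reverse=True) if score > 0]


# Hypothetical pages echoing the "beat" example above.
DOCS = {
    "poets": "Beat poets: Ginsberg and Burroughs shaped the beat generation",
    "metronome": "A metronome helps a musician keep a steady beat",
    "cookies": "Gingersnap recipe: beat the butter and sugar",
    "library": "Library of Congress classification",
}

print(rank("beat", DOCS))
```

All three pages containing "beat" are returned, whatever sense of the word they use, which is the lack of context described above.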
Concept searching
With this type of search, your search term is treated as a
concept and not a keyword
If you type a word in the search box, you search for that
word, other forms of the word, and synonyms
The search also includes other words that are highly
statistically related to that word
A concept search looks for ideas related to a literal
query
Excite uses this strategy
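One simple way to approximate a concept search is query expansion: the search term is replaced by a set containing the term, its other forms, and related words. The `RELATED` table below is hand-written and hypothetical. A real system would derive such relations statistically, and this sketch makes no claim about how Excite implemented it.

```python
import re

# Hypothetical table of word forms, synonyms, and related terms.
RELATED = {
    "car": {"cars", "automobile", "vehicle"},
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def expand(term):
    """The term itself plus everything listed as related to it."""
    return {term} | RELATED.get(term, set())


def keyword_search(term, docs):
    """Literal match: the exact term must occur in the page."""
    return sorted(n for n, text in docs.items() if term.lower() in tokenize(text))


def concept_search(term, docs):
    """Match pages containing the term or any related word."""
    wanted = expand(term.lower())
    return sorted(n for n, text in docs.items() if wanted & set(tokenize(text)))


# Hypothetical page texts.
DOCS = {
    "d1": "Used cars for sale",
    "d2": "Automobile repair manual",
    "d3": "Train timetables",
}

print(keyword_search("car", DOCS))
print(concept_search("car", DOCS))
```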
Browsing services exist in great numbers on the net
These are systematically grouped hotlists, starting points
or systematic lists of interesting resources
These pages are typically smaller and well-maintained
The browsing structures typically do not use a controlled
system of knowledge-structuring, or an established
classification system
Selection, classification and description of the
resources are made by the list owner using
idiosyncratic criteria
Browsing systems covering rapidly changing areas are
more difficult to maintain because they often don’t have
automatic mechanisms for rapid and continual updating
What are the different types of search engines?
Single, niche, and multiple-threaded search engines
Single search engines
These engines operate alone
Your query is run against a single database and/or index
A directory search tool allows searches by subject matter
It is a hierarchical search that starts with a general
subject heading and follows with more specific
subheadings
The information is reviewed and indexed by humans
However, the number of reviews is limited
Yahoo is an example
Niche search engines
These engines are like single search engines, except
that they cover a restricted subset of resources
Examples might include engines for business,
engineering, physics, or government information
A very restricted version of a niche engine only allows
you to search a single site
Northern Light is a good example
Multiple-threaded search engines
These are also called meta-search engines
These engines submit your query to two or more search
engines simultaneously
They gather and display the results as a single page
These engines compete on the basis of the number and
variety of engines they allow you to search
These engines are becoming more popular
One problem is the amount of redundancy in the returns
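The meta-search flow above (send one query to several engines at once, then merge the answers into a single page) can be sketched as follows. `engine_a` and `engine_b` are hypothetical stand-ins that return canned URL lists. A real meta-search engine would call remote services and would also have to reconcile their different ranking schemes.

```python
from concurrent.futures import ThreadPoolExecutor


def engine_a(query):
    """Hypothetical engine: returns canned results for any query."""
    return ["http://example.test/1", "http://example.test/2"]


def engine_b(query):
    """Hypothetical engine whose results overlap with engine_a."""
    return ["http://example.test/2", "http://example.test/3"]


def meta_search(query, engines):
    """Query all engines concurrently and merge results without duplicates."""
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        # map() returns the result lists in the same order as the engines.
        result_lists = list(pool.map(lambda engine: engine(query), engines))
    merged = []
    seen = set()
    for results in result_lists:
        for url in results:
            if url not in seen:  # drop redundant returns
                seen.add(url)
                merged.append(url)
    return merged


print(meta_search("search engines", [engine_a, engine_b]))
```

Removing exact duplicate URLs addresses only part of the redundancy problem, since the same page may be reachable under several addresses.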
What are the problems with search engines?
There are weaknesses and problems common to all
attempts to index the Internet
These are still more important than the limitations of
single search services
The theoretical problem of indexing virtual hypertext
It is not economical and not even possible to index all
information on the Internet in “full text”
It is necessary to define the limits of documents and
information units in order to allow target-oriented
access while searching
In comparison with the world of printed information this
involves considerable difficulties
The information units are considerably smaller and less
defined
“Containers” like a book, a series, a journal title or an
issue do not occur often
The information units range in size from a whole
server or service to single text strings or icons
The mix of different types of information on the net makes
uniform and homogeneous indexing and searching
impossible
Document types include: directories, lists, menus, the
full text of everyday electronic mail, scientific articles and
books, field-structured database records, software,
audio, video, images and numerical information
Considering the great number of authors on the net and
their different abilities, the quality of input into the search
services varies a great deal
Often it is so poor that the search results suffer
seriously
There is incorrect, uncontrolled HTML coding and
incomplete use of important content-describing metadata
like titles or keywords
Functional text markup is used incorrectly or abused for
layout purposes, and the reverse also occurs: layout
markup is abused to indicate function
Other problems include
Terminological weaknesses, incorrect formulation of
titles and headings, and ambiguity
Inability to distinguish between permanent and
temporary documents
There are problems with harvesting methods, indexing
programs, IR-methods, and user interfaces
Performance problems
Dead links on search engines

Search engine    % Dead links   % 400 errors
AltaVista           13.7%           9.3%
Excite               8.7%           5.7%
Northern Light       5.7%           2.0%
Google               4.3%           3.3%
HotBot               2.3%           2.0%
FAST                 2.3%           1.8%
MSN Inktomi          1.7%           1.0%
Anzwers              1.3%           0.07%

Data from: Aug. 14, 2001, Search Engine Showdown
http://www.notess.com/search/stats/dead.shtml

More Related Content

Similar to search

Enhanced Web Usage Mining Using Fuzzy Clustering and Collaborative Filtering ...
Enhanced Web Usage Mining Using Fuzzy Clustering and Collaborative Filtering ...Enhanced Web Usage Mining Using Fuzzy Clustering and Collaborative Filtering ...
Enhanced Web Usage Mining Using Fuzzy Clustering and Collaborative Filtering ...
inventionjournals
 
Internet browsing techniques
Internet browsing techniquesInternet browsing techniques
Internet browsing techniques
Tola Odugbesan
 
ΟΚΤΩΒΡΙΟΣ 2010
ΟΚΤΩΒΡΙΟΣ 2010ΟΚΤΩΒΡΙΟΣ 2010
ΟΚΤΩΒΡΙΟΣ 2010
steverz
 

Similar to search (20)

International conference On Computer Science And technology
International conference On Computer Science And technologyInternational conference On Computer Science And technology
International conference On Computer Science And technology
 
Deep Web: Databases on the Web
Deep Web: Databases on the WebDeep Web: Databases on the Web
Deep Web: Databases on the Web
 
Sweeny Seo30 Web20 Finalversion
Sweeny Seo30 Web20 FinalversionSweeny Seo30 Web20 Finalversion
Sweeny Seo30 Web20 Finalversion
 
Sweeny Seo30 Web20 Final
Sweeny Seo30 Web20 FinalSweeny Seo30 Web20 Final
Sweeny Seo30 Web20 Final
 
Search engine and web crawler
Search engine and web crawlerSearch engine and web crawler
Search engine and web crawler
 
Pagerank
PagerankPagerank
Pagerank
 
Searching the web general
Searching the web generalSearching the web general
Searching the web general
 
Search engines by Gulshan K Maheshwari(QAU)
Search engines by Gulshan  K Maheshwari(QAU)Search engines by Gulshan  K Maheshwari(QAU)
Search engines by Gulshan K Maheshwari(QAU)
 
Enhanced Web Usage Mining Using Fuzzy Clustering and Collaborative Filtering ...
Enhanced Web Usage Mining Using Fuzzy Clustering and Collaborative Filtering ...Enhanced Web Usage Mining Using Fuzzy Clustering and Collaborative Filtering ...
Enhanced Web Usage Mining Using Fuzzy Clustering and Collaborative Filtering ...
 
IRJET - Re-Ranking of Google Search Results
IRJET - Re-Ranking of Google Search ResultsIRJET - Re-Ranking of Google Search Results
IRJET - Re-Ranking of Google Search Results
 
IST 561 Spring 2007--Session7, Sources of Information
IST 561 Spring 2007--Session7, Sources of InformationIST 561 Spring 2007--Session7, Sources of Information
IST 561 Spring 2007--Session7, Sources of Information
 
web mining
web miningweb mining
web mining
 
Uw Digital Communications Social Media Is Not Search
Uw Digital Communications Social Media Is Not SearchUw Digital Communications Social Media Is Not Search
Uw Digital Communications Social Media Is Not Search
 
History page-brin thesis - anatomy of a large scale hypertextual web search...
History   page-brin thesis - anatomy of a large scale hypertextual web search...History   page-brin thesis - anatomy of a large scale hypertextual web search...
History page-brin thesis - anatomy of a large scale hypertextual web search...
 
The beginners guide to SEO
The beginners guide to SEOThe beginners guide to SEO
The beginners guide to SEO
 
Web content mining
Web content miningWeb content mining
Web content mining
 
Seo
SeoSeo
Seo
 
Internet Tutorial 03
Internet  Tutorial 03Internet  Tutorial 03
Internet Tutorial 03
 
Internet browsing techniques
Internet browsing techniquesInternet browsing techniques
Internet browsing techniques
 
ΟΚΤΩΒΡΙΟΣ 2010
ΟΚΤΩΒΡΙΟΣ 2010ΟΚΤΩΒΡΙΟΣ 2010
ΟΚΤΩΒΡΙΟΣ 2010
 

More from ssuserbad56d (7)

search
searchsearch
search
 
Scaling Web Applications with Cassandra Presentation.ppt
Scaling Web Applications with Cassandra Presentation.pptScaling Web Applications with Cassandra Presentation.ppt
Scaling Web Applications with Cassandra Presentation.ppt
 
Cassandra
CassandraCassandra
Cassandra
 
Redis
RedisRedis
Redis
 
Covered Call
Covered CallCovered Call
Covered Call
 
Lec04.pdf
Lec04.pdfLec04.pdf
Lec04.pdf
 
Project.pdf
Project.pdfProject.pdf
Project.pdf
 

Recently uploaded

Query optimization and processing for advanced database systems
Query optimization and processing for advanced database systemsQuery optimization and processing for advanced database systems
Query optimization and processing for advanced database systems
meharikiros2
 
Digital Communication Essentials: DPCM, DM, and ADM .pptx
Digital Communication Essentials: DPCM, DM, and ADM .pptxDigital Communication Essentials: DPCM, DM, and ADM .pptx
Digital Communication Essentials: DPCM, DM, and ADM .pptx
pritamlangde
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Kandungan 087776558899
 
Hospital management system project report.pdf
Hospital management system project report.pdfHospital management system project report.pdf
Hospital management system project report.pdf
Kamal Acharya
 
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
mphochane1998
 

Recently uploaded (20)

8th International Conference on Soft Computing, Mathematics and Control (SMC ...
8th International Conference on Soft Computing, Mathematics and Control (SMC ...8th International Conference on Soft Computing, Mathematics and Control (SMC ...
8th International Conference on Soft Computing, Mathematics and Control (SMC ...
 
Ground Improvement Technique: Earth Reinforcement
Ground Improvement Technique: Earth ReinforcementGround Improvement Technique: Earth Reinforcement
Ground Improvement Technique: Earth Reinforcement
 
Query optimization and processing for advanced database systems
Query optimization and processing for advanced database systemsQuery optimization and processing for advanced database systems
Query optimization and processing for advanced database systems
 
Digital Communication Essentials: DPCM, DM, and ADM .pptx
Digital Communication Essentials: DPCM, DM, and ADM .pptxDigital Communication Essentials: DPCM, DM, and ADM .pptx
Digital Communication Essentials: DPCM, DM, and ADM .pptx
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
 
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptxS1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the start
 
Linux Systems Programming: Inter Process Communication (IPC) using Pipes
Linux Systems Programming: Inter Process Communication (IPC) using PipesLinux Systems Programming: Inter Process Communication (IPC) using Pipes
Linux Systems Programming: Inter Process Communication (IPC) using Pipes
 
Memory Interfacing of 8086 with DMA 8257
Memory Interfacing of 8086 with DMA 8257Memory Interfacing of 8086 with DMA 8257
Memory Interfacing of 8086 with DMA 8257
 
Hostel management system project report..pdf
Hostel management system project report..pdfHostel management system project report..pdf
Hostel management system project report..pdf
 
UNIT 4 PTRP final Convergence in probability.pptx
UNIT 4 PTRP final Convergence in probability.pptxUNIT 4 PTRP final Convergence in probability.pptx
UNIT 4 PTRP final Convergence in probability.pptx
 
Hospital management system project report.pdf
Hospital management system project report.pdfHospital management system project report.pdf
Hospital management system project report.pdf
 
Employee leave management system project.
Employee leave management system project.Employee leave management system project.
Employee leave management system project.
 
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptxHOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
 
Max. shear stress theory-Maximum Shear Stress Theory ​ Maximum Distortional ...
Max. shear stress theory-Maximum Shear Stress Theory ​  Maximum Distortional ...Max. shear stress theory-Maximum Shear Stress Theory ​  Maximum Distortional ...
Max. shear stress theory-Maximum Shear Stress Theory ​ Maximum Distortional ...
 
Augmented Reality (AR) with Augin Software.pptx
Augmented Reality (AR) with Augin Software.pptxAugmented Reality (AR) with Augin Software.pptx
Augmented Reality (AR) with Augin Software.pptx
 
Path loss model, OKUMURA Model, Hata Model
Path loss model, OKUMURA Model, Hata ModelPath loss model, OKUMURA Model, Hata Model
Path loss model, OKUMURA Model, Hata Model
 
Computer Networks Basics of Network Devices
Computer Networks  Basics of Network DevicesComputer Networks  Basics of Network Devices
Computer Networks Basics of Network Devices
 
Online electricity billing project report..pdf
Online electricity billing project report..pdfOnline electricity billing project report..pdf
Online electricity billing project report..pdf
 
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
 

search

  • 1. Introduction to Informatics - Fall 02 Accessing digital information: Search engines I.The problem • How to find the needle in the haystack II. How search engines work • Building the index IV. Types of search engines V. Problems with search engines
  • 2. Introduction to Informatics - Fall 02 The Problem The WWW contains more than 2.5 billion pages with 7.3 million pages added each day The surface Web contains 19 terabytes (trillions of bytes) This where most of our stuff is There are 7,500 terabytes hidden in the "deep" Web This is largely proprietary information, dynamically generated pages, or pages behind firewwalls http://www.howstuffworks.com/news-item127.htm Without the URLs of the particular pages you want, you must rely on search engines to uncover potentially relevant pages
  • 3. Introduction to Informatics - Fall 02 The web lacks bibliographic control standards that we take for granted in the print world There is no equivalent to the ISBN to uniquely identify a document There is no standard system of cataloguing or classification, analogous those developed by the Library of Congress, There is no central catalogue of the Web's holdings Many documents lack the name of the author and the date of publication Updating? Version control? Not likely!
  • 4. Introduction to Informatics - Fall 02 The net is not a digital library It was not designed to support the organized publication and retrieval of information It has become a chaotic repository for the output of the hundreds of thousands of digital publishers It is filled with an amazing variety of digital artifacts The ephemeral mixes everywhere with works of lasting importance The librarian’s classification and selection skills must be complemented by the computer scientist’s ability to automate the task of indexing, storing, and providing access to information Lynch (1999) http://www.sciam.com/0397issue/0397lynch.html
  • 5. Introduction to Informatics - Fall 02 How can we find what we want when we want it? This is becoming an increasingly important problem for people in the information professions Strategies: Follow links from page to page, hoping that you will stumble across the pages that will help you answer the question Maintain a personal collection of bookmarks Use search engines ~85% of all web sessions begin with or involve a search
  • 6. Introduction to Informatics - Fall 02 Search engines are a critical web tool The Web offers the choice of hundreds of different search tools The problem is that each has its own database, command language, search capabilities, and method of displaying results Each covers a different portion of the web with some overlap This means that you have to learn a variety of search tools and develop effective search techniques to take advantage of Web resources
  • 7. Introduction to Informatics - Fall 02 Search engine coverage relative to the estimated size of the publicly indexable web has decreased substantially since December 97 No engine indexes more than about 16% of the web Combined coverage of eleven major search engines is 42% of the Web Overlap between individual search engines is low, with approximately 40% of a given search engine’s content unique The biggest search engines are currently FAST and Google, with over 600 million pages indexed http://healthlinks.washington.edu/hsl/liaisons/stanna/navweb/nav2.html
  • 8. Introduction to Informatics - Fall 02 Search engines are typically more likely to index sites that have more links to them (more 'popular' sites) Google does this Google interprets a link from page A to page B as a vote, by page A, for page B But, Google looks at more than the sheer volume of votes, or links a page receives It also analyzes the page that casts the vote Votes cast by pages that are themselves “important” weigh more heavily and help to make other pages “important” http://www.google.com/technology/index.html
  • 9. Introduction to Informatics - Fall 02 Other difficulties with search engine coverage Search engines are more likely to index US sites than non- US sites (AltaVista is an exception), and more likely to index .com sites than .edu sites Indexing of new or modified pages by just one of the major search engines can take months The pages that one engine indexes do not have extensive overlap with other indexes’ databases Lawrence and Giles (1999) http://www.wwwmetrics.com/
  • 10. Introduction to Informatics - Fall 02 85% of users use search engines to find information (GVU survey) We use search engines to locate and buy goods and research many decisions Search engines are currently lacking in timeliness and comprehensiveness and do not index sites equally The current state of search engines can be compared to a phone book which is Updated irregularly Biased toward listing more popular information Missing many pages Filled with duplicate listings
  • 11. Introduction to Informatics - Fall 02 Search engine indexing and ranking may have economic, social, political, and scientific effects Indexing and ranking of online stores can substantially effect economic viability Some search engines charge for placement and companies are willing to pay Delayed indexing of scientific research can lead to the duplication of work Delayed or biased indexing may affect social or political decisions
  • 12. What do search engines do? They attempt to index and provide access to the “relevant web”. This is defined differently by different engines. It ranges from brute force indexing to the use of algorithms to gauge relevance and popularity. Search engines/tools have four components: the collection of entries for their databases; the structure of their database; the search process; and the interface.
  • 13. Data collection is done by: humans, who review and index in the employ of the search engine company (Yahoo!); humans, by self-submission; and software. Software collection agents include automated robot wanderers, spiders, harvesters, bots, and crawlers. They roam the Internet (mostly WWW, gopher, and FTP sites) and bring back copies of resources. This actually means systematically downloading pages and following links. They sort, index, and create database entries out of them.
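"Systematically downloading pages and following links" amounts to a graph traversal. The sketch below is a minimal breadth-first crawler over a hypothetical in-memory web, so it runs without network access. `TOY_WEB`, `fetch`, and `crawl` are illustrative names, and a real collection agent would add politeness delays, robots.txt handling, and HTML parsing.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "http://example.org/": ("home page", ["http://example.org/a", "http://example.org/b"]),
    "http://example.org/a": ("page a", ["http://example.org/b", "http://example.org/missing"]),
    "http://example.org/b": ("page b", ["http://example.org/"]),
    "http://example.org/island": ("unlinked page", []),
}


def fetch(url):
    """Stand-in for an HTTP download; returns None for a dead link."""
    return TOY_WEB.get(url)


def crawl(seed, max_pages=100):
    """Download pages breadth-first from seed, following every new link."""
    copies = {}          # URL -> copy of the page text brought back
    dead = []            # links that could not be downloaded
    queue = deque([seed])
    seen = {seed}        # never queue the same URL twice
    while queue and len(copies) < max_pages:
        url = queue.popleft()
        result = fetch(url)
        if result is None:
            dead.append(url)
            continue
        text, out_links = result
        copies[url] = text
        for link in out_links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return copies, dead


copies, dead = crawl("http://example.org/")
print(sorted(copies))
print(dead)
```

The page at `http://example.org/island` is never collected because no crawled page links to it, which is one reason link-following robots miss parts of the Web.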
  • 14. The search component concerns the end user. It involves the interface between the human searcher and the indexed database of resources. Several factors determine the success of a search engine: the size of the database; the content and coverage of the database; the currency of the entries and frequency of updating; the elimination of redundancy and dead links; the speed of searching; the availability of advanced search features; and the interface design and ease of use.
  • 15. Search engines provide “electronic egalitarianism”. Indexing and cataloguing tools are highly democratic. They categorize information differently than human indexers do. Machine-based approaches to information gathering, organization, and retrieval provide uniform and equal access to all the information on the Net. This is a source of one of our problems with search engines. We type in a search request and receive thousands of URLs in response. These results frequently contain references to irrelevant Web sites while leaving out others that hold important material.
  • 17. How search engines work: Many search engines use two interdependent approaches: browsing through subject trees and hierarchies, and keyword searching of an extensive database. A subject tree provides a structured and organized hierarchy of categories for browsing for information by subject. Under each category and/or sub-category, links to appropriate Web pages are listed. Web pages are assigned categories either by the author or by subject tree administrators. Many subject trees also have their own keyword-searchable indexes.
  • 18. Search tools with elaborate subject trees present links with brief annotations (examples include Yahoo, Galaxy, and the WWW Virtual Library). Search engines allow keyword searching of indexes. These are automatically compiled by robots and spiders, which are constantly collecting net resources. Searchers enter keywords to query the index. Some allow Boolean operators and other advanced features. Web pages and other Internet resources that satisfy the query are identified and listed.
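Boolean operators map naturally onto set operations over an index. The sketch below assumes a tiny hand-built index that maps each keyword to the set of page IDs containing it. The index contents and the helper names `query_and`, `query_or`, and `query_not` are hypothetical and do not reflect any particular engine's query syntax.

```python
# Hypothetical index: keyword -> set of page IDs that contain it.
INDEX = {
    "beat": {1, 2},
    "ginsberg": {1, 3},
    "metronome": {2},
    "burroughs": {3},
}
ALL_PAGES = {1, 2, 3}


def lookup(term):
    """Pages containing term; an unknown term matches nothing."""
    return INDEX.get(term.lower(), set())


def query_and(*terms):
    """Pages that contain every term."""
    sets = [lookup(t) for t in terms]
    return set.intersection(*sets) if sets else set()


def query_or(*terms):
    """Pages that contain at least one term."""
    return set().union(*(lookup(t) for t in terms))


def query_not(pages, term):
    """Remove pages that contain term from an earlier result."""
    return pages - lookup(term)


print(query_and("beat", "ginsberg"))                # beat AND ginsberg
print(query_or("beat", "burroughs"))                # beat OR burroughs
print(query_not(query_and("beat"), "metronome"))    # beat AND NOT metronome
```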
  • 19. Search engines compete on: the size of their indexes; frequency of updating the index; range of advanced search options; speed of returning a result set; result set presentation; relevance of the items included in a result set; design of the interface; overall ease of use; and range of additional services offered.
  • 20. Claimed size and “obscure search” test results

      Engine           Size   Expected score   Actual score   Rank
      Google            560   1.0              1.0            1
      FAST              340   2.0              1.8            2
      Northern Light    265   3.0              2.3            3
      HotBot            110   4.0              2.3            3
      iWon              110   4.0              2.3            3
      AltaVista         350   2.0              2.5            4
      Yahoo-Google      560   1.0              3.0            5
      Excite            250   3.0              3.0            5
      Yahoo-Inktomi     110   4.0              4.3            6

    Data from: July 2002 Searchengine Showdown http://www.searchenginewatch.com/sereport/00/07-sizetest.html
  • 21. How big are they? [Figure comparing Google, FAST, Alta Vista, Inktomi, and Northern Light] SearchEnginewatch, 7/02 http://www.searchenginewatch.com/reports/sizes.html
  • 22. How big are they? [Figure comparing Google, FAST, Alta Vista, Inktomi, and Northern Light] SearchEnginewatch, 7/02 http://www.searchenginewatch.com/reports/sizes.html
  • 23. Recent activity (indexing) [Figure comparing Google, FAST, Alta Vista, Inktomi, and Northern Light] SearchEnginewatch, 7/02 http://www.searchenginewatch.com/reports/sizes.html
  • 24. Search engines are powered by robots, indexing software, and “ontologists” who classify, sort, and arrange the Web into a searchable matrix. The most popular search engines are always among the most visited sites on the Net. Competition is high for the advertising dollars that keep these search tools free of charge. Despite their similar approaches to scanning the Internet, search engines don't always turn up the same results. Depending on the type of search being conducted, one engine might give you more satisfactory results than another.
  • 25. Three methods for indexing web resources. Full text index: includes all terms and URLs; uses filters to remove words not important to searching. Keyword index: based on the location and frequency of words and phrases; if a term is mentioned only once or twice, it won’t be indexed. Human index: created by individuals who review pages and select the best words and phrases to describe their content.
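The first two indexing methods can be contrasted in a short sketch. The stop-word list, the tokenizer, and the function names below are assumptions for illustration. The keyword index here uses frequency only (a term must appear at least three times, matching "only once or twice, it won’t be indexed") and ignores word location.

```python
import re
from collections import Counter

# Assumed filter list of words "not important to searching".
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def full_text_index(docs):
    """Index every term on every page, minus the stop words."""
    index = {}
    for url, text in docs.items():
        for term in tokenize(text):
            if term not in STOP_WORDS:
                index.setdefault(term, set()).add(url)
    return index


def keyword_index(docs, min_count=3):
    """Index only terms that occur at least min_count times on a page."""
    index = {}
    for url, text in docs.items():
        counts = Counter(t for t in tokenize(text) if t not in STOP_WORDS)
        for term, count in counts.items():
            if count >= min_count:
                index.setdefault(term, set()).add(url)
    return index


# Hypothetical pages.
docs = {
    "u1": "The jazz of the beat: jazz clubs and jazz records",
    "u2": "Beat the eggs and the sugar",
}
print(sorted(full_text_index(docs)))
print(keyword_index(docs))
```

The full text index lists both pages under "beat", while the keyword index keeps only "jazz" for `u1`, the one term that clears the frequency threshold.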
  • 26. Engines use index searches, concept searches, or browsing. Index searching: Many search engines use this method because it casts a wider net than a catalog does. Results come from a dynamic index of pages, and an algorithm sorts the documents by relevance, for instance by the number of times a keyword appears and by its proximity to the top of the document. These engines don’t recognize context, synonyms, or homonyms: searching "beat" returns Ginsberg and Burroughs but also pages on metronomes, raves, and gingersnaps. There are problems of redundancy and dead links.
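A relevance score built from keyword frequency and proximity to the top of the document might look like the sketch below. The scoring formula (occurrence count plus a bonus of up to 1 for an early first occurrence) and the sample pages are assumptions. Real engines weigh such signals in ways the slide does not specify.

```python
def relevance(text, keyword):
    """Score a page by keyword frequency and how early the keyword appears."""
    words = text.lower().split()
    positions = [i for i, w in enumerate(words) if w == keyword.lower()]
    if not positions:
        return 0.0
    frequency = len(positions)
    # First occurrence at the very top earns a bonus of 1.0, shrinking toward 0.
    proximity = 1.0 - positions[0] / len(words)
    return frequency + proximity


def rank(docs, keyword):
    """Return IDs of matching pages, most relevant first."""
    scored = [(relevance(text, keyword), doc_id) for doc_id, text in docs.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True) if score > 0]


# Hypothetical pages that all match or miss the literal word "beat".
docs = {
    "poets": "beat poets ginsberg and burroughs read beat verse",
    "metronome": "a metronome keeps a steady beat",
    "gingersnaps": "beat the butter and sugar for gingersnaps",
    "weather": "rain expected tomorrow",
}
print(rank(docs, "beat"))
```

Because the score sees only the literal word, the gingersnap recipe outranks the metronome page for a reader who wanted Beat poetry, which illustrates the lack of context recognition.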
  • 27. Concept searching: With this type of search, your search term is treated as a concept and not a keyword. If you type a word in the search box, you search for that word, other forms of the word, and synonyms. The search also includes other words that are highly statistically related to that word. A concept search looks for ideas related to a literal query. Excite uses this strategy.
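One way to picture concept searching is as query expansion: the literal term is widened to a set of related terms before matching. The sketch below uses a small hand-written table of related words. The table, the sample pages, and the function names are hypothetical, and this is not a description of how Excite derives statistical relationships.

```python
# Hypothetical table of word forms, synonyms, and related words.
RELATED = {
    "car": {"cars", "automobile", "vehicle"},
}


def expand(term):
    """Treat the term as a concept: the word itself plus related words."""
    return {term} | RELATED.get(term, set())


def concept_search(docs, term):
    """Return IDs of pages that mention the term or any related word."""
    wanted = expand(term.lower())
    return sorted(
        doc_id for doc_id, text in docs.items()
        if wanted & set(text.lower().split())
    )


docs = {
    "d1": "used automobile prices",
    "d2": "car rental",
    "d3": "train timetable",
}
print(concept_search(docs, "car"))
```

A literal keyword match on "car" would find only `d2`; the expanded query also finds `d1`.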
  • 28. Browsing services exist in great numbers on the net. These are systematically grouped hotlists, starting points, or systematic lists of interesting resources. These pages are typically smaller and well-maintained. The browsing structures typically do not use a controlled system of knowledge-structuring or an established classification system. Selection, classification, and description of the resources are made by the list owner using idiosyncratic criteria. Browsing systems covering rapidly changing areas are more difficult to maintain because they often don’t have automatic mechanisms for rapid and continual updating.
  • 30. What are the different types of search engines? Single, niche, and multiple-threaded search engines. Single search engines: These engines operate alone. Your query is run against a single database and/or index. A directory search tool allows searches by subject matter. It is a hierarchical search that starts with a general subject heading and follows with more specific sub-headings. The information is reviewed and indexed by humans. However, the number of reviews is limited. Yahoo is an example.
  • 31. Niche search engines: These engines are like single search engines, except that they cover a restricted subset of resources. Examples might include engines for business, engineering, physics, or government information. A very restricted version of a niche engine only allows you to search its own site. Northern Light is a good example.
  • 32. Multiple-threaded search engines: These are also called meta-search engines. These engines submit your query to two or more search engines simultaneously. They gather and display the results as a single page. These engines compete on the basis of the number and variety of engines they allow you to search. These engines are becoming more popular. One problem is the amount of redundancy in the returns.
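The meta-search pattern (send one query to several engines at once, then merge the returns) can be sketched as follows. The two engine functions are hypothetical stand-ins that return canned URLs instead of contacting a real service, and the merge is a simple first-seen de-duplication, one of many possible ways to handle redundant returns.

```python
from concurrent.futures import ThreadPoolExecutor


def engine_a(query):
    """Hypothetical engine: returns canned results for any query."""
    return ["http://example.org/1", "http://example.org/2"]


def engine_b(query):
    """Hypothetical engine whose results partly overlap engine_a's."""
    return ["http://example.org/2", "http://example.org/3"]


def meta_search(query, engines):
    """Query all engines simultaneously and merge results without duplicates."""
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        # map() keeps the result lists in the same order as the engines.
        result_lists = list(pool.map(lambda engine: engine(query), engines))
    merged, seen = [], set()
    for results in result_lists:
        for url in results:
            if url not in seen:   # drop redundant returns
                seen.add(url)
                merged.append(url)
    return merged


print(meta_search("beat poetry", [engine_a, engine_b]))
```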
  • 34. What are the problems with search engines? There are weaknesses and problems common to all attempts to index the Internet. These are still more important than the limitations of single search services. The theoretical problem of indexing virtual hypertext: It is not economical, and not even possible, to index all information on the Internet in “full text”. It is necessary to define the limit of documents and information units in order to allow a target-oriented access while searching. In comparison with the world of printed information, this involves considerable difficulties.
  • 35. The information units are considerably smaller and less defined. “Containers” like a book, a series, a journal title, or an issue do not occur often. The information units range in size from a whole server or service to single text strings or icons. The mix of different types of information on the net makes uniform and homogeneous indexing and searching impossible. Document types include: directories, lists, menus, full-text of every-day electronic mail, scientific articles and books, field-structured database records, software, audio, video, images, and numerical information.
  • 36. Considering the great number of authors on the net and their different abilities, the quality of input into the search services varies a great deal. Often it is so poor that the search results are seriously degraded. There is incorrect, uncontrolled HTML coding and incomplete use of important content-describing metadata like titles or keywords. Functional text markup is used incorrectly or abused for layout purposes, and, in reverse, layout markup is abused as functional characterization.
  • 37. Other problems include: terminological weaknesses, incorrect formulation of titles and headings, and ambiguity; inability to distinguish between permanent and temporary documents; problems with harvesting methods, indexing programs, IR methods, and user interfaces; and performance problems.
  • 38. Dead links on search engines

      Search engine    % Dead links   % 400 errors
      Alta vista       13.7%          9.3%
      Excite            8.7%          5.7%
      Northern Light    5.7%          2.0%
      Google            4.3%          3.3%
      Hotbot            2.3%          2.0%
      Fast              2.3%          1.8%
      MSN Inktomi       1.7%          1.0%
      Anzwers           1.3%          .07%

    Data from: Aug. 14, 2001 Searchengine Showdown http://www.notess.com/search/stats/dead.shtml