Verbal Subject Analysis III: Webpage Databases AKA - The “Automatic Indexing” Lecture

    1. LS 500 Lecture 10. Verbal Subject Analysis III: Webpage Databases, AKA The “Automatic Indexing” Lecture. Steven L. MacCall, Ph.D., Associate Professor, School of Library and Information Studies, The University of Alabama
    2. Human versus Automatic Indexing
      • Both are related to the subject analysis of information packages.
      • Human indexing is used to describe the subject analysis operations of various periodical databases.
      • Automatic indexing is the term used for the subject analysis operations performed by computers for various webpage databases.
    3. Why Webpage Database?
      • It is always important to know the documentary unit of an informational database.
      • The adjective associated with database is always a clue to the documentary unit.
      • Webpage databases are informational databases in which a webpage is the documentary unit.
      • Also referred to as search engines and discovered databases.
    4. Analysis of Websites and their Structure
      • What are webpages?
      • What is a website?
      • Standards (or lack thereof) for the authoring of web sites and webpages:
        • HTML and other markup languages?
        • Editors?
      • What are the implications of the lack of authoring standards for web-based information packages?
    5. Webpage Database Questions
      • Why do search engines produce different results for the exact same query?
      • What is the principle for ranking the display of search engine records in response to a query?
    6. Webpage Subject Metadata
      • For individual webpages, subject metadata can be created by authors and included in the document header: Meta data profiles.
      • In databases of metadata records, subject metadata can be created by intermediaries using the Dublin Core schema (Example: Worthington Memory).
      • In webpage databases, subject metadata is inferred "automatically" by computer algorithm.
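
As an illustration of author-created subject metadata in a document header, the following minimal sketch reads subject terms from `meta` tags using Python's standard `html.parser` module. The sample page, the class name `MetaSubjectParser`, and the choice of the `keywords` and `dc.subject` names are assumptions made for the example, not something the lecture specifies.

```python
from html.parser import HTMLParser


class MetaSubjectParser(HTMLParser):
    """Collect author-supplied subject terms from <meta> tags (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.subjects = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        # Assumption: subject terms live in "keywords" or "dc.subject" meta tags.
        if name in ("keywords", "dc.subject"):
            content = attributes.get("content") or ""
            self.subjects.extend(
                term.strip() for term in content.split(",") if term.strip()
            )


# Hypothetical webpage used only for this example.
PAGE = """<html><head>
<title>Dachshund Breeding</title>
<meta name="keywords" content="dogs, breeding, dachshund">
</head><body>...</body></html>"""

parser = MetaSubjectParser()
parser.feed(PAGE)
subjects = parser.subjects
print(subjects)  # ['dogs', 'breeding', 'dachshund']
```

Because these terms are supplied by the page author, an indexer that trusts them blindly is open to the manipulation discussed later in the lecture.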
    7. The Term “Search Engine”
      • Has become the common designation for webpage databases.
      • In actuality, however, webpage databases have three parts:
        • Spidering/crawling software to collect webpages
        • Indexing software to build the index of surrogate records
        • Retrieval software to facilitate retrieval of surrogates
    8. Inverted File Structures
      • How surrogate records are physically stored in the index of a database.
      • Each surrogate record has a unique identifier (also called a pointer).
      • Each word and phrase has its own record in the index; that record contains the unique identifier (UI) of every surrogate record that contains the word or phrase:
        • Dog: 235 ; 527 ; 5,345,672 ; 117,127,923
        • Cat: 127 ; 2,753 ; 917,538 ; 327,543,238
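
The inverted file structure described above can be sketched in a few lines. This is a minimal illustration only: the surrogate record texts are invented, a few of the identifiers echo the slide's examples, and real indexes also handle phrases, which this sketch does not.

```python
from collections import defaultdict

# Hypothetical surrogate records: unique identifier -> indexed text.
surrogates = {
    235: "dog training tips",
    127: "cat grooming guide",
    527: "dog and cat nutrition",
}


def build_inverted_index(records):
    """Map each word to the sorted list of record identifiers that contain it."""
    index = defaultdict(set)
    for uid, text in records.items():
        for word in text.lower().split():
            index[word].add(uid)
    return {word: sorted(uids) for word, uids in index.items()}


index = build_inverted_index(surrogates)
print(index["dog"])  # [235, 527]
print(index["cat"])  # [127, 527]
```

Looking up a word is then a single dictionary access that returns the pointers to every matching surrogate record, which is why retrieval does not need to scan the records themselves.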
    9. Automatic Indexing in Context
      • Obtain information package – spidering/crawling.
      • Describe information package in surrogate record – read off webpages by indexing software.
      • Subject analyze information package in surrogate record – indexing software:
        • Verbal – inferred by computer algorithm
        • Classification – inferred by computer algorithm
    10. Obtaining Webpages
      • What really happens when you “surf” the web?
      • What happens when Googlebot or Slurp surfs the web?
      • Steps for spidering/crawling:
        • Computers owned by the search engine retrieve documents by “clicking” on all hyperlinks on each retrieved webpage
        • Determination is made whether a webpage needs to be indexed (because it is new) or reindexed (if it has already been indexed)
        • Determination is made whether reindexing is warranted
        • New webpages and those meeting criteria for reindexing are then placed in the indexing queue
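
The crawling steps above can be sketched against a tiny in-memory "web". Everything here is hypothetical: the URLs, the `WEB` dictionary standing in for real HTTP requests, and the reindexing criterion (the page text has changed since it was last indexed), which is just one plausible rule and not one the lecture names.

```python
from collections import deque
import hashlib

# Hypothetical in-memory "web": URL -> (page text, outgoing hyperlinks).
WEB = {
    "a.example/": ("home page", ["a.example/dogs", "a.example/cats"]),
    "a.example/dogs": ("dachshund breeding, updated", ["a.example/"]),
    "a.example/cats": ("cat care", []),
}


def fingerprint(text):
    """Short stand-in for 'what the page looked like when last indexed'."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def crawl(seed, known_fingerprints):
    """Follow every hyperlink from the seed and build the indexing queue."""
    indexing_queue = []
    seen = {seed}
    frontier = deque([seed])
    while frontier:
        url = frontier.popleft()
        text, links = WEB[url]
        if url not in known_fingerprints:
            indexing_queue.append((url, "new"))
        elif known_fingerprints[url] != fingerprint(text):
            # Assumed criterion: reindex only when the content has changed.
            indexing_queue.append((url, "reindex"))
        for link in links:
            if link in WEB and link not in seen:
                seen.add(link)
                frontier.append(link)
    return indexing_queue


# Pages the hypothetical engine has already indexed.
known = {
    "a.example/": fingerprint("home page"),
    "a.example/dogs": fingerprint("dachshund breeding"),
}
queue = crawl("a.example/", known)
print(queue)  # [('a.example/dogs', 'reindex'), ('a.example/cats', 'new')]
```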
    11. Describing Webpages
      • Left side elements must be inferred by searcher:
        • Examine structure of retrieved records
        • Examine advanced search interface
        • Element sets are not standard, i.e., they will vary across search engines
      • Right side content:
        • What is the source for content?
        • Authority control?
    12. Subject Indexing of Webpage Databases
      • The subject fields of webpage surrogate records include the words that describe what the webpage is about.
      • Right side subject content is inferred through the application of proprietary algorithms.
      • Subject terms added to surrogate records are weighted:
        • Doc #1: SU = dogs (.99); breeding (.87); dachshund (.30)
        • Doc #2: SU = cats (.92); dogs (.44); dachshund (.03)
        • The weights are computed by proprietary algorithm
    13. How are Subject Weights Calculated?
      • Conventional methods (dating from the 1950s) for automatically inferring what a document is about include the following three techniques:
        • Frequency of word occurrences
        • Location of word occurrences
        • Size of word occurrences
      • In the web era, however, these techniques did not scale well to meet the needs of databases containing billions of records:
        • Could facilitate retrieval of relevant documents, but could not distinguish between “good” and “bad” documents
        • Were also subject to manipulation by authors desiring higher search engine retrieval (spamming)
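
To make the three conventional techniques concrete, here is a minimal sketch that combines frequency, location, and size into a single weight per word. The multipliers, the 12-point baseline, and the normalization to the top-scoring word are all assumptions chosen for illustration; actual search engine formulas are proprietary.

```python
from collections import Counter

# Assumed location multipliers; real values are not public.
LOCATION_BOOST = {"title": 3.0, "heading": 2.0, "body": 1.0}


def term_scores(segments):
    """segments: list of (text, location, font size in points)."""
    scores = Counter()
    for text, location, font_size in segments:
        size_boost = font_size / 12.0  # assumption: 12pt counts as normal size
        for word in text.lower().split():
            # Each occurrence adds to the score (frequency), scaled by
            # where it appears (location) and how large it is (size).
            scores[word] += LOCATION_BOOST[location] * size_boost
    if not scores:
        return {}
    top = max(scores.values())
    return {word: round(score / top, 2) for word, score in scores.items()}


# Hypothetical webpage broken into text segments.
page = [
    ("Dachshund breeding", "title", 24),
    ("breeding dogs", "heading", 18),
    ("dogs dogs dogs", "body", 12),
]
weights = term_scores(page)
print(weights)  # {'dachshund': 0.67, 'breeding': 1.0, 'dogs': 0.67}
```

The sketch also shows why these techniques invite spamming: an author can raise the weight of "dogs" simply by repeating the word or moving it into the title.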
    14. Retrieval from Webpage Databases
      • Unlike bibliographic databases, in which the ordering of retrieved surrogate records is reverse chronological, webpage databases use a relevance-based ranking.
      • The search engine component of a webpage database takes the entered query and compares it to the terms in the index.
      • The documents that are retrieved first are those that have a higher “relevance” score:
        • Doc #1: SU = dogs (.99); breeding (.87); dachshund (.30)
        • Doc #2: SU = cats (.92); dogs (.44); dachshund (.03)
        • “dog” query would rank document #1 ahead of document #2
        • “breeding” query would rank document #1 ahead of document #2
        • “cats” query would rank document #2 ahead of document #1
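
The ranking behaviour in the example above can be reproduced with a short sketch. The weights are copied from the slide; the function name `rank` and the decision to treat a missing term as weight 0.0 are assumptions. The query uses the indexed form "dogs", since mapping "dog" to "dogs" would need stemming, which is not shown.

```python
# Weighted subject fields copied from the slide's two example documents.
surrogates = {
    "Doc #1": {"dogs": 0.99, "breeding": 0.87, "dachshund": 0.30},
    "Doc #2": {"cats": 0.92, "dogs": 0.44, "dachshund": 0.03},
}


def rank(query_term, records):
    """Order documents by the weight of the query term, highest first."""
    # Assumption: a term missing from a record counts as weight 0.0.
    scored = [(doc, terms.get(query_term, 0.0)) for doc, terms in records.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _weight in scored]


print(rank("dogs", surrogates))      # ['Doc #1', 'Doc #2']
print(rank("breeding", surrogates))  # ['Doc #1', 'Doc #2']
print(rank("cats", surrogates))      # ['Doc #2', 'Doc #1']
```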
    15. Two Responses to Search Engine Failure
      • Yahoo! era (late 1990’s):
        • Human indexing (website directories)
        • More discussion during lectures on classification
      • Google era (since 1999):
        • Additional criteria introduced to infer aboutness, e.g.:
          • $ – paid submissions, such as Alta Vista
          • Quality – Pagerank algorithm of Google
    16. More on Google Breakthrough
      • Issue addressed by Google concerns the quality problem: How to cause the “best” documents to rise to the top of a set of retrieved webpages.
      • Solution concerns identifying additional criteria to include in the subject weighting algorithm.
      • Google maintains additional metadata elements for each surrogate record in its index of webpages:
        • How many other webpages link to a given webpage?
        • Who are these linkers?
    17. Google’s Quality Approach
      • How many other webpages link to a given webpage?
        • The more webpages (i.e., linkers) a dachshund webpage has pointing to it, the higher its quality is judged to be.
        • This factors into the weight assigned to the “dachshund” descriptor in the subject field of its surrogate record
      • Who are these linkers?
        • Those linkers that have a higher quality rank are given more weight than those linkers with a lower quality rank
      • Google Factory Tour (5/19/05):
        • Go to http://tinyurl.com/yapvve
        • Begin with slide #32.
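
The two questions above (how many linkers, and how highly ranked those linkers are) can be sketched as a simplified link-analysis calculation in the spirit of the published PageRank idea. This is not Google's production algorithm, which is proprietary; the page names, the damping value of 0.85, and the fixed iteration count are assumptions for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: page -> list of pages it links to. Returns page -> quality score."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A linker passes its own score on, split among its links,
                # so highly ranked linkers count for more.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: three pages point at the dachshund club page.
links = {
    "hub": ["dachshund_club", "kennel"],
    "blog": ["dachshund_club"],
    "kennel": ["dachshund_club"],
    "dachshund_club": ["hub"],
}
ranks = pagerank(links)
best = max(ranks, key=ranks.get)
print(best)  # dachshund_club
```

In this toy graph "kennel" ends up with a higher score than "blog" even though each has at most one linker, because the page linking to "kennel" is itself highly ranked.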
    18. Query Assistance
      • Also called query expansion.
      • Though controlled vocabularies are not implemented, some webpage databases do provide query assistance:
        • Spell checking: “MacCall” in Google
        • Query expansion: “dachshunds” in ask.com
      • Important to note that these services are automatically generated by computer algorithm, thus subject to incompleteness or error!
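
A minimal sketch of algorithmic spell checking, using the standard library's `difflib.get_close_matches`. The vocabulary list, the `suggest` helper, and the 0.7 similarity cutoff are assumptions for illustration; a real webpage database would draw candidates from its own index and query logs.

```python
import difflib

# Hypothetical vocabulary standing in for the terms in an engine's index.
VOCABULARY = ["dachshund", "dachshunds", "breeding", "indexing", "retrieval"]


def suggest(query_word, vocabulary=VOCABULARY):
    """Return a 'did you mean' suggestion, or None if no suggestion applies."""
    word = query_word.lower()
    if word in vocabulary:
        return None  # already a known term
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=0.7)
    return matches[0] if matches else None


print(suggest("dachsund"))  # dachshund
print(suggest("breeding"))  # None
```

Because the suggestion comes from string similarity alone, a correctly spelled but unfamiliar word (a surname, for example) can be "corrected" wrongly, which is the kind of error the last bullet warns about.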
    19. Alternative Displays of Retrieved Records
      • Zapmeta provides:
        • Sort options for retrieved set
        • Amazon entries
        • Archive.org link
        • Preview option
      • SearchMash (from Google).
      • Kartoo, Mooter and WebBrain provide a completely different retrieval environment!
    20. General Tips for Webpage Databases
      • Single database or multiple databases searched?
        • Google as example of searching a single database
        • Metacrawler as example of searching multiple databases
      • Search engine sizes (SearchEngineWatch.com)
      • Searches per day (SearchEngineWatch.com)
    21. General Tips for Webpage Databases
      • As is the case with bibliographic databases, webpage database creators will license their content to be used under different interfaces:
        • Search Engine Relationship Chart (Search This)
        • Search Providers Chart (SearchEngineWatch.com)
        • Search Engine Ratings (SearchEngineWatch.com)
      • Search tips:
        • Search engine math (SearchEngineWatch.com)
        • Search engine features chart (SearchEngineWatch.com)
        • Finding information: search engines (P. Bradley)

    This is the tenth lecture in a series presented to more

    More info about this document

    © All Rights Reserved

    Go to text version

    • Total Views 508
      • 507 on SlideShare
      • 1 from embeds
    • Comments 0
    • Favorites 0
    • Downloads 8
    Most viewed embeds
    • 1 views on http://localhost

    more

    All embeds
    • 1 views on http://localhost

    less

    Flagged as inappropriate Flag as inappropriate
    Flag as inappropriate

    Select your reason for flagging this presentation as inappropriate. If needed, use the feedback form to let us know more details.

    Cancel
    File a copyright complaint
    Having problems? Go to our helpdesk?

    Categories