Search Engines


Published in: Technology, Design
  • © Tefko Saracevic Principles of Searching
  • Search Engines

    1. 1. part 1: search engines © Chidanand Byahatti
    2. 2. dictionary definitions <ul><li>search </li></ul><ul><ul><li>COMPUTING ( transitive verb ) to examine a computer file, disk, database, or network for particular information </li></ul></ul><ul><li>engine </li></ul><ul><ul><li>something that supplies the driving force or energy to a movement, system, or trend </li></ul></ul><ul><li>search engine </li></ul><ul><ul><li>a computer program that searches for particular keywords and returns a list of documents in which they were found, especially a commercial service that scans documents on the Internet </li></ul></ul>
    3. 3. about definition of search engines <ul><li>oh well … </li></ul><ul><li>search engines do not search only for keywords, some search for other stuff as well </li></ul><ul><li>and they are really not “engines” in the classical sense </li></ul><ul><ul><li>but then mouse is not a “mouse” </li></ul></ul>
    4. 4. use of search engines … among others
    5. 5. How Search Engines Work (Sherman 2003) [Figure: Your Browser, The Web (URL1 to URL4), Crawler, Indexer, Search Engine Database; sample query “Eggs?” with ranked matches Eggs 90%, Eggo 81%, Ego 40%, Huh? 10%; sample page “All About Eggs” by S. I. Am]
    6. 6. how do search engines work? elaboration <ul><li>crawlers, spiders : go out to find content </li></ul><ul><ul><li>in various ways go through the web looking for new & changed sites </li></ul></ul><ul><ul><li>periodic, not for each query </li></ul></ul><ul><ul><ul><li>no search engine works in real time </li></ul></ul></ul><ul><ul><li>some search engines do it for themselves, others not </li></ul></ul><ul><ul><ul><li>buy content from companies such as Inktomi </li></ul></ul></ul><ul><ul><li>for a number of reasons crawlers do not cover all of the web – just a fraction </li></ul></ul><ul><ul><li>what is not covered is “invisible web” </li></ul></ul>
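The crawling step described on this slide can be sketched with a toy, in-memory "web". This is only an illustration under stated assumptions: the page names, the `TOY_WEB` link table, and the `crawl` function are hypothetical, and a real crawler fetches pages over the network and revisits them periodically. The sketch shows why a page that nothing links to stays in the "invisible web".

```python
from collections import deque

# Hypothetical in-memory "web": page URL -> list of outgoing links.
TOY_WEB = {
    "url1": ["url2", "url3"],
    "url2": ["url3"],
    "url3": ["url1", "url4"],
    "url4": [],
    "hidden-db-page": ["url1"],  # no page links here, so the crawl never reaches it
}

def crawl(web, seeds, max_pages=100):
    """Breadth-first crawl from the seed URLs; returns pages in visit order."""
    seen = set(seeds)
    queue = deque(seeds)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in web.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

crawled = crawl(TOY_WEB, ["url1"])
# Pages that exist but were never reached: the toy "invisible web".
invisible = sorted(set(TOY_WEB) - set(crawled))
print(crawled)
print(invisible)
```

The `max_pages` limit stands in for the practical reasons (time, cost, policy) that keep crawlers from covering everything.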
    7. 7. elaboration … <ul><li>organizing content: labeling, arranging </li></ul><ul><ul><li>indexing for searching – automatic </li></ul></ul><ul><ul><ul><li>keywords and other fields </li></ul></ul></ul><ul><ul><ul><li>arranging by URL popularity – PageRank, as in Google </li></ul></ul></ul><ul><ul><li>classifying as directory </li></ul></ul><ul><ul><ul><li>mostly human handpicked & classified </li></ul></ul></ul><ul><li>as a result of different organization we have basically two kinds of search engines: </li></ul><ul><ul><ul><li>search – input is a query that is then searched & displayed </li></ul></ul></ul><ul><ul><ul><li>directory – classified content – a class is displayed </li></ul></ul></ul><ul><ul><ul><ul><li>and fused: directories have now also search capabilities & vice versa </li></ul></ul></ul></ul>
    8. 8. elaboration (cont.) <ul><li>databases, caches: storing content </li></ul><ul><ul><li>humongous files usually distributed over many computers </li></ul></ul><ul><li>query processor: searching, retrieval, display </li></ul><ul><ul><li>takes your query as input </li></ul></ul><ul><ul><ul><li>engines have differing rules for how it is handled </li></ul></ul></ul><ul><ul><li>displays ranked output </li></ul></ul><ul><ul><ul><li>some engines also cluster output and provide visualization </li></ul></ul></ul><ul><li>at the other end is your browser </li></ul>
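The indexer, database, and query processor described on these slides can be sketched as an inverted index with a simple ranked lookup. This is a minimal sketch, not how any particular engine works: the sample documents, the function names `build_index` and `search`, and the scoring rule (raw keyword counts) are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical documents standing in for crawled pages.
DOCS = {
    "doc1": "all about eggs and more eggs",
    "doc2": "eggo waffles for breakfast",
    "doc3": "how to boil eggs",
}

def build_index(docs):
    """Indexer: map each keyword to the documents containing it, with counts."""
    index = defaultdict(Counter)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word][doc_id] += 1
    return index

def search(index, query):
    """Query processor: score documents by keyword counts, highest first."""
    scores = Counter()
    for word in query.lower().split():
        for doc_id, count in index.get(word, {}).items():
            scores[doc_id] += count
    # Sort by descending score, then by document id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

index = build_index(DOCS)
results = search(index, "eggs")
print(results)
```

The index is built once, ahead of time, which matches the point that engines do not search the web itself for each query.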
    9. 9. elaboration… similarities, differences <ul><li>all search engines have these basic parts in common </li></ul><ul><li>BUT the actual processes – methods how they do it – are based on various algorithms & they differ </li></ul><ul><ul><li>most are proprietary with details kept mostly secret but based on well known principles from information retrieval or classification </li></ul></ul><ul><ul><li>to some extent Google is an exception – they published their method </li></ul></ul>
    10. 10. case of Google <ul><li>developed by Sergey Brin and Lawrence Page while students at Stanford </li></ul><ul><ul><li>in the beginning ran on Stanford computers </li></ul></ul><ul><li>basic approach has been described in their famous paper “The Anatomy of a Large-Scale Hypertextual Web Search Engine” </li></ul><ul><ul><li>well written, simple language, has their pictures </li></ul></ul><ul><ul><li>in the acknowledgements they cite the support of NSF’s Digital Library Initiative, i.e. initially, Google came out of government-sponsored research </li></ul></ul><ul><ul><li>they describe their method, PageRank, based on ranking hyperlinks as in citation indexing </li></ul></ul><ul><ul><li>“We chose our system name, Google, because it is a common spelling of googol, or 10^100” </li></ul></ul>
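The idea of ranking pages by their incoming links can be sketched as a short iteration. The update rule used here, PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), is the per-page formula given in the paper named above; everything else is an assumption for illustration: the three-page link graph, the function name `pagerank`, the damping value 0.85, and the fixed iteration count. A production system would differ in many ways.

```python
# Hypothetical link graph: page -> pages it links to.
# Assumes every page has at least one outgoing link.
LINKS = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

def pagerank(links, damping=0.85, iterations=100):
    """Repeatedly apply PR(p) = (1 - d) + d * sum(PR(src) / outdegree(src))."""
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        new_ranks = {}
        for page in links:
            # Each page that links here passes on an equal share of its rank.
            incoming = sum(
                ranks[src] / len(targets)
                for src, targets in links.items()
                if page in targets
            )
            new_ranks[page] = (1 - damping) + damping * incoming
        ranks = new_ranks
    return ranks

ranks = pagerank(LINKS)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this toy graph page C ends up ranked highest because both other pages link to it, which mirrors the citation-indexing analogy on the slide.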
    11. 11. coverage differences <ul><li>no engine covers more than a fraction of WWW </li></ul><ul><ul><li>estimates: none more than 16% </li></ul></ul><ul><ul><li>hard (even impossible) to discern & compare coverage, but they differ substantially in what they cover </li></ul></ul><ul><li>in addition: </li></ul><ul><ul><li>many national search engines </li></ul></ul><ul><ul><ul><li>own coverage, orientation, governance </li></ul></ul></ul><ul><ul><li>many specialized or domain search engines </li></ul></ul><ul><ul><ul><li>own coverage geared to subject of interest </li></ul></ul></ul><ul><ul><li>many comprehensive sources independent of search engines </li></ul></ul><ul><ul><ul><li>some have compilations of evaluated web sources </li></ul></ul></ul>
    12. 12. searching differences <ul><li>substantial differences among search engines on searching, retrieval, display </li></ul><ul><ul><li>need to know how they work & differ in respect to </li></ul></ul><ul><ul><ul><li>defaults in searching a query </li></ul></ul></ul><ul><ul><ul><li>searching of phrases, case sensitivity, categories </li></ul></ul></ul><ul><ul><ul><li>searching of different fields, formats, types of resources </li></ul></ul></ul><ul><ul><ul><li>advanced search capabilities and features </li></ul></ul></ul><ul><ul><ul><li>possibilities for refinement, using relevance feedback </li></ul></ul></ul><ul><ul><ul><li>display options </li></ul></ul></ul><ul><ul><ul><li>personalization options </li></ul></ul></ul>
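Why defaults matter can be shown with a small sketch: the same query returns different documents depending on whether an engine combines terms with AND or OR by default, and whether it is case sensitive. The documents, the `match` function, and its `default_op` and `case_sensitive` parameters are hypothetical, chosen only to illustrate the point; they do not describe any real engine's settings.

```python
# Hypothetical document collection.
DOCS = {
    "doc1": "Digital Library research",
    "doc2": "public library hours",
    "doc3": "digital camera reviews",
}

def match(docs, query, default_op="AND", case_sensitive=False):
    """Return ids of documents matching the query under the given defaults."""
    def norm(text):
        return text if case_sensitive else text.lower()

    terms = norm(query).split()
    # AND requires every term to be present; OR requires at least one.
    combine = all if default_op == "AND" else any
    return sorted(
        doc_id
        for doc_id, text in docs.items()
        if combine(term in norm(text).split() for term in terms)
    )

print(match(DOCS, "digital library"))                       # AND default
print(match(DOCS, "digital library", default_op="OR"))      # OR default
print(match(DOCS, "digital library", case_sensitive=True))  # case-sensitive AND
```

The three calls use the same query text, so any difference in output comes only from the engine's defaults.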
    13. 13. business model differences <ul><li>several business models </li></ul><ul><li>public good - have independent budget </li></ul><ul><ul><li>e.g. PubMed , Librarians’ Index to Internet </li></ul></ul><ul><li>earn revenue from provision of information </li></ul><ul><ul><li>all commercial search engines </li></ul></ul><ul><li>using search engines to promote their other activities </li></ul><ul><ul><li>e.g. telephone directories </li></ul></ul>
    14. 14. sponsorship differences <ul><li>need to understand treatment of sponsorship – it influences what engines search & how they display results </li></ul><ul><ul><li>some list separately results from sponsored sites so you are reasonably clear what is there because it is sponsored & what is not </li></ul></ul><ul><ul><li>some have display-per-pay - showing first sites that paid most & do not even tell you that </li></ul></ul><ul><ul><li>some have pay per update of sites </li></ul></ul><ul><li>imperative to find sources that explain these models for different engines to know what is covered & what you are getting </li></ul>
    15. 15. limitations <ul><li>every search engine has limitations as to </li></ul><ul><ul><li>coverage </li></ul></ul><ul><ul><ul><li>meta engines just follow coverage limitations & have more of their own </li></ul></ul></ul><ul><ul><li>search capabilities </li></ul></ul><ul><ul><li>finding quality information </li></ul></ul><ul><li>some have compromised search with economics </li></ul><ul><ul><li>becoming little more than advertisers </li></ul></ul><ul><li>but search engines are also many times victims of spamdexing </li></ul><ul><ul><li>affecting what is included and how it is ranked </li></ul></ul>
    16. 16. spamming a search engine <ul><li>use of techniques that push rankings higher than they belong is also called spamdexing </li></ul><ul><ul><li>methods typically include textual as well as link-based techniques </li></ul></ul><ul><ul><li>like e-mail spam, search engine spam is a form of adversarial information retrieval </li></ul></ul><ul><ul><ul><li>the conflicting goals: accurate results for search providers vs. high positioning for content providers </li></ul></ul></ul>
    17. 17. meta search engines <ul><li>meta engines search multiple engines </li></ul><ul><ul><li>getting combined results from a variety of engines </li></ul></ul><ul><li>do not have their own databases </li></ul><ul><ul><li>but have their own business models affecting results </li></ul></ul><ul><li>a number of techniques used </li></ul><ul><ul><li>interesting ones: clustering, statistical analyses </li></ul></ul>
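How a meta engine might combine results can be sketched as follows. The slide does not say which merging method any meta engine uses, so this sketch swaps in reciprocal rank fusion, a common rank-aggregation technique, purely as an example. The engine names, result lists, the `merge` function, and the constant `k=60` are all assumptions.

```python
from collections import defaultdict

# Hypothetical ranked result lists returned by three underlying engines.
ENGINE_RESULTS = {
    "engine_a": ["url1", "url2", "url3"],
    "engine_b": ["url2", "url4"],
    "engine_c": ["url2", "url1"],
}

def merge(results_by_engine, k=60):
    """Reciprocal rank fusion: each engine adds 1 / (k + rank) to a URL's score."""
    scores = defaultdict(float)
    sources = defaultdict(list)
    for engine, urls in results_by_engine.items():
        for rank, url in enumerate(urls, start=1):
            scores[url] += 1.0 / (k + rank)
            sources[url].append(engine)  # remember which engines returned it
    ranked = sorted(scores, key=lambda url: (-scores[url], url))
    return [(url, sources[url]) for url in ranked]

merged = merge(ENGINE_RESULTS)
for url, engines in merged:
    print(url, engines)
```

Keeping the list of source engines next to each URL makes overlap between engines visible, and it also shows the limitation noted earlier: the merged list can only contain what the underlying engines already cover.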
    18. 18. how to find a search engine? <ul><li>variety of resources that list or categorize engines </li></ul><ul><li> </li></ul><ul><ul><li>search for engines by topic, geography, reference </li></ul></ul><ul><ul><li>Search Engine Guide </li></ul></ul><ul><ul><li>engines categorized by topic; other engine information </li></ul></ul><ul><ul><li>Search Engine Colossus </li></ul></ul><ul><ul><li>international directory of search engines by country , topic from 198 countries and 61 territories; engines in choice of languages </li></ul></ul><ul><ul><li>Phil Bradley’s country based search engines </li></ul></ul><ul><ul><ul><li>over 2000 search engines from countries all over the globe </li></ul></ul></ul>
    19. 19. sample of meta engines - with organized results <ul><ul><li>Dogpile </li></ul></ul><ul><ul><ul><ul><li>results from a number of leading search engines; gives source, so overlap can be compared; (has also a (bad) joke of the day) </li></ul></ul></ul></ul><ul><ul><li>Surfwax </li></ul></ul><ul><ul><ul><ul><li>gives statistics and text sources & linking to sources; for some terms gives related terms to focus </li></ul></ul></ul></ul><ul><ul><li>Teoma </li></ul></ul><ul><ul><ul><ul><li>results with suggestions for narrowing; links resources derived; originated at Rutgers </li></ul></ul></ul></ul><ul><ul><li>Turbo10 </li></ul></ul><ul><ul><ul><ul><li>provides results in clusters; engines searched can be edited </li></ul></ul></ul></ul>
    20. 20. meta search engines (cont.) <ul><li>Large directory </li></ul><ul><ul><li>Complete Planet </li></ul></ul><ul><ul><ul><li>directory of over 70,000 databases & specialty engines </li></ul></ul></ul><ul><li>Results with graphical displays </li></ul><ul><ul><li>Vivisimo </li></ul></ul><ul><ul><ul><li>clusters results; innovative </li></ul></ul></ul><ul><ul><li>Webbrain </li></ul></ul><ul><ul><ul><li>results in tree structure – fun to use </li></ul></ul></ul><ul><ul><ul><li>Kartoo </li></ul></ul></ul><ul><ul><ul><ul><li>results in display by topics of query </li></ul></ul></ul></ul>
    21. 21. domain engines & catalogs <ul><li>cover specific subjects & topics </li></ul><ul><li>important tool for subject searches </li></ul><ul><ul><li>particularly for subject specialist </li></ul></ul><ul><ul><li>valued by professional searchers </li></ul></ul><ul><li>selection mostly hand-picked rather than by crawlers, following inclusion criteria </li></ul><ul><ul><li>often not readily discernable </li></ul></ul><ul><ul><li>but content more trustworthy </li></ul></ul>
    22. 22. domain engines … sample <ul><ul><li>Open Directory Project </li></ul></ul><ul><ul><ul><li>large edited catalog of the web – global, run by volunteers </li></ul></ul></ul><ul><ul><li>BUBL LINK </li></ul></ul><ul><ul><ul><li>selected Internet resources covering all academic subject areas; organized by Dewey Decimal System – from UK </li></ul></ul></ul><ul><ul><li>Profusion </li></ul></ul><ul><ul><ul><li>search in categories for resources & search engines </li></ul></ul></ul><ul><ul><li>Resource Discovery Network – UK </li></ul></ul><ul><ul><ul><ul><li>“ UK's free national gateway to Internet resources for the learning, teaching and research community” </li></ul></ul></ul></ul>
    23. 23. domain engines … sample <ul><li>Think Quest – Oracle Education Foundation </li></ul><ul><li>education resources, programs; web sites created by students </li></ul><ul><li>All Music Guide </li></ul><ul><li>resource about musicians, albums, and songs </li></ul><ul><li>Internet Movie Database </li></ul><ul><li>treasure trove of American and British movies </li></ul><ul><li>Genealogy links and surname search engines </li></ul><ul><ul><ul><li>well.. that is getting really specialized (and popular) </li></ul></ul></ul><ul><li>Daypop </li></ul><ul><ul><ul><li>searches the “living web” “The living web is composed of sites that update on a daily basis: newspapers, online magazines, and weblogs” </li></ul></ul></ul>
    24. 24. science, scholarship engines …sample free access <ul><ul><li>Psychcrawler - Amer Psychological Association </li></ul></ul><ul><ul><ul><li>web index for psychology </li></ul></ul></ul><ul><ul><li>Entrez PubMed – Nat Library of Medicine </li></ul></ul><ul><ul><ul><ul><li>biomedical literature from MEDLINE & health journals </li></ul></ul></ul></ul><ul><ul><li>CiteSeer - NEC Research Center </li></ul></ul><ul><ul><ul><li>scientific literature, citations index; strong in computer science </li></ul></ul></ul><ul><ul><ul><li>Scholar Google </li></ul></ul></ul><ul><ul><ul><ul><li>searches for scholarly articles & resources </li></ul></ul></ul></ul><ul><ul><ul><li>Infomine </li></ul></ul></ul><ul><ul><ul><ul><li>scholarly internet research collections </li></ul></ul></ul></ul><ul><ul><ul><li>Scirus </li></ul></ul></ul><ul><ul><ul><ul><li>scientific information in journals & on the web </li></ul></ul></ul></ul>
    25. 25. science, scholarship engines …sample commercial access <ul><li>in addition to freely accessible engines, many provide free search but paid access to full text </li></ul><ul><ul><li>by subscription or per item </li></ul></ul><ul><ul><li>RUL provides access to these & many more: </li></ul></ul><ul><li>ScienceDirect </li></ul><ul><li>Elsevier: “world's largest electronic collection of science, technology and medicine full text and bibliographic information” </li></ul><ul><li>ACM Portal </li></ul><ul><li>Assoc. for Computing Machinery: access to ACM Digital Library & Guide to Computing </li></ul>
    26. 26. where to find out? <ul><li>information about search engines in sources that have updates, news, tips for searching and more – a MUST for searchers : </li></ul><ul><ul><li>Search Engine Watch </li></ul></ul><ul><ul><ul><li>ratings, news, statistics, charts, explanations, tutorials </li></ul></ul></ul><ul><ul><li>Search Engine Showdown </li></ul></ul><ul><ul><ul><li>“The users’ guide to web searching” - run by a librarian, news links, ratings </li></ul></ul></ul><ul><ul><li>Virtual Chase </li></ul></ul><ul><ul><ul><ul><li>a site about “Teaching Legal Professionals How To Do Research”; this section has very good tips and links for consideration of quality on the web </li></ul></ul></ul></ul>
    27. 27. where? …. <ul><ul><li>SiteLines </li></ul></ul><ul><ul><ul><li>a blog, written by Rita Vine, a professional librarian, & web search trainer; many evaluations in archive </li></ul></ul></ul><ul><ul><li>ResourceShelf </li></ul></ul><ul><ul><ul><li>“ Resources and News for Information Professionals,” edited by Gary Price, a librarian & author of Invisible Web – has extensive archive </li></ul></ul></ul><ul><ul><li>WebsearchAbout </li></ul></ul><ul><ul><ul><li>not evaluative, but provides news, capabilities, sources, articles about web searching </li></ul></ul></ul>
    28. 28. art of searching search engines
    29. 29. part 2: digital libraries
    30. 30. definition <ul><li>digital libraries are viewed from several perspectives </li></ul><ul><ul><li>technical : “Digital library is a managed collection of information, with associated services, where information is stored in digital format and accessible over a network.” (Arms, 2000) </li></ul></ul><ul><ul><li>institutional: “Digital libraries are organizations that provide the resources, including the specialized staff, to select, structure, offer intellectual access to, interpret, distribute, preserve the integrity of, and ensure the persistence over time of collections of digital works so that they are readily and economically available for use by a defined community or set of communities.” (Waters, 1998) </li></ul></ul>
    31. 31. a bit of context <ul><li>short but volatile history </li></ul><ul><ul><li>research & development took off by the early to mid 1990s </li></ul></ul><ul><ul><li>in the next decade phenomenal growth worldwide </li></ul></ul><ul><ul><li>large investment in research & building </li></ul></ul><ul><li>number of communities involved </li></ul><ul><ul><li>computer science, primarily in research </li></ul></ul><ul><ul><li>many subjects: digital libraries in their domain </li></ul></ul><ul><ul><li>library & information science: operations, studies of users, use, usability </li></ul></ul><ul><li>number of types emerged </li></ul>
    32. 32. libraries & digital resources <ul><li>libraries (particularly research, academic & special) directed massive funding toward such resources </li></ul><ul><ul><li>electronic journals </li></ul></ul><ul><ul><li>databases </li></ul></ul><ul><ul><li>catalogs </li></ul></ul><ul><ul><li>digitization of parts of collection </li></ul></ul><ul><li>thus becoming in effect digital libraries – or more accurately hybrid libraries </li></ul><ul><ul><li>with graphic and digital versions or types of resources </li></ul></ul>
    33. 33. emphasis here <ul><li>on large academic or research digital libraries that also are related to searching </li></ul><ul><ul><li>provide search capabilities or access to search engines </li></ul></ul><ul><ul><li>provide electronic journals that give the full text of articles after a search </li></ul></ul><ul><li>such libraries have also become search portals of sorts, essential for their users </li></ul><ul><ul><li>in education, research & related activities </li></ul></ul>
    34. 34. sample <ul><li>New York Public Library Digital </li></ul><ul><li>“NYPL Digital is your gateway to The New York Public Library’s rare and unique collections in digitized form.” Includes access to searchable databases </li></ul><ul><li>U California Berkeley Digital Library SUNsite </li></ul><ul><ul><li>“builds digital collections and services while providing information and support to digital library developers worldwide.” </li></ul></ul><ul><li>The British Library </li></ul><ul><ul><li>“The world’s knowledge.” Includes “Services for library and information Professionals.” </li></ul></ul><ul><li>Los Angeles Public Library Kids’ Path </li></ul><ul><ul><li>resources for children; search through directory </li></ul></ul>
    35. 35. sample … <ul><li>New Zealand Digital Library </li></ul><ul><ul><li>searching of a number of digital collections, including the Humanity Development Library </li></ul></ul><ul><li>Research Libraries Group </li></ul><ul><ul><li>“RLG is a not-for-profit organization of over 150 research libraries, archives, museums, and other cultural memory institutions.” Includes links to a number of searchable collections </li></ul></ul><ul><li>Public Library of Science </li></ul><ul><ul><li>“PLoS is a nonprofit organization of scientists and physicians committed to making the world's scientific and medical literature a public resource.” Publishes open access journals </li></ul></ul>
    36. 36. Rutgers libraries – digital components <ul><li>strategic planning in developing digital access </li></ul><ul><li>rich & complex content of digital resources </li></ul><ul><ul><li>several hundred indexes & databases for searching </li></ul></ul><ul><ul><li>some 20,000 electronic journals </li></ul></ul><ul><ul><li>thousand & more digital reference sources </li></ul></ul><ul><ul><li>subject research guides </li></ul></ul><ul><ul><li>Searchpath & other tutorials </li></ul></ul><ul><ul><li>electronic reserve </li></ul></ul><ul><li>affected teaching, learning, research by the whole community </li></ul>
    37. 37. some critical issues for searching <ul><li>no way yet to do federated searching in digital libraries </li></ul><ul><ul><li>to search several indexes at the same time </li></ul></ul><ul><ul><li>each source has to be searched separately </li></ul></ul><ul><ul><ul><li>most have very different search features, capabilities </li></ul></ul></ul><ul><li>finding items in indexes does not always mean being able to get the full text </li></ul><ul><li>thus, searching is time-consuming, chaotic </li></ul>
    38. 38. where to find out? <ul><li>information about digital libraries </li></ul><ul><ul><li>LibWeb U California, Berkeley </li></ul></ul><ul><ul><li>“ lists currently over 7200 pages from libraries in over 125 countries” </li></ul></ul><ul><ul><li>Digital Library Federation </li></ul></ul><ul><ul><li>“ a consortium of libraries and related agencies that are pioneering the use of electronic-information technologies to extend their collections and services” </li></ul></ul><ul><ul><li>D-Lib Magazine </li></ul></ul><ul><ul><li>“ a solely electronic publication with a primary focus on digital library research and development, including but not limited to new technologies, applications, and contextual social and economic issues” </li></ul></ul>
    39. 39. where? … <ul><li>Ariadne (UK) </li></ul><ul><ul><li>“ to report on information service developments and information networking issues worldwide, keeping the busy practitioner abreast of current digital library initiatives” </li></ul></ul><ul><li>Information Technology and Libraries </li></ul><ul><ul><li>ALA publication; “related to all aspects of libraries and information technology, including digital libraries” </li></ul></ul><ul><li>Journal of Digital Information </li></ul><ul><ul><li>“ Publishing papers on the management, presentation and uses of information in digital environments” </li></ul></ul><ul><li>Biblio Tech Review </li></ul><ul><ul><li>“ Information Technology for Libraries” – monthly news and review magazine </li></ul></ul>
    40. 40. in conclusion <ul><li>search engines are great but you have to KNOW what is under the hood </li></ul><ul><ul><li>as to coverage, business model, search features, outputs … </li></ul></ul><ul><ul><li>they are NOT for every kind of information need </li></ul></ul><ul><li>digital libraries are great for searching but you have to KNOW the requirements for searching the different resources that are included </li></ul><ul><ul><li>there is no federated searching as yet, nor will there be for some time to come </li></ul></ul>
    41. 41. art of searching digital libraries
    42. 42. and rewards …