Web Search 101

Finding Lesson Plans, Activities, Songs, Games, and Conducting Serious Academic Research
MADE EASIER, FASTER AND MORE ACCURATE

Published in: Education, Technology, Design
  1. 1. Web Search 101: Finding Lesson Plans, Activities, Songs, Games, and Conducting Serious Academic Research MADE EASIER, FASTER AND MORE ACCURATE. Developed by William Tweedie
  2. 2. October 2011 & 2012
Table of Contents
Preface .... 4
Objectives .... 5
Materials: .... 5
Timing: .... 5
Procedure .... 6
Part 1 – The Surface Web, Search Engines and Directories .... 6
A. Activating Prior Knowledge .... 6
B. Search Engine – An online (Internet) World Wide Web search program .... 7
D. Search Queries .... 8
FRAMING YOUR SEARCH STRATEGY .... 8
ACTIVITY: .... 9
E. Basic Boolean Search Operators (AND, OR, NOT) .... 10
F. Search Tips, Tricks and Techniques .... 10
G. Wrap-up of Part 1 .... 10
Part 2 – The Hidden Web .... 10
The Internet, World Wide Web and the Hidden Web .... 11
  Scratching the Surface and Digging Deep – Layers of the Web .... 12
  Education .... 14
Three Types of Search Engines .... 18
  Crawler-based search engines .... 18
  Human-powered directories .... 19
  Hybrid search engines .... 20
  Table of Search Engine Features .... 20
  How do Search Engines Work? .... 22
  Table of Directory Features .... 23
  Subject Directories (Contain Databases), and Portals .... 24
  How to Find Subject-Focused Directories for a Specific Topic, Discipline, or Field .... 24
  What Are "Meta-Search" Engines? How Do They Work? .... 25
  Are "Smarter" Meta-Searchers Still Smarter? .... 25
  Better Meta-Searchers .... 25
  3. 3.   Meta-Search Engines for SERIOUS Deep Digging .... 26
Search Basics: Constructing a Google Query .... 26
  Where does the term Boolean originate from? .... 27
  Is Boolean Search Complicated? .... 27
  Boolean Search And / Or / Not .... 27
  Boolean Search Examples Boolean Connectors: .... 28
  Interactive Text Equivalent .... 28
  How the Search Engines Differ .... 30
  Search Engine Syntax & Features Comparison Chart .... 30
  Some Search Tips, Tricks, & Techniques .... 33
  Invisible or Deep Web: What it is, How to find it, and its inherent ambiguity .... 34
  Why isn't everything visible? .... 34
  How to Find the Invisible Web .... 35
  The Ambiguity Inherent in the Invisible Web: .... 35
  Want to learn more about the Invisible Web? .... 35
  10 Search Engines to Explore the Invisible Web .... 36
  How do we get to this mother lode of information? .... 36
The Invisible Web Databases .... 41
Dictionaries, Translators, & Other Language & Reference Tools .... 44
Web directories .... 48
Internet Gateways, Jumplists, & Specialized Link Collections .... 48
  Finding Jumplists & Gateways .... 49
www.invisible-web.net .... 49
  Saving pages with Microsoft Internet Explorer .... 50
  Peer-to-Peer Computing .... 50
  Education .... 50
  Subject-orientated search services .... 52
  Additional information about search engines, their use, and how they find resources .... 52
  Data services requiring registration .... 52
  Data services with unrestricted access .... 54
  Search Engines .... 55
  Subject-orientated search services .... 56
  Dictionaries and Thesauri .... 57
  Reference Works .... 58
General Tips for Searching the Web .... 60
  4. 4.   Carefully Select Your Search Terms .... 60
  Framing your search strategy .... 60
International Educational Research Links .... 62
Education databases .... 64
Teaching websites .... 64
Journals .... 65
Newsletters .... 65
New Educational Technology Standards for Teachers and Students .... 65
  NETS for Teachers 2008 .... 65
  NETS for Students 2007 .... 67
  Glossary .... 69
  A to Z Computer/Internet Terms .... 69
Appendix A .... 74
Preface
The Internet and its World Wide Web are growing, developing and adding new features at an explosive rate. As you read this, new technologies are being developed and implemented to make ‘surfing’ the Internet for useful information of all types easier and more accurate, from the traditional document to Flash videos and file types that were previously inaccessible. These types of pages used to be invisible but can now be found in most search engine results:
 • Pages in non-HTML formats (pdf, Word, Excel, PowerPoint), now converted into HTML.
 • Script-based pages, whose URLs contain a ? or other script coding.
  5. 5. • Pages generated dynamically by other types of database software (e.g., Active Server Pages, Cold Fusion). These can be indexed if there is a stable URL somewhere that search engine crawlers can find.
The "visible web" is what you can find using general web search engines. It's also what you see in almost all subject directories. The "invisible web" is what you cannot find using these types of tools.
Search engine crawlers and indexing programs have overcome many of the technical barriers that made it impossible for them to find "invisible" web pages.
Computer robot programs, sometimes referred to as "crawlers", "knowledge-bots" or "knowbots", are used by search engines to roam the World Wide Web via the Internet, visit sites and databases, and keep the search engine database of web pages up to date. They obtain new pages, update known pages, and delete obsolete ones. Their findings are then integrated into the "home" database. Most large search engines operate several robots all the time. Even so, the Web is so enormous that it can take six months for spiders to cover it, resulting in a certain degree of "out-of-datedness" (link rot) in all the search engines.
http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/Glossary.html
Therefore this is truly just a starting point for the serious researcher, whether in academia or as a consumer of goods and services.
Objectives
In this brief overview we will look at and explore the elements that make for effective research on the Internet.
    1. You will learn the Internet is composed of the “Surface Web” and the “Deep” or “Hidden” Web.
    2. You will learn how to access information on both in the most expedient way through Search Engines, Meta-search engines and other Internet tools.
       a. You will learn what Search Engines are and the various types available.
       b. You will learn what Subject Directories, Portals, and Databases are.
    3. You will learn how to construct a search strategy.
    4. You will learn the basics of Boolean parameters, which narrow search results.
    5. You will be provided special resources for academic research.
Materials:
This workshop needs to be conducted in a computer lab with very good Internet access. Participants will follow specific areas of this reference book throughout the workshop.
These areas can be changed according to the needs of the group. This reference book is as comprehensive a guide as possible at the time of production.
Timing:
  6. 6. This workshop is designed to give a brief introduction to the complex world of the ‘Surface’ and ‘Hidden’ Webs with a focus on helping make searches more effective and productive. The normal time allotted is 2 hours, but it can be extended according to time availability and the group’s level of expertise and interest. It is fully expected that participants will regularly refer to this book and refine their search skills independently.
DISCLAIMER: Changes on the Internet and in the Hidden Web occur at a rapid pace, so some of the search engines, sites, directories and databases may no longer be available at the web addresses provided and some may no longer exist. Be prepared to move quickly to the next point of interest. Broken links and inaccessible web-sites can be researched at a later date.
Procedure
It is preferable to distribute this reference book well in advance of the workshop so participants can familiarize themselves with the terms and content, and explore a few of the sites.
Part 1 – The Surface Web, Search Engines and Directories
A. Activating Prior Knowledge
ACTIVITY: PRIME TASK: Q & A
1. The Surface Web (WWW) – What is it composed of?
  7. 7. Write as many types of information or components of the World Wide Web as you can.
Time: 10 minutes
________________________________________
2. How can you access this information?
Write as many ways as you can.
Time: 10 minutes
________________________________________
3. How many Search Engines can you name? What is your favorite search engine? Do you use more than one?
Write your answers.
Time: 5 minutes
________________________________________
4. How often do you use a search engine in a day? Week? What do you search for? How long do you spend per search? Do you get the results you need or want?
Write your answers.
Time: 5 minutes
________________________________________
B. Search Engine – An online (Internet) World Wide Web search program.
  8. 8. 1. There are 3 types of Search Engine:
a) Crawler-based (e.g. Google) – these create their listings automatically through special programs that crawl or "spider" the web. The programs follow links in web pages the engine already has in its collection of sources and retrieve information found in the index servers of web-sites (containing key words), then send it back to the engine’s doc servers, which retrieve the entire document and create snippets that describe the document and contain the key words that might be the subject of a search query. – Very fast.
b) Human Powered Directories (e.g. Open Directory Project) – these get their information from visitor submissions, which include a short description that is the source of any key words in a search. – Also fast.
c) Hybrid Engines – these combine results from the first two, though one engine may have a preference for one over the other. – Depends on the engine.
Search engines rely on their own ‘cache’ of web pages they have harvested, but when a result is accessed (clicked) you are taken to the source’s latest page. If a page is never linked it cannot be indexed.
The pages indexed are visible pages only. We’ll look at the Invisible web in Part 2.
2. How many search engines do you think there are? 80% of web pages in a major search engine exist only on that engine; so, it is worth taking a look at some of the others for a ‘second opinion’.
ACTIVITY:
Choose a topic and search for it on Google or www.DuckDuckGo.com. Then do the same search on Exalead (www.exalead.com/search/). Compare the number of results and the sources of these results.
Time: 20 minutes
C. Meta-search Engines – combine the results of many search engines. (www.dogpile.com), (www.surfwax.com)
ACTIVITY:
Use the same search term as in the previous activity and compare the results again.
Time: 10 minutes
D. Search Queries
FRAMING YOUR SEARCH STRATEGY
To get a successful search result, you must ask the right search question. Framing a good question requires you to think strategically about exactly what you need.
"By taking the time to identify key phrases and visualize the ideal answer, you will be more likely to recognize that answer when you find it online." (Nora Paul)
Her guidelines are based on the standard journalist approach of "who, what, when, where, why and how" reporting and include these tips, among others:
  9. 9. Who:
 • Who is the research about: a politician, a businessperson, a scientist, a criminal?
 • Who is key to the topic you are researching? Are there any recognized experts or spokespersons you should know about?
What:
 • What kind of information do you need: statistics, sources, background?
 • What kind of research are you doing: an analysis, a background report, a follow-up?
 • What would the ideal answer look like?
When:
 • When did the event being researched take place? This will help determine the source to use, particularly, which information source has resources dating far enough back.
 • Do you know when you should stop searching?
Where:
 • Where did the event you are researching take place?
 • Where have you already looked for information?
 • Where might there have been previous coverage: newspapers, broadcasts, trade publications, court proceedings, discussions?
Why:
 • Why do you need the research: seeking a source to interview, surveying a broad topic, pinpointing a fact?
 • Why must you have the research: to make a decision, to corroborate a premise?
How:
 • How much information do you need: a few good articles for background, everything in existence on the topic, just the specific fact?
 • How are you going to use the information: for an anecdote, for publication?
"Today," Schlein says, "so much data is available that, without a plan, you can easily find yourself swimming in an ocean of information…A good, clear question will save you hours of work." Find Paul's complete checklist and other good search suggestions from Schlein in Find It Online (Tempe, AZ: Facts on Demand Press, 2004).
ACTIVITY:
Reframe the above criteria for research on an academic topic.
Time: 15 minutes
  10. 10. E. Basic Boolean Search Operators (AND, OR, NOT)
ACTIVITY:
Complete the 4 activities on the “Boolify” worksheets.
Time: 30 minutes
See Appendix A
F. Search Tips, Tricks and Techniques
See page 25 below
Time: 5 minutes
G. Wrap-up of Part 1
Reflection and Feedback
Part 2 – The Hidden Web
Look at 10 Search Engines to Explore the Invisible Web on pages 28 – 33
Experiment and explore some of the Web Portals, Directories and Databases
A. List the Categories you find in each
B. Try Boolean searching for a specific topic you currently are researching for a paper or lesson
Time: 1 hour
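For participants who are comfortable with a little programming, the three Boolean operators from Section E (AND, OR, NOT) can be pictured as set operations on lists of pages. The sketch below is only an illustration: the four page texts and all function names are invented for this example, and real search engines use their own query syntax and far larger indexes.

```python
# A tiny, made-up collection of pages (hypothetical names and texts).
PAGES = {
    "page1": "lesson plans for teaching english songs",
    "page2": "english grammar games and activities",
    "page3": "songs and games for young learners",
    "page4": "academic research on lesson planning",
}


def build_index(pages):
    """Map every word to the set of pages that contain it."""
    index = {}
    for name, text in pages.items():
        for word in text.split():
            index.setdefault(word, set()).add(name)
    return index


INDEX = build_index(PAGES)


def pages_with(word):
    """Pages containing the word (an empty set if the word is unknown)."""
    return INDEX.get(word.lower(), set())


def search_and(a, b):
    """a AND b: pages that contain both words (narrows the results)."""
    return pages_with(a) & pages_with(b)


def search_or(a, b):
    """a OR b: pages that contain either word (widens the results)."""
    return pages_with(a) | pages_with(b)


def search_not(a, b):
    """a NOT b: pages that contain a but not b (excludes results)."""
    return pages_with(a) - pages_with(b)


print(sorted(search_and("songs", "games")))  # ['page3']
print(sorted(search_or("songs", "games")))   # ['page1', 'page2', 'page3']
print(sorted(search_not("songs", "games")))  # ['page1']
```

Note how AND returns the fewest pages and OR the most, which is why AND and NOT are the operators usually used to narrow a search.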
  11. 11. You may make notes below:
The Internet, World Wide Web and the Hidden Web
The Internet is a network of computers connected together (External net) to share information with others through means of the World Wide Web (WWW).
World Wide Web (WWW) is part of the Internet where text and graphics are placed together and where information can be easily accessed and shared with others to form a Web Page along with links to different documents or other places (Hypertext or Hyperlinks).
- From the Glossary Section at the end of this reference book
  12. 12. The World Wide Web is also known as the ‘Surface Web’ – available to anyone who has a computer and internet connection.
Scratching the Surface and Digging Deep – Layers of the Web
"The Invisible Web"
By Chris Sherman
There's a big problem with most search engines, and it's one many people aren't even aware of. The problem is that vast expanses of the Web are completely invisible to general purpose search engines like AltaVista, HotBot and Google. Even worse, this "Invisible Web" is in all likelihood growing significantly faster than the visible Web you're familiar with.
So what is this Invisible Web and why aren't search engines indexing it? To answer this question, it's important to first define the "visible" Web, and describe how search engines compile their indexes.
The Web was created a little over twenty-two years ago by Tim Berners-Lee, a researcher at CERN, the European Organization for Nuclear Research, a high-energy physics laboratory in Switzerland. (The name CERN is derived from the acronym for the French Conseil Européen pour la Recherche Nucléaire.)
Berners-Lee designed the Web to be platform-independent, so that researchers at CERN could share materials residing on any type of computer system, avoiding cumbersome and potentially costly conversion issues. To enable this cross-platform capability, Berners-Lee created HTML, or HyperText Markup Language - essentially a dramatically simplified version of SGML (Standard Generalized Markup Language).
HTML documents are simple: they consist of a "head" portion, with a title and perhaps some additional meta-data describing the document, and a "body" portion, the actual document itself. The simplicity of this format makes it easy for search engines to retrieve HTML documents, index every word on every page, and store them in huge databases that can be searched on demand.
What's less easy is the task of actually finding all the pages on the Web. Search engines use automated programs called spiders or robots to "crawl" the Web and retrieve pages. Spiders function much like a hyper-caffeinated Web browser - they rely on links to take them from page to page.
Crawling is a resource-intensive operation. It also puts a certain amount of demand on the host computers being crawled. For these reasons, search engines will often limit the number of pages they retrieve and index from any given Web site. It's tempting to think that these unretrieved pages are part of the Invisible Web, but they aren't. They are visible and indexable, but the search engines have made a conscious decision not to index them.
In recent months, much has been made of these overlooked pages. Many of the major engines are making serious efforts to include them and make their indexes more comprehensive. Unfortunately, the engines have also discovered through their "deep crawls" that there's a tremendous amount of duplication and spam on the Web.
Current estimates put the Web at about 1.2 to 1.5 billion indexable pages. Both Inktomi and AltaVista have claimed that they've spidered most of these documents, but have been forced to cull their indexes to cope with duplicates and spam. Inktomi
  13. 13. puts the size of the distilled Web at about 500 million pages; AltaVista at about 350 million.
But these numbers don't include Web pages that can't be indexed, or information that's available via the Web but isn't accessible by the search engines. This is the stuff of the Invisible Web.
Why can't some pages be indexed? The most basic reason is that there are no links pointing to a page that a search engine spider can follow. Or, a page may be made up of data types that search engines don't index - graphics, CGI scripts, Macromedia Flash or PDF files, for example.
But the biggest part of the Invisible Web is made up of information stored in databases. When an indexing spider comes across a database, it's as if it has run smack into the entrance of a massive library with securely bolted doors. Spiders can record the library's address, but can tell you nothing about the books, magazines or other documents it contains.
There are thousands - perhaps millions - of databases containing high-quality information that are accessible via the Web. But in order to search them, you typically must visit the Web site that provides an interface to the database. The advantage to this direct approach is that you can use search tools that were specifically designed to retrieve the best results from the database. The disadvantage is that you need to find the database in the first place, a task the search engines may or may not be able to help you with.
Another problem is that content in some databases isn't designed to be directly searchable. Instead, Web developers are taking advantage of database technology to offer customized content that's often assembled on the fly. Search engine results pages are an example of this type of dynamically generated content - so are services like My Excite and My Yahoo. As Web sites get more complex and users demand more personalization, this trend toward dynamically generated content will accelerate, making it even harder for search engines to create comprehensive Web indexes.
In a nutshell, the Invisible Web is made up of unindexable content that search engines either can't or won't index. It's a huge part of the Web, and it's growing.
Fortunately, there are several reasonably thorough guides to the Invisible Web.
Gary Price, Reference Librarian at the Gelman Library at George Washington University, is considered one of the foremost authorities on online databases and other invaluable search resources on the Invisible Web.
http://www.resourceshelf.com/
Price's List of Lists (LOL) was started around 1998 and maintained by Gary Price for many years. The LOL grew, and Gary's commitment to other projects and speaking engagements made the upkeep of the LOL impossible. In late 2000, Gary approached Trip Wyckoff, of Specialissues.com, about taking over the upkeep and expansion of the LOL. By 2002 the online database and structure to maintain and organize the LOL was in place and in October 2002 the LOL was transferred to www.Specialissues.com.
"By the way, do not mistake an interest in the Invisible Web as a slam on the general search engines because it is NOT," says Price. "General search tools are still 100% essential for accessing material on the Internet."
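Workshop note (not part of Sherman's article): the point that a spider can only reach pages that other pages link to can be sketched in a few lines of code. This is a simplified illustration under stated assumptions, not how any real search engine is built. The page names and the LINKS table are invented, and the "web" here is just a small dictionary rather than live sites.

```python
from collections import deque

# A made-up miniature web: each page lists the pages it links to.
LINKS = {
    "home": ["lessons", "songs"],
    "lessons": ["games", "home"],
    "songs": ["games"],
    "games": [],
    "database-record": ["home"],  # links out, but no page links TO it
}


def crawl(start, links):
    """Follow links outward from a start page, visiting each page once."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


visited = crawl("home", LINKS)
invisible = sorted(set(LINKS) - set(visited))

print(visited)    # ['home', 'lessons', 'songs', 'games']
print(invisible)  # ['database-record']
```

The page "database-record" exists, but because nothing links to it the crawl never finds it. In this toy model that page plays the role of Invisible Web content.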
  14. 14. One of the largest gateways to the Invisible Web is the aptly named Invisibleweb.com <http://www.invisibleweb.com> from Intelliseek.
"Invisible Web sources are critical because they provide users with specific, targeted information, not just static text or HTML pages," says Sundar Kadayam, CTO and Co-Founder, Intelliseek.
"InvisibleWeb.com is a Yahoo-like directory. It is a high quality, human edited and indexed, collection of highly targeted databases that contain specific answers to specific questions," says Kadayam.
Intelliseek also makes BullsEye, a desktop based metasearch engine that can also access many of the sites included in InvisibleWeb.com. More information can be found at <http://www.intelliseek.com/prod/bullseye.htm>.
"A good librarian would not start looking for a phone number (specialized, Invisible Web info) by searching the Encyclopaedia Britannica (general knowledge resource)," says Price. "Both professional and casual searchers should at least be aware that they could be missing some information or wasting time finding what could be found more easily if the right tool for the job is easily accessible. This is very similar to a good reference librarian knowing the major reference tools in his or her collection."
Chris Sherman is the Web Search Guide for About.com.
- Extracted from http://web.freepint.com/go/newsletter/64
Gary Price's List of Lists
Agriculture, Forestry, Fishing and Hunting, Petroleum & Mining, Utilities, Construction, Manufacturing, Wholesale Trade, Retail Trade, Transportation and Warehousing Information, Finance & Insurance, Real Estate Rental & Leasing, Professional, Scientific, and Technical Services, Business & Industry Management, Administrative & Support Services, Education, Health Care and Social Assistance, Arts, Entertainment and Recreation, Accommodation and Food Services, Repairs, Religious, Civic, Professional, and Similar Organizations, Public Administration & Public Works, Country/Region Specific, Executives…
- extracted from http://www.specialissues.com/lol/
Education
Magazine | Article | Year
American School & University Magazine | Top 10 Issue (biggest, best and most popular in education facilities and business) | 2005
American School & University Magazine | Top 10 Issue (biggest, best and most popular in education facilities and business) | 2003
  15. 15. American School & University Magazine | Top 100 School Districts and Colleges Facilities (ranked by size of facilities) | 2003
American School & University Magazine | Top 10 Issue (biggest, best and most popular in education facilities and business) | 2004
American School & University Magazine | Top 100 School Districts and Colleges Facilities (ranked by size of facilities) | 2004
American School & University Magazine | Top 100 School Districts and Colleges Facilities (ranked by size of facilities) | 2002
American School & University Magazine | Top 10 Issue (biggest, best and most popular in education facilities construction, operations and management) | 2006
American School & University Magazine | Top 100 School Districts and Colleges Facilities (ranked by size of facilities) | 2006
Business Week (Global edition) (formerly North America edition) | Best Business Schools (ranking and review of the world's leading business schools) (1986) | 2002
Business Week (Global edition) (formerly North America edition) | Best Executive Education/Business Schools (ranking and review of the world's leading business schools) (1986) | 2005
Business Week (Global edition) (formerly North America edition) | Best Executive Education/Business Schools (ranking and review of the world's leading business schools) (1986) | 2004
Business Week (Global edition) (formerly North America edition) | Young Professionals: Best Undergrad B-Schools | 2007
Business Week (Global edition) (formerly North America edition) | Young Professionals: Best Undergrad B-Schools | 2008
Canadian Business | MBA Report (annual look at master of business administration education, we've decided to forgo our traditional ranking of Canada's MBA programs and instead examine the ever-increasing variety of choices Canadian schools are offering) (1991) | 2003
Chief Executive | Annual Best Business Schools for Executive Education (2004) | 2006
  16. 16. Chronicle of Higher Education, The | Almanac of Higher Education (statistical/demographic databook on education covering four major topical areas: students, faculty and staff, resources, and institutions) (separate issue) | 2002
Expansion Management | Metro With the Best Public Education Systems | 2005
Foodservice Director | College Census (2001 performance report for 100 top self-op colleges) | 2002
Foodservice Director | School Census (performance report for top 100 school districts) | 2002
Forbes | Best Business Schools (ranked by return on investment) (2001, biennial) | 2008
Forbes | Best Business Schools (ranked by return on investment) (2001, biennial) | 2007
Fortune | Top 50 MBA Employers | 2007
Fortune (International Version: Asia, Europe, Latin America) | 20 Great Employers for New Grads | 2007
Fortune Small Business: FSB | 10 Cool Colleges for Entrepreneurs | 2006
Fortune Small Business: FSB | Best Colleges for Entrepreneurs | 2007
Maclean's | Canada's Best Schools | 2004
Maclean's | Annual University Ranking (1990) | 2004
Meat & Poultry | Scholastic Top 10 (top 10 universities ranked by the quality and variety of workshops, conferences and short courses available at universities throughout the U.S.) (2000) | 2004
Meat & Poultry | Top 10 Universities (top 10 universities ranked by the quality and variety of workshops, conferences and short courses available at universities throughout the U.S.) (2000) | 2007
  17. 17. National Law Journal | NLJ Law Schools Report | 2008
Progress Magazine (CA) (formerly Atlantic Progress Magazine) | The High School Report Card (the AIMS Ranking of High School Performance in Every District in Atlantic Canada and Maine) (2002) | 2009
Quirk's Marketing Research Review | University Degree Programs in Marketing Research | 2008
School Bus Fleet | Statistics & Top Rankings | 2003
School Bus Fleet | Top 50 Contractor Fleets | 2002
School Bus Fleet | Top 100 School District Fleets | 2002
School Planning & Management | Leading the Way: America's Fastest Growing Districts | 2007
Technology Review (formerly MIT Technology Review) | University Research Scorecard (ranking and analysis of intellectual property and research revenues and spin-offs, includes profiles of hot start-ups) | 2002
U.S. News and World Report | Best Graduate Schools Guide | 2002
U.S. News and World Report | America's Best Colleges Guide | 2002
U.S. News and World Report | Colleges (1,400+ schools) | 2002
U.S. News and World Report | Community Colleges (1,200+ schools) | 2002
U.S. News and World Report | Corporate E-learning vendors (600+ providers) | 2002
U.S. News and World Report | E-learning courses and degrees (1,000+ institutions) | 2002
U.S. News and World Report | Graduate Schools (1,000+ programs) | 2002
U.S. News and World Report | Scholarships (600,000+ awards) | 2002
U.S. News and World Report | Best Graduate Schools | 2005
U.S. News and World Report | Best Colleges | 2004
  18. 18. Virginia Business | Special Report: Business Schools Directory | 2006
Virginia Business | Private Schools Directory | 2006
Virginia Business | Special Report: Community Colleges Directory | 2006
Virginia Business | Education: Engineering/IT Schools Directory | 2006
Three Types of Search Engines
The term "search engine" is often used generically to describe crawler-based search engines, human-powered directories, and hybrid search engines. These types of search engines gather their listings in different ways, through crawler-based searches, human-powered directories, and hybrid searches.
Crawler-based search engines
Crawler-based search engines, such as Google (http://www.google.com), create their listings automatically. They "crawl" or "spider" the web, then people search through what they have found. If web pages are changed, crawler-based search engines eventually find these changes, and that can affect how those pages are listed. Page titles, body copy and other elements all play a role.
A typical web query normally lasts less than half a second, yet involves a number of different steps that must be completed before results can be delivered to a person seeking information. The following graphic (Figure 1) illustrates this life span (from http://www.google.com/corporate/tech.html):
1. The web server sends the query to the index servers. The content inside the index servers is similar to the index in the back of a book: it tells which pages contain the words that match the query.
2. The query travels to the doc servers, which actually retrieve the stored documents. Snippets are generated to describe each search result.
3. The search results are returned to the user in a fraction of a second.

Human-powered directories

A human-powered directory, such as the Open Directory Project (http://www.dmoz.org/about.html), depends on humans for its listings. (Yahoo!, which used to be a directory, now gets its information from the use of crawlers.) A directory gets its information from submissions, in which the site owner sends the directory a short description of the entire site, or from editors who write a description for the sites they review. A search looks for matches only in the descriptions submitted. Changing web pages, therefore, has no effect on how they are listed. Techniques that are useful for improving a listing with a search engine have nothing to do with improving a listing in a directory. The only exception is that a good site, with good content, might be more likely to get reviewed for free than a poor site.
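The three steps in Figure 1 can be sketched in a few lines of Python. This is only a toy illustration, not how Google is actually built: the sample pages, the names DOCS, build_index and search, and the 30-character snippet length are all assumptions made for the example.

```python
from collections import defaultdict

# Toy "doc servers": stored copies of pages (hypothetical sample data).
DOCS = {
    1: "Vancouver florists deliver fresh flowers daily",
    2: "Welders in Vancouver offer mobile repair",
    3: "Florists in Hungary grow tulips and roses",
}

def build_index(docs):
    """Toy "index servers": map each lowercase word to the ids of the pages containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, docs, query):
    """Step 1: look up every query word in the index (all words must match).
    Step 2: fetch the stored pages and cut a short snippet from each.
    Step 3: return the list of (page id, snippet) results."""
    words = query.lower().split()
    if not words:
        return []
    matches = set.intersection(*(index.get(word, set()) for word in words))
    return [(doc_id, docs[doc_id][:30]) for doc_id in sorted(matches)]

INDEX = build_index(DOCS)
print(search(INDEX, DOCS, "vancouver florists"))
```

Like the index in the back of a book, the lookup never reads the pages themselves while matching; it only consults the word list built in advance.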
Hybrid search engines

Today, it is extremely common for crawler-type and human-powered results to be combined when conducting a search. Usually, a hybrid search engine will favor one type of listings over another. For example, MSN Search (http://www.imagine-msn.com/search/tour/moreprecise.aspx) is more likely to present human-powered listings from LookSmart (http://search.looksmart.com/). However, it also presents crawler-based results, especially for more obscure queries.

Recommended Search Engines
UC Berkeley - Teaching Library Internet Workshops

Google is currently the most used search engine. It has one of the largest databases of Web pages, including many other types of web documents (blog posts, wiki pages, group discussion threads) and document formats (e.g., PDFs, Word or Excel documents, PowerPoints). Despite the presence of all these formats, Google's popularity ranking often places worthwhile pages near the top of search results.

Google alone is not always sufficient, however. Not everything on the Web is fully searchable in Google. Overlap studies show that more than 80% of the pages in a major search engine's database exist only in that database. For this reason, getting a "second opinion" can be worth your time. For this purpose, we recommend Yahoo! Search or Exalead. We do not recommend using meta-search engines as your primary search tool.

Table of Search Engine Features

Some common techniques will work in any search engine. However, in this very competitive industry, search engines also strive to offer unique features. When in doubt, look for "help", "FAQ", or "about" links.

Search engines compared: Google (www.google.com), Yahoo! Search (search.yahoo.com), Exalead (www.exalead.com/search/)

Links to help
  Google: Google help and FAQ
  Yahoo! Search: Yahoo! help
  Exalead: Exalead help

Size, type
  Google: IMMENSE. Size not disclosed in any way that allows comparison. Probably the biggest.
  Yahoo! Search: HUGE. Claims over 20 billion total "web objects."
  Exalead: LARGE. Claims to have over 8 billion searchable pages.

Noteworthy features
  Google: PageRank™ system includes hundreds of factors, emphasizing pages most heavily linked from other pages. Many additional databases including Book Search, Scholar (journal articles), Blog Search, Patents, Images, etc.
  Yahoo! Search: Shortcuts give quick access to dictionary, synonyms, patents, traffic, stocks, encyclopedia, and more.
  Exalead: Truncation lets you search by the first few letters of a word. Proximity search lets you find terms NEAR each other or NEXT to each other. Thumbnail page previews. Extensive options for refining and limiting your search.

Phrase searching
  Google: Enclose phrase in "double quotes".
  Yahoo! Search: Enclose phrase in "double quotes".
  Exalead: Enclose phrase in "double quotes".

Boolean logic
  Google: Partial. AND assumed between words. Capitalize OR. ( ) accepted but not required. In Advanced Search, partial Boolean available in boxes.
  Yahoo! Search: Accepts AND, OR, NOT or AND NOT. Must be capitalized. ( ) accepted but not required.
  Exalead: Partial. AND assumed between words. Capitalize OR. ( ) accepted. See Web Search Syntax for more options.

+Requires / -Excludes
  Google: - excludes. + retrieves "stop words" (e.g., +in).
  Yahoo! Search: - excludes. + will allow you to search common words: "+in truth".
  Exalead: - excludes. + retrieves "stop words" (e.g., +in).

Sub-Searching
  Google: The search box at the top of the results page shows your current search. Modify this (e.g., add more terms at the end).
  Yahoo! Search: The search box at the top of the results page shows your current search. Modify this (e.g., add more terms at the end).
  Exalead: The search box at the top of the results page shows your current search. Modify this (e.g., add more terms at the end).

Results Ranking
  Google: Based on page popularity measured in links to it from other pages: high rank if a lot of other pages link to it. Fuzzy AND also invoked. Matching and ranking based on "cached" version of pages that may not be the most recent version.
  Yahoo! Search: Automatic Fuzzy AND.
  Exalead: Popularity ranking emphasizes pages most heavily linked from other pages.

Field limiting
  Google: link: site: intitle: inurl: Offers U.S. Govt Search and other special searches. Patent search.
  Yahoo! Search: link: site: intitle: inurl: url: hostname: (Explanation of these distinctions.)
  Exalead: intitle: inurl: site: after:[time period] before:[time period] (For details, click on "Advanced search".)

Truncation, Stemming
  Google: No truncation within words. Automatically stems some words. Search variant endings and synonyms separately, separating with OR (capitalized): airline OR airlines. Use * or _ as wildcards substituting for initials or words: sickle * anemia, george _ bush.
  Yahoo! Search: Neither. Search with OR as in Google.
  Exalead: Use * (example: messag*).

Language
  Google: Yes. Major Romanized and non-Romanized languages in Advanced Search.
  Yahoo! Search: Yes. Major Romanized and non-Romanized languages.
  Exalead: Extensive language and geographic options. Use "Advanced Search".

Translation
  Google: Yes, in "Translate this page" link following some pages. To and sometimes from English and major European languages and Chinese, Japanese, Korean. Uses its own translation software with user feedback.
  Yahoo! Search: Available as a separate service.
  Exalead: Yes, in "Translate this page" link following some pages.

How do Search Engines Work?

Search engines do not really search the World Wide Web directly. Each one searches a database of web pages that it has harvested and cached. When you use a search engine, you are always searching a somewhat stale copy of the real web page. When you click on links provided in a search engine's search results, you retrieve the current version of the page.
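The difference between the stored copy and the live page can be shown with a small Python sketch. Everything in it is hypothetical: the URL, the page text, and the function names search_cache and follow_link are invented for the illustration.

```python
# Hypothetical live page and the search engine's stored copy of it.
live_web = {"http://example.com/menu": "soup of the day: tomato"}
cached_copy = dict(live_web)  # snapshot taken when the page was last harvested

# The site owner edits the page after the snapshot was taken.
live_web["http://example.com/menu"] = "soup of the day: mushroom"

def search_cache(term):
    """Search only the stored snapshot, as a search engine does."""
    return [url for url, text in cached_copy.items() if term in text]

def follow_link(url):
    """Clicking a result fetches the current page, not the snapshot."""
    return live_web[url]

hits = search_cache("tomato")
print(hits)                  # the stale copy still matches "tomato"
print(follow_link(hits[0]))  # the live page now says something else
```

A search for the new wording finds nothing until the engine harvests the page again, which is why a result sometimes no longer contains the words you searched for.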
Search engine databases are selected and built by computer robot programs called spiders. These "crawl" the web, finding pages for potential inclusion by following the links in the pages they already have in their database. They cannot use imagination or enter terms in search boxes that they find on the web.

If a web page is never linked from any other page, search engine spiders cannot find it. The only way a brand new page can get into a search engine is for other pages to link to it, or for a human to submit its URL for inclusion. All major search engines offer ways to do this.

After spiders find pages, they pass them on to another computer program for "indexing." This program identifies the text, links, and other content in the page and stores it in the search engine database's files so that the database can be searched by keyword and whatever more advanced approaches are offered, and the page will be found if your search matches its content.

Many web pages are excluded from most search engines by policy. The contents of most of the searchable databases mounted on the web, such as library catalogs and article databases, are excluded because search engine spiders cannot access them. All this material is referred to as the "Invisible Web": what you don't see in search engine results.

Recommended Subject Directories
UC Berkeley - Teaching Library Internet Workshops - extracted from http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/SubjDirectories.html

Recommended General Subject Directories:

Table of Directory Features

Web directories compared: ipl2 (www.ipl.org), Infomine (infomine.ucr.edu), About.com (www.about.com), Yahoo! (dir.yahoo.com)

Size, type
  ipl2: Over 40,000. Highest quality sites only. Useful, reliable annotations. Formed by a merger of the Librarians' Internet Index and the Internet Public Library.
  Infomine: Over 125,000. Useful, reliable annotations. Compiled by academic librarians from the University of California and elsewhere.
  About.com: Over 2 million. Generally good annotations done by "Guides" with various levels of expertise.
  Yahoo!: About 4 million. Very short descriptions. Often useful, especially for popular and commercial topics.

Phrase searching
  ipl2: No.
  Infomine: Yes. Use " ". |term term| requires exact match.
  About.com: Yes. Use " ".
  Yahoo!: Yes. Use " ".

Boolean logic
  ipl2: OR implied between words. Also accepts AND and NOT. Nesting with ( ) does not work.
  Infomine: AND implied between words. Also accepts OR, NOT, and ( ).
  About.com: No.
  Yahoo!: Yes, as in Yahoo! Search web search engine.

Truncation
  ipl2: No.
  Infomine: Use *. Also stems. Can turn stemming off. Use " " or | | to search exact terms.
  About.com: Use *. Not accepted consistently.
  Yahoo!: No.

Field searching
  ipl2: No.
  Infomine: Limit to Author, Title, Subject, Keyword, Description, and more.
  About.com: No.
  Yahoo!: As in Yahoo! Search web search engine.

Subject Directories (Contain Databases), and Portals

How to Find Subject-Focused Directories for a Specific Topic, Discipline, or Field

There are thousands of specialized directories on practically every subject. If you want an overview, or if you feel you've searched long enough, try to find one. Often they are done by experts, whether self-proclaimed or heavily credentialed. Here are some ways to find them:

Use any of the Subject Directories above to find more specific directories. Here are some tips:

 • In ipl2 or Infomine, look for your subject as you would for any other purpose, and keep your eyes open for sites that look like directories. Read through the descriptions. Sometimes these resources are identified as "Directories," "Virtual Libraries," or "Gateway Pages."
 • In About.com (a portal, that is, a site that links to many other sites arranged according to its own structure or directory) or the Yahoo! directory, try adding the terms web directories to your subject keyword term:

EXAMPLES:
civil war web directories
weddings web directories

 • In About.com, search by topic and look for pages that are described as "101" or "guides" or a "directory." About.com is written by "Guides" who, themselves, often are experts in the sections they manage. Sometimes they write excellent overviews of a topic.
Meta-Search Engines
UC Berkeley - Teaching Library Internet Workshops

What Are "Meta-Search" Engines? How Do They Work?

In a meta-search engine, you submit keywords in its search box, and it transmits your search simultaneously to several individual search engines and their databases of web pages. Within a few seconds, you get back results from all the search engines queried. Meta-search engines do not own a database of Web pages; they send your search terms to the databases maintained by search engine companies.

Are "Smarter" Meta-Searchers Still Smarter?

"Smarter" meta-searcher technology includes clustering and linguistic analysis that attempts to show you themes within results, and some fancy textual analysis and display that can help you dig deeply into a set of results. However, neither of these technologies is any better than the quality of the search engine databases they obtain results from.

Few meta-searchers allow you to delve into the largest, most useful search engine databases. They tend to return results from smaller and/or free search engines and miscellaneous free directories, often small and highly commercial.

Although we respect the potential of textual analysis and clustering technologies, we recommend directly searching individual search engines to get the most precise results, and using meta-searchers if you want to explore more broadly.

The meta-search tools listed here are "use at your own risk." We are not endorsing or recommending them.

Better Meta-Searchers

Yippy, yippy.com (formerly Clusty)
  What's Searched (as of date at bottom of page; they change often): Searches Bing, Ask, Open Directory, and Yahoo (as of 6/15/10).
  Complex Search Ability: Accepts Boolean operators AND, OR, NOT, and limiting by "filetype:" and "site:".
  Results Display: Results accompanied with subdivisions based on words in search results, intended to give the major themes. Click on these to search within results on each theme.

Dogpile, www.dogpile.com
  What's Searched: Searches Google, Yahoo, Bing, and Ask.com (as of 6/15/10). Sites that have purchased ranking and inclusion are mixed into the results. Watch for "Sponsored:".

Meta-Search Engines for SERIOUS Deep Digging

SurfWax, www.surfwax.com
  What's Searched: A better than average set of search engines. Can mix with educational, US Govt tools, and news sources, or many other categories.
  Complex Search Ability: Accepts " ", +/-. Default is AND between words. I recommend fairly simple searches, allowing SurfWax's SiteSnaps and other features to help you dig deeply into results.
  Results Display: Click on source link to view complete search results there. Click on the icon to view helpful "SiteSnap™" extracted from most sites in frame on right. Many additional features for probing within a site.

Copernic Agent, www.copernic.com
  What's Searched: Select from list of search engines by clicking on Advanced, then "Modify search engine settings".
  Complex Search Ability: ALL, ANY, Phrase, and more. Also Boolean searching within results under "Find in results" > "Advanced Find" (powerful!).
  Results Display: Must be downloaded and installed, but Basic version is free of charge. Table comparing versions.

Search Basics: Constructing a Google Query

Search engines work by providing you with a screen form containing one or more fields into which you type your search term (a combination of words and/or phrases). Single words are quick and easy, but produce much too general a result. With Google, for example, looking for florists yields 24 million hits (search results). If we narrow the search to florists in Vancouver (i.e. type florists Vancouver), we come up with 1.7 million results. Narrow further by making your search term a phrase. To do this, enclose the words in double quotation marks, as in "Vancouver florists". In Google, this example produces just 27,000 hits, because Google is making a match for the exact string of characters we typed.

Some search engines provide radio buttons that allow you to specify whether the search must match Any or All of the terms you type. Most default to All, returning pages that contain every word used in your search.
Choose Any to retrieve pages that contain one or more of your search words. This AND versus OR distinction is called Boolean logic, and it's the key to controlling the search engines. To specify an OR in Google, you must type the word OR between words. In our Vancouver florists scenario, for example, typing florists OR vancouver results in 85 million hits because it returns all pages containing either the word florists or the word Vancouver.
Thus, you might get florists in Hungary and welders in Vancouver! By combining ANDs, ORs, and phrases, you can begin to build truly powerful queries. Learn these techniques and many more powerful search strategies in our popular Internet research course.

Where does the term Boolean originate from?

Boolean searching is built on a method of symbolic logic developed by George Boole, a 19th century English mathematician. Most online databases and search engines support Boolean searches. Boolean search techniques can be used to carry out effective searches, cutting out many unrelated documents.

Is Boolean Search Complicated?

Using Boolean logic to broaden and/or narrow your search is not as complicated as it sounds; in fact, you might already be doing it. Boolean logic is just the term used to describe certain logical operations that are used to combine search terms in many search engine databases and directories on the Net. It's not rocket science, but it sure sounds fancy (try throwing this phrase out in common conversation!).

Basic Boolean Search Operators - AND

Using AND narrows a search by combining terms; it will retrieve documents that use both the search terms you specify, as in this example:
 • Portland AND Oregon

Basic Boolean Search Operators - OR

Using OR broadens a search to include results that contain either of the words you type in. OR is a good tool to use when there are several common spellings or synonyms of a word, as in this example:
 • liberal OR democrat

Basic Boolean Search Operators - NOT

Using NOT will narrow a search by excluding certain search terms.
NOT retrieves documents that contain one, but not the other, of the search terms you enter, as in this example:
 • Oregon NOT travel

Keep in mind that not all search engines and directories support Boolean terms. However, most do, and you can easily find out if the one you want to use supports this technique by consulting the FAQs (Frequently Asked Questions) on a search engine or directory's home page.

Boolean Search And / Or / Not

This is an algebraic concept, but don't let that scare you away. Boolean connectors are all about sets. There are three little words that are used as Boolean connectors:
 • and
 • or
 • not
Think of each keyword as having a "set" of results that are connected with it. These sets can be combined to produce a different "set" of results. You can also exclude certain "sets" from your results by using a Boolean connector.

AND is a connector that requires both words to be present in each record in the results. Use AND to narrow your search.

 Search Term | Hits
 Television | 999 hits
 Violence | 876 hits
 Television and violence | 123 hits

The words television and violence will both be present in each record.

OR is a connector that allows either word to be present in each record in the results. Use OR to expand your search.

 Search Term | Hits
 Adolescents | 97 hits
 Teenagers | 75 hits
 Adolescents or teenagers | 172 hits

Either adolescents or teenagers (or both) will be present in each record.

NOT is a connector that requires the first word be present in each record in the results, but only if the record does not contain the second word.

 Search Term | Hits
 High school | 423 hits
 Elementary | 652 hits
 High school not Elementary | 275 hits

Each record contains the words high school, but not the word elementary.

Boolean Search Examples
Boolean Connectors: Interactive Text Equivalent

This Boolean demonstration provides a simple example of how Boolean connectors can help focus your search as precisely as possible.
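The "sets" idea behind AND, OR and NOT maps directly onto Python's built-in set type. The sketch below uses a handful of made-up record numbers rather than the real hit counts, so the sets and their names are assumptions for illustration only.

```python
# Hypothetical record ids; real searches return far more records.
television = {1, 2, 3, 4, 5, 6}
violence = {4, 5, 6, 7, 8}

both = television & violence     # AND: records containing both words
either = television | violence   # OR: records containing at least one word
only_tv = television - violence  # NOT: television records without "violence"

# OR counts a shared record only once.
assert len(either) == len(television) + len(violence) - len(both)
# NOT removes exactly the shared records from the first set.
assert len(only_tv) == len(television) - len(both)

print(sorted(both), sorted(either), sorted(only_tv))
```

Note what the first check implies for the OR table above: adding the two hit counts (97 + 75 = 172) gives the OR total only when no record uses both terms. Any shared records would be counted once, making the total smaller.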
THE SCENARIO

Your research topic: television violence
You do a separate search for each keyword and get back the following results:
Television = 999
Violence = 876
That's a lot to wade through. Select AND, OR, or NOT to see how that Boolean connector will affect this search.

AND

You use AND to connect terms or phrases.
We have two words: television and violence. To connect them we use the Boolean connector AND. Compare the results of the search options below:

SEARCH #1: television
Result: A circle balloons until it fills about half the play area. As it gets bigger we see the word television appear. When it's finished generating, the results show up = 999 results.

SEARCH #2: violence
Result: A circle balloons until it fills about half the play area. As it gets bigger we see the word violence appear. When it's finished generating, the results show up = 876 results.

SEARCH #3: television AND violence
Result: The two circles balloon until they fill the play area as in those above. As they get bigger we see the words television and violence appear. When they're finished generating, the results show up as above; in addition, the space in between the two circles is a different color and it reads as follows:
AND = 123 results

OR

You use OR to search for multiple terms or phrases.
You've decided to focus on how violence on television affects a specific age group, that is, teenagers. But in your searches you've encountered another term that's frequently used: "adolescents."
So, in order to get information that uses either term, you'd use the OR connector.

SEARCH: teenager OR adolescent
Result: Both circles balloon until they fill the play area as above. As they get bigger we see the words teenager and adolescent appear. When they're finished generating, the results show up as above.
Next OR appears between them, and the two circles come towards one another. The text teenager, 75 results and adolescent, 97 results stay where they are. As the circles merge (and change into a new color) the OR disappears behind them.
When the merging has finished, the following text appears in the middle of the new circle.
Teenager OR Adolescent
75 + 97 = 172 results
The teenager = 75 results and adolescent = 97 results should now be outside the circle to the left and right.

NOT

You use NOT to exclude terms or phrases.
In one of your searches you use "high school" as a keyword phrase. You notice that you get many results which cover both high school and elementary school. The main emphasis of your research, as you've followed the process, has turned towards how television violence affects students in high school.
So, in order to eliminate unwanted results you use the NOT connector.

SEARCH: high school
The circle to the left balloons. As it gets bigger we see the words high school appear. When it's finished generating, the results show up as follows. High school = 423 results.

SEARCH: elementary
The circle to the right balloons. As it gets bigger we see the word elementary appear. When it's finished generating, the results show up as follows. Elementary = 652 results.

SEARCH: high school NOT elementary
Both circles balloon until they fill the play area as above. When they're finished generating, the results appear as above, but where the circles overlap it reads: NOT = 148 exclusions.
Next the elementary circle and the NOT overlap move away from the high school circle. The NOT area looks like a bite taken out of the high school circle.
When the elementary circle and the NOT bite stop, the results in the high school circle change to:
High school NOT elementary
423 - 148 exclusions = 275
By excluding all records that mention high school in combination with elementary, you get 275 results in which only high school is mentioned.

How the Search Engines Differ

The Web puts a variety of powerful search engines at your disposal, including Altavista, Google, All The Web, Teoma, Wisenut, and many more. Which is best? These tools vary in ease of use, not to mention features. Your choice of search engine should be driven by the research challenge you face. Some search engines are better than others for particular purposes.
See below for brief descriptions of today's major players, their respective strengths and weaknesses, and their affiliations:

Search Engine Syntax & Features Comparison Chart

An understanding of the syntax differences among search engines is essential to mastery of these tools and the ability to force them to return the precise results you want. Many of these sites appear to operate similarly, at least on the surface. Yet they can differ substantially in how they understand queries and allow you to filter results, as well as how they rank the hits returned. Consult our search basics page for information on syntax and operators, then experiment with the search engines in the chart provided. To click through to the various search engines, use the HTML chart below. We have also provided a PDF version of the chart for printing.

Altavista
  Boolean: + - ( ); AND, OR, AND NOT, NEAR ( )
  Default: Phrase, then AND (Simple Search)
  Phrase: ""
  Wildcards: Yes. * matches 1-5 characters; must type first 3 characters
  Case sensitive: No
  Prefixes: anchor, applet, domain, host, image, like, link, text, title, url
  Family filter: Yes. Password protected.

Google
  Boolean: OR; -; + to include stop words
  Default: AND
  Phrase: ""
  Wildcards: Whole word wildcard (*)
  Case sensitive: No
  Prefixes: filetype, daterange, cache, link, related, info, spell, stocks, site, intitle, allintitle, inurl, allinurl
  Family filter: Yes

All The Web
  Boolean: AND, OR, ANDNOT, ( ), +, -; ( ) means OR
  Default: AND
  Phrase: ""
  Wildcards: No
  Case sensitive: No
  Prefixes: site, url, link, title, language, filesize, filetype
  Family filter: Yes

Wisenut
  Boolean: +, -
  Default: AND
  Phrase: ""
  Wildcards: No
  Case sensitive: No
  Prefixes: language
  Family filter: Yes

Teoma
  Boolean: -, OR; + to include stop words
  Default: AND
  Phrase: ""
  Wildcards: No
  Case sensitive: No
  Prefixes: intitle, inurl, site, inlink, lang, afterdate, beforedate, between date
  Family filter: No
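One way to keep such syntax differences straight is to write the chart down as data and let a small helper assemble the query. The Python sketch below is hypothetical: the SUPPORTS_OR table transcribes only the Boolean column of the chart above, and build_query and its parameter names are invented for this example. The resulting strings follow the chart's syntax but have not been checked against the live engines.

```python
# Which engines list a literal OR keyword in the chart's Boolean column.
# (Wisenut's Boolean column lists only + and -.)
SUPPORTS_OR = {
    "Altavista": True,
    "Google": True,
    "All The Web": True,
    "Wisenut": False,
    "Teoma": True,
}

def build_query(engine, required=(), phrase=None, alternatives=(), excluded=()):
    """Assemble a query from pieces the chart lists for every engine:
    plain words, a "quoted phrase", and -excluded words. OR alternatives
    are added only for engines whose Boolean column includes OR."""
    if engine not in SUPPORTS_OR:
        raise ValueError(f"engine not in chart: {engine}")
    parts = list(required)
    if phrase:
        parts.append(f'"{phrase}"')
    if alternatives:
        if not SUPPORTS_OR[engine]:
            raise ValueError(f"{engine} has no OR operator in the chart")
        parts.append(" OR ".join(alternatives))
    parts.extend(f"-{word}" for word in excluded)
    return " ".join(parts)

print(build_query("Google", required=["florists"],
                  alternatives=["color", "colour"], excluded=["hungary"]))
```

For an engine without OR, such as Wisenut in this chart, the practical workaround is to run one search per alternative spelling.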
Google: Google is the world's most popular search engine. Claiming to search 3.3 billion pages (that's practically the entire Web!), this search engine remains undisputed king in terms of size. Google produces highly relevant results, using link popularity for ranking. Google's original claim to fame was its speed, although its clean, uncluttered interface has also won fans. Google defaults to AND when processing queries containing two or more words (returning pages that match all words specified). If you want either word (as in alternate spellings of color), you must actually force Google to see your search this way, by specifying the Boolean OR operator, as in color OR colour. Google supports exact phrase searching plus the ability to exclude words (use the minus sign) and to constrain by domain and other criteria. Alliances: Google has taken over the Deja newsgroup archive. It powers hundreds of other search engines and the web search feature of directories like Yahoo. Google's Web directory is provided by DMOZ.

Altavista: Still the champ in terms of raw search power, Altavista was recently purchased by Overture, the Net's major pay-per-click search company. Altavista's index is respectable, at 1 billion pages. It defaults to OR, ordering search results according to number, location and proximity of search term occurrences. Use Altavista when you need to construct complex queries containing nested combinations of AND and OR. Altavista supports the quasi-Boolean operators (+, -) and the formal Boolean operators (AND, OR, AND NOT, NEAR). This search engine allows you to constrain your search by domain, location within page, date, and numerous other criteria. Drawbacks include notoriously buggy hit counts and an interface that could stand some usability improvements. Alliances: Altavista, too, powers hundreds of other sites.
Its web directory is provided by DMOZ.

All The Web: At first glance, All The Web looks much like Google, providing the clean look and user-friendliness of the industry leader. All The Web defaults to AND, with a convenient tick box that allows you to specify a phrase. Its index rivals Google's, at 3.2 billion documents. It does not recognize formal Boolean arguments, although it supports quasi-Boolean operators (+, -) and the ability to constrain by domain, location within page, and several other criteria. Alliances: All The Web was also recently taken over by Overture.

Wisenut: Known for its clean screen and speedy performance, Wisenut set out to rival Google. A "clustering" search engine, Wisenut groups results into categories it calls "WiseGuide." Small plus and minus signs allow you to collapse and expand these categories. Like Google, Altavista, and other major players, Wisenut is a spider-based search engine that crawls, links and indexes page contents. Wisenut claims to have an index of 1.5 billion pages. Wisenut defaults to AND, and supports phrase searching and the + and - operators, though it offers no advanced search features as yet. Alliances: Wisenut is owned by Looksmart.

Teoma: Like Wisenut, Teoma set out to emulate Google's clean screen and fast performance. It too defaults to AND. Teoma's index is a respectable 1.5 billion pages. Like Google, Teoma evaluates page popularity, using complex relevance and link popularity algorithms to rank results. Teoma clusters search results at the top of the screen and displays a list of what it calls "Expert Link Collections" at bottom right. These listings point to sites Teoma considers authoritative link collections relevant to the subject of your search. Sometimes called jumplists, link collections can be among the Web's hidden treasures. Teoma is one of the few search engines to identify them. This feature alone makes it a valuable addition to your bookmark list. Alliances: Teoma was acquired by Ask Jeeves in 2001.

Site contents Copyright © 1994-2005 Pam Blackstone. All rights reserved.

Some Search Tips, Tricks, & Techniques

There's more to search success than simply typing a few words into a search engine. Here are a few points to keep in mind for your next search.

 • Choose the right tool for the job. It's not all about search engines! Choosing the appropriate research tool is half the battle. Know when to use a specialized resource such as a telephone directory, a regional directory, or a reference work like those you'd find at the Library.
 • Familiarize yourself with search engine syntax. The search engines all differ in the rules they apply when processing your query. Did you know, for example, that Google limits queries to ten words? If you type more than ten words, Google simply truncates your query, dropping excess words off the end. That's one good reason to plan your search strategy carefully! Check search engine sites for a link labelled Help or Search Tips for syntax information, and see our search basics page and feature comparison chart for more on this important success factor.
 • Think outside the box when specifying your search term. It's very much a trial and error process. Think about how the information you're after might be indexed. If you did not get results with one word, try a synonym. If, for example, you're seeking information about sailing, you might want to try both the words sailing and yachting. If a word has alternate spellings, specify it both ways (colour and color, for example).
 • Understand results ranking. Search engines use complicated formulas to order results. Most search engines evaluate web documents against your keywords, ordering results by relevance. They do this by assigning a numeric score to each hit, based on how closely it matches the specified term.
They all use different criteria for arriving at this score. Some search engines also factor popularity with users into how they order results, and they measure this in different ways as well. Be aware that advertising may also influence results ranking.
• Take advantage of collective human experience. Know when to tap into archived discussions. Look on the Web for facts; ask in discussion groups for opinions. Turn to newsgroups, mailing lists, or web forums for solutions to problems or for answers to obscure or esoteric questions. Google maintains a handy searchable archive of online discussions. Chances are, someone's already answered your question!
• Let someone else do the work. Sometimes, the fastest way to the information you're after is to locate a jumplist. Specialized collections of links on one subject or theme, jumplists are the hidden treasure of the Web. To find them, try adding words like links, resources, collection, or list to your search term. Yahoo can be useful for finding jumplists, which you can locate by selecting "Web Directories" from many of its menus and sub-menus. The
Teoma search engine is also useful in locating jumplists, which it calls "expert link collections."
• Sign up for our popular Internet research course to find out more. Among the many topics covered, you'll learn some little-known but potent Google techniques for ferreting out the Net's most stubbornly elusive information!

Finding Information on the Internet: A Tutorial
http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/InvisibleWeb.html

Invisible or Deep Web: What it is, How to find it, and its inherent ambiguity

What is the "Invisible Web", a.k.a. the "Deep Web"?

Why isn't everything visible?

There are still some hurdles search engine crawlers cannot leap. Here are some examples of material that remains hidden from general search engines:

• The Contents of Searchable Databases. When you search in a library catalog, article database, statistical database, etc., the results are generated "on the fly" in answer to your search. Because the crawler programs cannot type or think, they cannot enter passwords on a login screen or keywords in a search box. Thus, these databases must be searched separately.
  o A special case: Google Scholar is part of the public or visible web. It contains citations to journal articles and other publications, with links to publishers or other sources where one can try to access the full text of the items. This is convenient, but results in Google Scholar are only a small fraction of all the scholarly publications that exist online. Much more - including most of the full text - is available through article databases that are part of the invisible web. The UC Berkeley Library subscribes to over 200 of these, accessible to our students, faculty, staff, and on-campus visitors through our Find Articles page.
• Excluded Pages. Search engine companies exclude some types of pages by policy, to avoid cluttering their databases with unwanted content.
  o Dynamically generated pages of little value beyond single use.
Think of the billions of possible web pages generated by searches for books in library catalogs, public-record databases, etc. Each of these is created in response to a specific need. Search engines do not want all these pages in their web databases, since they generally are not of broad interest.
  o Pages deliberately excluded by their owners. A web page creator who does not want his/her page showing up in search engines can insert special "meta tags" that will not display on the screen, but will cause most search engine crawlers to avoid the page.
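The two kinds of hidden material described above can be illustrated with a toy crawler. This is only a sketch under stated assumptions, not how any real search engine works: the four-page site is hypothetical and held in memory, the class and function names are invented for the illustration, and the owner's opt-out is modelled with the common robots "noindex" meta tag. The crawler can only follow hyperlinks, so a page that exists only as the answer to a typed query is never reached, and a page carrying the opt-out tag is reached but not indexed.

```python
from html.parser import HTMLParser


class PageScanner(HTMLParser):
    """Collect outgoing links and detect a robots 'noindex' meta tag."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            if "noindex" in (attrs.get("content") or "").lower():
                self.noindex = True


# Hypothetical site held in memory; no network access is involved.
SITE = {
    "/index.html": '<a href="/about.html">About</a> <a href="/private.html">Private</a>'
                   '<form action="/search"><input name="q"></form>',
    "/about.html": "<p>About this catalog.</p>",
    "/private.html": '<meta name="robots" content="noindex"><p>Owner opted out.</p>',
    # Exists only as the answer to a typed query; nothing links to it.
    "/search?q=whales": "<p>Results generated on the fly.</p>",
}


def crawl(site, start):
    """Follow hyperlinks from `start` and return the pages that get indexed."""
    indexed, seen, queue = set(), set(), [start]
    while queue:
        url = queue.pop()
        if url in seen or url not in site:
            continue
        seen.add(url)
        scanner = PageScanner()
        scanner.feed(site[url])
        if not scanner.noindex:      # respect the owner's opt-out tag
            indexed.add(url)
        queue.extend(scanner.links)  # a search form is not a link, so it is never "typed into"
    return indexed


indexed = crawl(SITE, "/index.html")
invisible = set(SITE) - indexed
print("Indexed:  ", sorted(indexed))
print("Invisible:", sorted(invisible))
```

In this sketch the search-results page and the opted-out page both end up in the invisible set, for the two different reasons given in the list above.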
How to Find the Invisible Web

Simply think "databases" and keep your eyes open. You can find searchable databases containing invisible web pages in the course of routine searching in most general web directories. Of particular value in academic research are:
• ipl2
• Infomine

Use Google and other search engines to locate searchable databases by searching a subject term and the word "database". If the database uses the word database in its own pages, you are likely to find it in Google. The word "database" is also useful in searching a topic in the Google Directory or the Yahoo! directory, because they sometimes use the term to describe searchable databases in their listings.

Examples:
plane crash database
languages database
toxic chemicals database

Remember that the Invisible Web exists. In addition to what you find in search engine results (including Google Scholar) and most web directories, there are other gold mines you have to search directly. This includes all of the licensed article, magazine, reference, news archives, and other research resources that libraries and some industries buy for those authorized to use them.

As part of your web search strategy, spend a little time looking for databases in your field or topic of study or research. The contents of these may not be freely available: libraries and corporations buy the rights for their authorized users to view the contents. If they appear free, it's because you are somehow authorized to search and read the contents (library card holder, company employee, etc.).

The Ambiguity Inherent in the Invisible Web:

It is very difficult to predict what sites or kinds of sites or portions of sites will or won't be part of the Invisible Web. There are several factors involved:
  o Which sites replicate some of their content in static pages (hybrid of visible and invisible in some combination)?
  o Which replicate it all (visible in search engines if you construct a search matching terms in the page)?
  o Which databases replicate none of their dynamically generated pages in links and must be searched directly (totally invisible)?
  o Search engines can change their policies on what they exclude and include.

Want to learn more about the Invisible Web?
• The Wikipedia "Deep Web" article provides a fairly up-to-date summary, with links to other resources.
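The tutorial's advice to search a subject term together with the word "database" is easy to automate. The sketch below is illustrative only: the function names are made up for this example, and the Google search address with its `q` parameter is an assumption about the URL format rather than something stated in the tutorial.

```python
from urllib.parse import urlencode


def database_query(subject):
    """Append the word 'database' to a subject term, as the tutorial suggests."""
    return f"{subject.strip()} database"


def search_url(query, base="https://www.google.com/search"):
    """Build a search URL; the base address and 'q' parameter are assumptions."""
    return f"{base}?{urlencode({'q': query})}"


# The three example subjects come from the tutorial text.
for subject in ["plane crash", "languages", "toxic chemicals"]:
    query = database_query(subject)
    print(query, "->", search_url(query))
```

Swapping the `base` argument for another engine's search address lets the same queries be tried in several places, which fits the tutorial's point that each engine covers a different slice of the web.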
10 Search Engines to Explore the Invisible Web
by Saikat Basu, March 14, 2010

Saikat is a techno-adventurer in a writer's garb. When he is not scouring the net for tech news, you can catch him looking for life hacks and learning tidbits.

The Invisible Web refers to the part of the WWW that's not indexed by the search engines. Most of us think that search powerhouses like Google and Bing are like the Great Oracle... they see everything. Unfortunately, they can't because they aren't divine at all; they are just web spiders who index pages by following one hyperlink after the other.

But there are some places where a spider cannot enter. Take library databases which need a password for access. Or even pages that belong to private networks of organizations. Dynamically generated web pages in response to a query are often left un-indexed by search engine spiders.

Search engine technology has progressed by leaps and bounds. Today, we have real time search and the capability to index Flash based and PDF content. Even then, there remain large swathes of the web which a general search engine cannot penetrate. The terms Deep Net, Deep Web and Invisible Web linger on.

To get a more precise idea of the nature of this 'Dark Continent' involving the invisible and web search engines, read what Wikipedia has to say about the Deep Web. The figures are attention grabbers: the size of the open web is 167 terabytes. The Invisible Web is estimated at 91,000 terabytes. Check this out: the Library of Congress, in 1997, was figured to have close to 3,000 terabytes!

How do we get to this mother lode of information?

That's what this post is all about. Let's get to know a few resources which will be our deep diving vessel for the Invisible Web. Some of these are invisible web search engines with specifically indexed information.

Infomine
Infomine has been built by a pool of libraries in the United States. Some of them are University of California, Wake Forest University, California State University, and the University of Detroit. Infomine 'mines' information from databases, electronic journals, electronic books, bulletin boards, mailing lists, online library card catalogs, articles, directories of researchers, and many other resources.

You can search by subject category and further tweak your search using the search options. Infomine is not only a standalone search engine for the Deep Web but also a staging point for a lot of other reference information. Check out its Other Search Tools and General Reference links at the bottom.

The WWW Virtual Library

This is considered to be the oldest catalog on the web and was started by Tim Berners-Lee, the creator of the web. So, isn't it strange that it finds a place in the list of Invisible Web resources? Maybe, but the WWW Virtual Library lists quite a lot of relevant resources on quite a lot of subjects. You can go vertically into the categories or use the search bar. The screenshot shows the alphabetical arrangement of subjects covered at the site.

Intute
Intute is UK centric, but it has some of the most esteemed universities of the region providing the resources for study and research. You can browse by subject or do a keyword search for academic topics ranging from agriculture to veterinary medicine. The online service has subject specialists who review and index other websites that cater to the topics for study and research.

Intute also provides over 60 free online tutorials for learning effective internet research skills. Tutorials are step by step guides and are arranged around specific subjects.

Complete Planet

Complete Planet calls itself the 'front door to the Deep Web'. This free and well designed directory resource makes it easy to access the mass of dynamic databases that are cloaked from a general purpose search. The databases indexed by Complete Planet number around 70,000 and range from Agriculture to Weather. Also thrown in are databases like Food & Drink and Military.

For a really effective Deep Web search, try out the Advanced Search options where, among other things, you can set a date range.

Infoplease
Infoplease is an information portal with a host of features. Using the site, you can tap into a good number of encyclopedias, almanacs, an atlas, and biographies. Infoplease also has a few nice offshoots like Factmonster.com for kids and Biosearch, a search engine just for biographies.

DeepPeep

DeepPeep aims to enter the Invisible Web through forms that query databases and web services for information. Typed queries open up dynamic but short-lived results which cannot be indexed by normal search engines. By indexing databases, DeepPeep hopes to track 45,000 forms across 7 domains.

The domains covered by DeepPeep (Beta) are Auto, Airfare, Biology, Book, Hotel, Job, and Rental. Being a beta service, there are occasional glitches as some results don't load in the browser.

IncyWincy

IncyWincy is an Invisible Web search engine and it behaves as a meta-search engine by tapping into other search engines and filtering the results. It searches the web, directory, forms, and images. With a free registration, you can track search results with alerts.

DeepWebTech
DeepWebTech gives you five search engines (and browser plugins) for specific topics. The search engines cover science, medicine, and business. Using these topic-specific search engines, you can query the underlying databases in the Deep Web.

Scirus

Scirus has a pure scientific focus. It is a far reaching research engine that can scour journals, scientists' homepages, courseware, pre-print server material, patents and institutional intranets.

TechXtra
TechXtra concentrates on engineering, mathematics and computing. It gives you industry news, job announcements, technical reports, technical data, full text eprints, teaching and learning resources along with articles and relevant website information.

Just like general web search, searching the Invisible Web is also about looking for the needle in the haystack. Only here, the haystack is much bigger. The Invisible Web is definitely not for the casual searcher. It is deep but not dark, because if you know what you are searching for, enlightenment is a few keywords away.

Do you venture into the Invisible Web? Which is your preferred search tool?

The Invisible Web Databases

Which database might have the information I need?
• Turbo10: Search user-selected deep Web resources
• Resource Discovery Network: Keyword search
• Complete Planet: Deep Web directory
• Digital Librarian and Librarians' Guide to the Internet: Uncover databases

News and magazines
• Google News (for US, UK, others): Search 30 day news archive
• AltaVista News: Includes New York Times
• 1st Headlines: Breaking news in categories (US & World; Business; Health; Lifestyles; Sports; Technology; Weather)
• New York Times, Washington Post, Seattle Times, San Francisco Chronicle: Full-text newspaper archive search (14 or 30 day trials available)
• HeadlineSpot: Search news directory by media, region, subject, opinion
• Directory of Open Access Journals (DOAJ): Search or browse by subject for peer-reviewed, scientific and scholarly titles
• HeadlineSpot: Magazines: Search magazine directory by subject

Public Radio webcasts
• PublicRadioFan.com: Search database of program listings

History
• Guide to History on the Web: Database of more than 5,000 US and world history sites

Biography
• Galileo Project, Thomas A. Edison Papers: Individuals
• Biography.com: 25,000 people
• Biographical Dictionary: 28,000 short identification information

Countries
• Nations Online Project, Thomas A. Edison Papers: Alphabetical index to government Web pages
• Portals to the World: From the Library of Congress
• World Fact Book: From the CIA
• Infonation: U.N. member nations
• Country Profiles: From the BBC

Data
• Finding and Using Statistical Data

Books (full text)
• Online Books Page: Free e-books
