The Little Search Engine That Can! How To Get Started on an Internet Search
Mrs. Cathleen Carpenter
Course: Searching and Researching on the Internet
How Search Engines Work
The term "search engine" is often used generically to describe both crawler-based search engines and human-powered directories. These two types of search engines gather their listings in very different ways.
Crawler-Based Search Engines
Crawler-based search engines, such as Google, create their listings automatically. They "crawl" or "spider" the web, then people search through what they have found. If you change your web pages, crawler-based search engines eventually find these changes, and that can affect how you are listed. Page titles, body copy and other elements all play a role.
Human-Powered Directories
A human-powered directory, such as the Open Directory, depends on people for its listings. You submit a short description to the directory for your entire site, or editors write one for sites they review. A search looks for matches only in the descriptions submitted. Changing your web pages has no effect on your listing.
Resource: Danny Sullivan, Search Engine Watch, Mar 14, 2007, http://searchenginewatch.com/showPage.html?page=2168031
If this student were paying attention, he might be able to do a search and find a job that allows him to sleep all day!
Search Engine Elements
Crawler-based search engines have three major elements.
First is the spider, also called the crawler. The spider visits a web page, reads it, and then follows links to other pages within the site. This is what it means when someone refers to a site being "spidered" or "crawled." The spider returns to the site on a regular basis, such as every month or two, to look for changes.
Everything the spider finds goes into the second part of the search engine, the index. The index, sometimes called the catalog, is like a giant book containing a copy of every web page that the spider finds. If a web page changes, then this book is updated with new information.
Search engine software is the third part of a search engine. This is the program that sifts through the millions of pages recorded in the index to find matches to a search and rank them in order of what it believes is most relevant.
Resource: Danny Sullivan, Search Engine Watch, Mar 14, 2007, http://searchenginewatch.com/showPage.html?page=2168031
[Diagram labels: Crawler, Index, Software]
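The three elements described above can be sketched in a few lines of Python. This is only a toy illustration, not how any real search engine is built: the three-page "web" in `PAGES`, its URLs, and the function names are all made up for the example, and the ranking step is left out (the search simply returns every page that contains all of the query words).

```python
import re
from collections import defaultdict

# Hypothetical three-page "web": URL -> (page text, links found on the page).
PAGES = {
    "site/home": ("Welcome to our travel guide", ["site/paris", "site/rome"]),
    "site/paris": ("Paris travel tips and hotels", ["site/home"]),
    "site/rome": ("Rome hotels and restaurants", []),
}

def crawl(start):
    """Spider: visit a page, read it, then follow its links to other pages."""
    seen, queue = {}, [start]
    while queue:
        url = queue.pop(0)
        if url in seen or url not in PAGES:
            continue  # already visited, or a link that leads nowhere
        text, links = PAGES[url]
        seen[url] = text
        queue.extend(links)
    return seen

def build_index(pages):
    """Index: record, for every word, the set of pages that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(url)
    return index

def search(index, query):
    """Search software: find the pages that contain every query word."""
    words = re.findall(r"[a-z]+", query.lower())
    if not words:
        return []
    matches = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(matches)

index = build_index(crawl("site/home"))
print(search(index, "travel"))
print(search(index, "hotels rome"))
```

Note that the search never touches the pages themselves, only the index. That is why a change to a web page shows up in results only after the spider has returned and the index has been updated.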
Remember, you are smarter than a computer. Use your intelligence.
Search engines are fast, but dumb. A search engine's ability to understand what you want is very limited. It will obediently look for occurrences of your keywords all over the Web, but it doesn't understand what your keywords mean or why they're important to you. To a search engine, a keyword is just a string of characters. It doesn't know the difference between cancer the crab and cancer the disease...and it doesn't care. But you know what your query means (at least, we hope you do!). Therefore, you must supply the brains. The search engine will supply the raw computing power.
Key to Success...
1. Know where to look first.
2. Fine-tune your keywords.
3. Be refined.
4. Query (search) by example.
5. Anticipate the answers.
Resource: http://www.monash.com/spidap5.html
1. Know Where To Look First
Are you looking for information about a person? A company? A software product? A health-related problem? Do you want to find a job? Get a date? Plan a vacation? Do you need to research a term paper? Document a news story? There are various databases containing specific information that might be more useful to you than a general search engine.
2. Fine-tune your keywords
If you're searching on a noun (the name of a person, place or thing), remember that most nouns are subsets of other nouns. Enter the smallest possible subset that describes what you want. Be specific. Example: If you want to buy a car, don't enter the keyword "car" if you can enter the keyword "Toyota." Better still, enter the phrase "Toyota Dealerships" AND the name of the city where you live.
3. Be Refined
Read the help files and take advantage of the available search refining options. Use phrases, if possible. Use the Boolean AND (or the character +) to include other keywords that you would expect to find in relevant documents. Learn to EXCLUDE with the Boolean NOT. Excluding is particularly important as the Web grows and more documents are posted.
Resource: http://www.monash.com/spidap5.html
4. Query by example
Take advantage of the option that many search engine sites are now offering: you can "query by example," or "find similar sites," to the ones that come up on your initial hit list. Essentially what you're doing is telling the search engine, "yes, this looks promising, give me more like this one."
5. Anticipate the answers
Before searching, try to imagine what the ideal page you would like to access would look like. Think about the words its title would contain. Think about what words would be in the first couple of sentences of a webpage that you would consider useful. Use those words, or that phrase, when you enter your query.
Example: If you want to find out medical details about your grandmother's diagnosis of Alzheimer's Disease, try entering "Alzheimer's" AND "symptoms" AND "prognosis." If you want to find out about Alzheimer's care and community resources, query on "Alzheimer's" AND "support groups" AND "resources" AND NOT "symptoms."
Resource: http://www.monash.com/spidap5.html
Why Your Search Results Come Back in a Certain Order
Search for anything using your favorite crawler-based search engine. Nearly instantly, the search engine will sort through the millions of pages it knows about and present you with ones that match your topic. The matches will even be ranked, so that the most relevant ones come first.
Of course, the search engines don't always get it right. Non-relevant pages make it through, and sometimes it may take a little more digging to find what you are looking for. But, by and large, search engines do an amazing job.
As WebCrawler founder Brian Pinkerton puts it, "Imagine walking up to a librarian and saying, 'travel.' They're going to look at you with a blank face."
OK -- a librarian's not really going to stare at you with a vacant expression. Instead, they're going to ask you questions to better understand what you are looking for. Unfortunately, search engines don't have the ability to ask a few questions to focus your search, as a librarian can. They also can't rely on judgment and past experience to rank web pages, in the way humans can.
So, how do crawler-based search engines go about determining relevancy, when confronted with hundreds of millions of web pages to sort through? They follow a set of rules, known as an algorithm, and all major search engines follow some general rules.
Resource: Danny Sullivan, Search Engine Watch, Mar 14, 2007, http://searchenginewatch.com/showPage.html?page=2168031
Location, Location, Location...and Frequency
One of the main rules involves the location and frequency of keywords on a web page. Call it the location/frequency method, for short.
Remember the librarian mentioned before? They need to find books to match your request of "travel," so it makes sense that they first look at books with travel in the title. Search engines operate the same way. Pages with the search terms appearing in the HTML title tag are often assumed to be more relevant than others to the topic.
Search engines will also check to see if the search keywords appear near the top of a web page, such as in the headline or in the first few paragraphs of text. They assume that any page relevant to the topic will mention those words right from the beginning.
Frequency is the other major factor in how search engines determine relevancy. A search engine will analyze how often keywords appear in relation to other words in a web page. Those with a higher frequency are often deemed more relevant than other web pages.
Search engines may also penalize pages or exclude them from the index, if they detect search engine "spamming." An example is when a word is repeated hundreds of times on a page, to increase the frequency and propel the page higher in the listings. Search engines watch for common spamming methods in a variety of ways, including following up on complaints from their users.
Resource: Danny Sullivan, Search Engine Watch, Mar 14, 2007, http://searchenginewatch.com/showPage.html?page=2168031
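To make the location/frequency idea concrete, here is a rough Python sketch of a scoring function. It is an assumption-laden toy, not any real search engine's algorithm: the weights (`title_bonus`, `top_bonus`), the cut-off for "near the top" (`top_n`), and the three sample pages are invented purely for illustration.

```python
import re

def words(text):
    """Split text into lowercase words, ignoring punctuation."""
    return re.findall(r"[a-z]+", text.lower())

def score(page, keyword, title_bonus=2.0, top_bonus=1.0, top_n=20):
    """Toy location/frequency score. All weights are arbitrary assumptions."""
    keyword = keyword.lower()
    body = words(page["body"])
    total = 0.0
    if keyword in words(page["title"]):
        total += title_bonus                      # location: in the title
    if keyword in body[:top_n]:
        total += top_bonus                        # location: near the top
    if body:
        total += body.count(keyword) / len(body)  # frequency vs. other words
    return total

# Hypothetical pages, made up for this example.
pages = [
    {"title": "My diary", "body": "Last year I did some travel."},
    {"title": "Cooking at home", "body": "Recipes for busy evenings."},
    {"title": "Travel on a budget", "body": "Travel cheaply with these travel tips."},
]

ranked = sorted(pages, key=lambda p: score(p, "travel"), reverse=True)
for page in ranked:
    print(page["title"])
```

A score this simple is easy to game: repeating a word over and over pushes the frequency term up. That is the kind of "spamming" described above, and it is one reason real search engines look for it and penalize it.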
Keyword Searching
What is a keyword, exactly? It can simply be any word on a webpage. For example, I used the word "simply" in the previous sentence, making it one of the keywords for this particular webpage in some search engine's index. However, since the word "simply" has nothing to do with the subject of this webpage (i.e., how search engines work), it is not a very useful keyword. Useful keywords and key phrases for this page would be "search," "search engines," "search engine methods," "how search engines work," "ranking," "relevancy," "search engine tutorials," etc. Those keywords would actually tell a user something about the subject and content of this page.
Unless the author of the Web document specifies the keywords for her document, it's up to the search engine to determine them. Essentially, this means that search engines pull out and index words that appear to be significant. Since engines are software programs, not rational human beings, they work according to rules established by their creators for what words are usually important in a broad range of documents.
Resource: http://www.monash.com/spidap5.html
The title of a page, for example, usually gives useful information about the subject of the page (if it doesn't, it should!). Words that are mentioned towards the beginning of a document (think of the "topic sentence" in a high school essay, where you lay out the subject you intend to discuss) are given more weight by most search engines. The same goes for words that are repeated several times throughout the document.
Problems?
Keyword searches have a tough time distinguishing between words that are spelled the same way, but mean something different (e.g., hard cider, a hard stone, a hard exam, and the hard drive on your computer). This often results in hits that are completely irrelevant to your query (search).
Search engines also cannot return hits on keywords that mean the same thing, but are not actually entered in your query. A query on heart disease would not return a document that used the word "cardiac" instead of "heart."
Resource: http://www.monash.com/spidap5.html
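Both problems follow directly from treating a keyword as "just a string of characters." The short Python sketch below shows this with a purely literal matcher; the function name and the four sample documents are hypothetical, and real search engines do considerably more than this.

```python
import re

def literal_match(query, document):
    """True if every query word appears in the document, letter for letter."""
    doc_words = set(re.findall(r"[a-z]+", document.lower()))
    query_words = re.findall(r"[a-z]+", query.lower())
    return all(word in doc_words for word in query_words)

# Hypothetical documents, made up for this example.
documents = [
    "Cardiac disease: risk factors and treatment",
    "Heart disease: risk factors and treatment",
    "How to brew hard cider at home",
    "Replacing the hard drive in your computer",
]

for doc in documents:
    print(literal_match("heart disease", doc), literal_match("hard", doc), doc)
```

The query "heart disease" misses the first document even though it covers the same subject, because "cardiac" is a different string. The query "hard" matches both the cider page and the computer page, because the matcher has no idea which meaning you had in mind.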
Refining Your Search
Most sites offer two different types of searches--"basic" and "refined" or "advanced." In a "basic" search, you just enter a keyword without sifting through any pull-down menus of additional options. Depending on the engine, though, "basic" searches can be quite complex.
Advanced search refining options differ from one search engine to another, but some of the possibilities include the ability to search on more than one word, to give more weight to one search term than you give to another, and to exclude words that might be likely to muddy the results. You might also be able to search on proper names, on phrases, and on words that are found within a certain proximity to other search terms.
Some search engines also allow you to specify what form you'd like your results to appear in, and whether you wish to restrict your search to certain parts of the Internet (e.g., Usenet or the Web) or to specific parts of Web documents (e.g., the title or URL).
Resource: http://www.monash.com/spidap5.html
Final Hints
Many, but not all, search engines allow you to use so-called Boolean operators to refine your search. These are the logical terms AND, OR, NOT, and the so-called proximal locators, NEAR and FOLLOWED BY.
1. Boolean AND means that all the terms you specify must appear in the documents, e.g., "heart" AND "attack." You might use this if you wanted to exclude common hits that would be irrelevant to your query.
2. Boolean OR means that at least one of the terms you specify must appear in the documents, e.g., bronchitis, acute OR chronic. You might use this if you didn't want to rule out too much.
3. Boolean NOT means that the term you specify after NOT must not appear in the documents. You might use this if you anticipated results that would be totally off-base, e.g., nirvana AND Buddhism, NOT Cobain.
4. NEAR means that the terms you enter should be within a certain number of words of each other. FOLLOWED BY means that one term must directly follow the other.
Capitalization: This is essential for searching on proper names of people, companies or products. Unfortunately, many words in English are used both as proper and common nouns--Bill, bill, Gates, gates, Oracle, oracle, Lotus, lotus, Digital, digital--the list is endless.
Resource: http://www.monash.com/spidap5.html
All graphics: www.animationfactory.com
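The operators above map neatly onto ordinary program logic. The Python sketch below is one possible reading of them, not the syntax or behavior of any particular search engine: the helper names (`has`, `near`, `followed_by`), the default distance of five words, and the three sample documents are assumptions made for this example.

```python
import re

def tokens(text):
    """Split text into lowercase words, ignoring digits and punctuation."""
    return re.findall(r"[a-z]+", text.lower())

def has(doc, term):
    """True if the term appears anywhere in the document."""
    return term.lower() in tokens(doc)

def near(doc, a, b, distance=5):
    """NEAR: a and b occur within `distance` words of each other."""
    toks = tokens(doc)
    pos_a = [i for i, t in enumerate(toks) if t == a.lower()]
    pos_b = [i for i, t in enumerate(toks) if t == b.lower()]
    return any(abs(i - j) <= distance for i in pos_a for j in pos_b)

def followed_by(doc, a, b):
    """FOLLOWED BY: b is the very next word after a."""
    toks = tokens(doc)
    return any(x == a.lower() and y == b.lower() for x, y in zip(toks, toks[1:]))

# Hypothetical documents, made up for this example.
docs = [
    "Nirvana is the goal of the Buddhist path in Buddhism",
    "Kurt Cobain formed Nirvana in 1987",
    "Signs of a heart attack",
]

# nirvana AND Buddhism, NOT Cobain
hits = [d for d in docs
        if has(d, "nirvana") and has(d, "buddhism") and not has(d, "cobain")]
print(hits)

# heart FOLLOWED BY attack
print([d for d in docs if followed_by(d, "heart", "attack")])
```

Capitalization is deliberately ignored here (everything is lowercased), so this sketch cannot tell "Gates" from "gates." A search engine that honors capitalization would have to keep the original case in its index.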