Search engine

1. [http://en.wikipedia.org/wiki/Search_engine_optimization]

Search engine optimization (SEO) is the process of improving the visibility of a website or a webpage in search engines via the "natural" or unpaid ("organic" or "algorithmic") search results. Additional search engine marketing (SEM) methods, including paid listings, may achieve greater effectiveness. In general, the earlier (or higher on the page) and the more frequently a site appears in the search results list, the more visitors it will receive from the search engine's users. SEO may target different kinds of search, including image search, local search, video search, academic search, news search and industry-specific vertical search engines. This gives a website web presence.

As an Internet marketing strategy, SEO considers how search engines work, what people search for, the actual search terms typed into search engines, and which search engines are preferred by the targeted audience. Optimizing a website may involve editing its content, HTML and associated coding, both to increase its relevance to specific keywords and to remove barriers to the indexing activities of search engines. Promoting a site to increase the number of backlinks, or inbound links, is another SEO tactic.

The acronym "SEOs" can refer to "search engine optimizers," a term adopted by an industry of consultants who carry out optimization projects on behalf of clients, and by employees who perform SEO services in-house. Search engine optimizers may offer SEO as a stand-alone service or as a part of a broader marketing campaign. Because effective SEO may require changes to the HTML source code of a site and to site content, SEO tactics may be incorporated into website development and design. The term "search engine friendly" may be used to describe website designs, menus, content management systems, images, videos, shopping carts, and other elements that have been optimized for the purpose of search engine exposure.

Another class of techniques, known as black hat SEO, search engine poisoning, or spamdexing, uses methods such as link farms, keyword stuffing and article spinning that degrade both the relevance of search results and the quality of the user experience with search engines. Search engines look for sites that employ these techniques in order to remove them from their indices.

Search engine optimization methods are techniques used by webmasters to get more visibility for their sites in search engine results pages.

Getting indexed

The leading search engines, such as Google, Bing and Yahoo!, use crawlers to find pages for their algorithmic search results. Pages that are linked from other search-engine-indexed pages do not need to be submitted because they are found automatically. Some search engines, notably Yahoo!, operate a paid submission service that guarantees crawling for either a set fee or cost per click. Such programs usually guarantee inclusion in the database, but do not guarantee specific ranking within the search results. Two major directories, the Yahoo! Directory and the Open Directory Project, both require manual submission and human editorial review.[2] Google offers Google Webmaster Tools, for which an XML Sitemap feed can be created and submitted for free to ensure that all pages are found, especially pages that aren't discoverable by automatically following links.[3]
Search engine crawlers may look at a number of different factors when crawling a site. Not every page is indexed by the search engines. The distance of pages from the root directory of a site may also be a factor in whether or not pages get crawled.

Other methods

A variety of other methods are employed to get a webpage indexed and shown higher in the results, and often a combination of these methods is used as part of a search engine optimization campaign.

• Cross linking between pages of the same website, and giving more links to the main pages of the website, to increase the PageRank used by search engines. Linking from other websites, including link farming and comment spam.
• Keyword-rich text and key phrases in the webpage, so as to match all search queries.[7] Adding relevant keywords to a web page's meta tags, including keyword stuffing.
• URL normalization for webpages with multiple URLs, using the "canonical" tag.[8]
• A backlink from a Web directory.
• SEO trending based on recent search behaviour, using tools like Google Insights for Search.
• Media content creation, such as press releases and online newsletters, to generate incoming links.

Content Creation and Linking

Content creation is one of the primary focuses of any SEO's job. Without unique, relevant, and easily scannable content, users tend to spend little to no time paying attention to a website. Almost all SEOs that provide organic search improvement focus heavily on creating this type of content, or "linkbait". Linkbait is a term used to describe content that is designed to be shared and replicated virally in an effort to gain backlinks.

Often, webmasters and content administrators create blogs to easily provide this information through a method that is intrinsically viral. However, most forget that traffic generated to blog accounts doesn't point back to their respective domains, so they lose "link juice". Link juice is jargon for links that provide a boost to PageRank and TrustRank. Changing the domain of the blog to a subdomain of the respective domain is a quick way to combat this siphoning of link juice.

Other commonly implemented methodologies for creating and disseminating content include YouTube videos, Google Places accounts, as well as Picasa and Flickr photos indexed in Google Images searches. These additional forms of content allow webmasters to produce content that ranks well in the world's second most popular search engine, YouTube, in addition to appearing in organic search results.

Gray hat techniques
Gray hat techniques are those that are neither really white hat nor black hat. Some of these gray hat techniques may be argued either way, and they might carry some risk. A good example of such a technique is purchasing links. The average price for a text link depends on the perceived authority of the linking page. That authority is sometimes measured by Google's PageRank, although this is not necessarily an accurate way of determining the importance of a page.

While Google is against the sale and purchase of links, there are people who subscribe to online magazines, memberships and other resources for the purpose of getting a link back to their website.

Another widely used gray hat technique is a webmaster creating multiple micro-sites which he or she controls for the sole purpose of cross linking to the target site. Since the same owner controls all the micro-sites, this self-linking violates the principles behind the search engines' algorithms, but because ownership of sites is not traceable by search engines it is difficult to detect, and the micro-sites can appear as different sites, especially when using separate Class-C IPs.

In computing, spamdexing (also known as search spam, search engine spam, web spam or search engine poisoning) is the deliberate manipulation of search engine indexes. It involves a number of methods, such as repeating unrelated phrases, to manipulate the relevance or prominence of resources indexed in a manner inconsistent with the purpose of the indexing system. Some consider it to be a part of search engine optimization, though there are many search engine optimization methods that improve the quality and appearance of the content of web sites and serve content useful to many users. Search engines use a variety of algorithms to determine relevancy ranking. Some of these include determining whether the search term appears in the META keywords tag, others whether the search term appears in the body text or URL of a web page. Many search engines check for instances of spamdexing and will remove suspect pages from their indexes. Also, people working for a search-engine organization can quickly block the results listing from entire websites that use spamdexing, perhaps alerted by user complaints of false matches. The rise of spamdexing in the mid-1990s made the leading search engines of the time less useful.

Common spamdexing techniques can be classified into two broad classes: content spam (or term spam) and link spam.

Content spam

These techniques involve altering the logical view that a search engine has of the page's contents. They all aim at variants of the vector space model for information retrieval on text collections.

Keyword stuffing

Keyword stuffing involves the calculated placement of keywords within a page to raise the keyword count, variety, and density of the page. This is useful to make a page appear to be relevant for a web crawler in a way that makes it more likely to be found. Example: a promoter of a Ponzi scheme wants to attract web surfers to a site where he advertises his scam. He places hidden text appropriate for a fan page of a popular music group on his page, hoping that the page will be listed as a fan site and receive many visits from music lovers.
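To make "keyword density" concrete, here is a minimal illustrative sketch (not from the original text; the page content is made up) that counts how often a term appears in a page's text and divides by the total word count, the kind of naive frequency signal that keyword stuffing tries to inflate:

    import java.util.Arrays;

    public class KeywordDensity {
        // Fraction of words in the page text that match the keyword (case-insensitive).
        static double density(String pageText, String keyword) {
            String[] words = pageText.toLowerCase().split("\\W+");
            long total = Arrays.stream(words).filter(w -> !w.isEmpty()).count();
            long hits = Arrays.stream(words).filter(w -> w.equals(keyword.toLowerCase())).count();
            return total == 0 ? 0.0 : (double) hits / total;
        }

        public static void main(String[] args) {
            String stuffedPage = "cheap tickets buy cheap tickets now cheap cheap cheap";
            System.out.printf("density(cheap) = %.2f%n", density(stuffedPage, "cheap"));
        }
    }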
Older versions of indexing programs simply counted how often a keyword appeared and used that to determine relevance levels. Most modern search engines have the ability to analyze a page for keyword stuffing and determine whether the frequency is consistent with other sites created specifically to attract search engine traffic. Also, large webpages are truncated, so that massive dictionary lists cannot be indexed on a single webpage.

Hidden or invisible text

Unrelated hidden text is disguised by making it the same color as the background, using a tiny font size, or hiding it within HTML code such as "noframes" sections, alt attributes, zero-sized DIVs, and "noscript" sections. People screening websites for a search-engine company might temporarily or permanently block an entire website for having invisible text on some of its pages. However, hidden text is not always spamdexing: it can also be used to enhance accessibility.

Meta-tag stuffing

This involves repeating keywords in the meta tags and using meta keywords that are unrelated to the site's content. This tactic has been ineffective since 2005.

Doorway pages

"Gateway" or doorway pages are low-quality web pages created with very little content; instead they are stuffed with very similar keywords and phrases. They are designed to rank highly within the search results, but serve no purpose to visitors looking for information. A doorway page will generally have "click here to enter" on the page.

Scraper sites

Scraper sites are created using various programs designed to "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission. Such websites are generally full of advertising (such as pay-per-click ads), or they redirect the user to other sites. It is even feasible for scraper sites to outrank original websites for their own information and organization names.

Article spinning

Article spinning involves rewriting existing articles, as opposed to merely scraping content from other sites, to avoid penalties imposed by search engines for duplicate content. This process is undertaken by hired writers or automated using a thesaurus database or a neural network.

Link spam

Link spam is defined as links between pages that are present for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give a website a higher ranking the more other highly ranked websites link to it. These techniques also aim at influencing other link-based ranking techniques, such as the HITS algorithm.
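For context, the link-based ranking that link spam tries to exploit can be pictured with a minimal PageRank-style power iteration. This is an illustrative sketch only (the link graph and damping factor are made up); it is not part of the original article and not any engine's actual implementation:

    import java.util.Arrays;
    import java.util.List;

    public class TinyPageRank {
        // Iterative PageRank-style scoring over a tiny link graph.
        // links.get(i) lists the pages that page i links to; d is the damping factor.
        static double[] pageRank(List<List<Integer>> links, int iterations, double d) {
            int n = links.size();
            double[] rank = new double[n];
            Arrays.fill(rank, 1.0 / n);
            for (int it = 0; it < iterations; it++) {
                double[] next = new double[n];
                Arrays.fill(next, (1 - d) / n);
                for (int i = 0; i < n; i++) {
                    int out = links.get(i).size();
                    if (out == 0) continue;                // dangling pages distribute nothing here
                    for (int target : links.get(i)) {
                        next[target] += d * rank[i] / out; // each outlink passes on a share of rank
                    }
                }
                rank = next;
            }
            return rank;
        }

        public static void main(String[] args) {
            // Page 2 gains rank because pages 0 and 1 both link to it.
            List<List<Integer>> links = List.of(List.of(2), List.of(2), List.of(0));
            System.out.println(Arrays.toString(pageRank(links, 20, 0.85)));
        }
    }

Link farms and splogs, described next, try to manufacture exactly this kind of incoming-link mass artificially.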
Link-building software

A common form of link spam is the use of link-building software to automate the search engine optimization process.

Link farms

Link farms are tightly knit communities of pages referencing each other, also known humorously as mutual admiration societies.

Hidden links

Putting hyperlinks where visitors will not see them is used to increase link popularity. Highlighted link text can help rank a webpage higher for matching that phrase.

Sybil attack

A Sybil attack is the forging of multiple identities for malicious intent, named after the famous multiple personality disorder patient "Sybil" (Shirley Ardell Mason). A spammer may create multiple web sites at different domain names that all link to each other, such as fake blogs (known as spam blogs).

Spam blogs

Spam blogs are blogs created solely for commercial promotion and the passage of link authority to target sites. Often these "splogs" are designed in a misleading manner that gives the effect of a legitimate website but, upon close inspection, they will often be found to be written with spinning software or to contain very poorly written and barely readable content. They are similar in nature to link farms.

Page hijacking

Page hijacking is achieved by creating a rogue copy of a popular website which shows contents similar to the original to a web crawler, but redirects web surfers to unrelated or malicious websites.

Buying expired domains

Some link spammers monitor DNS records for domains that will expire soon, then buy them when they expire and replace the pages with links to their own pages (see domaining). However, Google resets the link data on expired domains. Some of these techniques may be applied for creating a Google bomb, that is, cooperating with other users to boost the ranking of a particular page for a particular query.
Cookie stuffing

Cookie stuffing involves placing an affiliate tracking cookie on a website visitor's computer without their knowledge, which will then generate revenue for the person doing the cookie stuffing. This not only generates fraudulent affiliate sales, but also has the potential to overwrite other affiliates' cookies, essentially stealing their legitimately earned commissions.

Using world-writable pages

Main article: forum spam

Web sites that can be edited by users can be used by spamdexers to insert links to spam sites if the appropriate anti-spam measures are not taken. Automated spam bots can rapidly make the user-editable portion of a site unusable. Programmers have developed a variety of automated spam prevention techniques to block, or at least slow down, spam bots.

Spam in blogs

Spam in blogs is the placing or solicitation of links randomly on other sites, placing a desired keyword into the hyperlinked text of the inbound link. Guest books, forums, blogs, and any site that accepts visitors' comments are particular targets and are often victims of drive-by spamming, where automated software creates nonsense posts with links that are usually irrelevant and unwanted.

Comment spam

Comment spam is a form of link spam that has arisen in web pages that allow dynamic user editing, such as wikis, blogs, and guest books. It can be problematic because agents can be written that automatically and randomly select a user-edited web page, such as a Wikipedia article, and add spamming links.

Wiki spam

Wiki spam is a form of link spam on wiki pages. The spammer uses the open editability of wiki systems to place links from the wiki site to the spam site. The subject of the spam site is often unrelated to the wiki page where the link is added. In early 2005, Wikipedia implemented a default "nofollow" value for the "rel" HTML attribute. Links with this attribute are ignored by Google's PageRank algorithm. Forum and wiki admins can use these to discourage wiki spam.

Referrer log spamming

Referrer spam takes place when a spam perpetrator or facilitator accesses a web page (the referee) by following a link from another web page (the referrer), so that the referee is given the address of the referrer by the person's Internet browser. Some websites have a referrer log which shows which pages link to that site. By having a robot randomly access many sites enough times, with a message or specific address given as the referrer, that message or Internet address then appears in the referrer logs of those sites that keep referrer logs.
Since some Web search engines base the importance of sites on the number of different sites linking to them, referrer-log spam may increase the search engine rankings of the spammer's sites. Also, site administrators who notice the referrer log entries in their logs may follow the link back to the spammer's referrer page.

2. [http://www.webconfs.com/seo-tutorial/introduction-to-seo.php]

Whenever you enter a query in a search engine and hit enter, you get a list of web results that contain that query term. Users normally tend to visit websites that are at the top of this list, as they perceive those to be more relevant to the query. If you have ever wondered why some of these websites rank better than others, then you must know that it is because of a powerful web marketing technique called Search Engine Optimization (SEO).

SEO is a technique which helps search engines find and rank your site higher than the millions of other sites in response to a search query. SEO thus helps you get traffic from search engines.

This SEO tutorial covers all the necessary information you need to know about Search Engine Optimization: what it is, how it works, and the differences in the ranking criteria of the major search engines.

1. How Search Engines Work

The first basic truth you need to know to learn SEO is that search engines are not humans. While this might be obvious to everybody, the differences between how humans and search engines view web pages aren't. Unlike humans, search engines are text-driven. Although technology advances rapidly, search engines are far from intelligent creatures that can feel the beauty of a cool design or enjoy the sounds and movement in movies. Instead, search engines crawl the Web, looking at particular site items (mainly text) to get an idea of what a site is about. This brief explanation is not the most precise because, as we will see next, search engines perform several activities in order to deliver search results: crawling, indexing, processing, calculating relevancy, and retrieving.
First, search engines crawl the Web to see what is there. This task is performed by a piece of software called a crawler or a spider (or Googlebot, as is the case with Google). Spiders follow links from one page to another and index everything they find on their way. Bearing in mind the number of pages on the Web (over 20 billion), it is impossible for a spider to visit a site daily just to see if a new page has appeared or if an existing page has been modified; sometimes crawlers may not end up visiting your site for a month or two.

What you can do is check what a crawler sees on your site. As already mentioned, crawlers are not humans and they do not see images, Flash movies, JavaScript, frames, password-protected pages and directories, so if you have tons of these on your site, you'd better run the Spider Simulator below to see if these goodies are viewable by the spider. If they are not viewable, they will not be spidered, not indexed, not processed, etc.; in a word, they will be non-existent for search engines.

After a page is crawled, the next step is to index its content. The indexed page is stored in a giant database, from where it can later be retrieved. Essentially, the process of indexing is identifying the words and expressions that best describe the page and assigning the page to particular keywords. For a human it would not be possible to process such amounts of information, but generally search engines deal just fine with this task. Sometimes they might not get the meaning of a page right, but if you help them by optimizing it, it will be easier for them to classify your pages correctly and for you to get higher rankings.

When a search request comes, the search engine processes it, i.e. it compares the search string in the search request with the indexed pages in the database. Since it is likely that more than one page (practically, millions of pages) contains the search string, the search engine starts calculating the relevancy of each of the pages in its index to the search string.

There are various algorithms to calculate relevancy. Each of these algorithms has different relative weights for common factors like keyword density, links, or meta tags. That is why different search engines give different search results pages for the same search string. What is more, it is a known fact that all major search engines, like Yahoo!, Google, Bing, etc., periodically change their algorithms, and if you want to stay at the top, you also need to adapt your pages to the latest changes. This is one reason (the other is your competitors) to devote permanent effort to SEO, if you'd like to be at the top.

The last step in a search engine's activity is retrieving the results. Basically, it is nothing more than simply displaying them in the browser, i.e. the endless pages of search results that are sorted from the most relevant to the least relevant sites.

Indexing

Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, physics and computer science. An alternate name for the process, in the context of search engines designed to find web pages on the Internet, is Web indexing.
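The indexing idea just described, mapping each word to the pages that contain it, can be pictured with a toy inverted index. This is a minimal sketch with made-up URLs, not taken from the tutorial, and it ignores ranking entirely:

    import java.util.*;

    public class TinyIndex {
        // word -> set of page identifiers containing that word (a toy inverted index)
        private final Map<String, Set<String>> index = new HashMap<>();

        // "Indexing": record every word of a crawled page under the page's URL.
        void addPage(String url, String text) {
            for (String word : text.toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    index.computeIfAbsent(word, k -> new HashSet<>()).add(url);
                }
            }
        }

        // "Processing/retrieving": return the pages that contain every word of the query.
        Set<String> search(String query) {
            Set<String> result = null;
            for (String word : query.toLowerCase().split("\\W+")) {
                if (word.isEmpty()) continue;
                Set<String> pages = index.getOrDefault(word, Collections.emptySet());
                if (result == null) result = new HashSet<>(pages); else result.retainAll(pages);
            }
            return result == null ? Collections.emptySet() : result;
        }

        public static void main(String[] args) {
            TinyIndex idx = new TinyIndex();
            idx.addPage("http://example.com/a", "search engine optimization basics");
            idx.addPage("http://example.com/b", "search engine architecture and crawling");
            System.out.println(idx.search("search engine")); // both pages
            System.out.println(idx.search("crawling"));      // only page b
        }
    }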
Popular engines focus on the full-text indexing of online, natural-language documents. Media types such as video, audio and graphics are also searchable. Meta search engines reuse the indices of other services and do not store a local index, whereas cache-based search engines permanently store the index along with the corpus. Unlike full-text indices, partial-text services restrict the depth indexed to reduce index size. Larger services typically perform indexing at a predetermined time interval due to the required time and processing costs, while agent-based search engines index in real time.

1. Search engine architecture [http://www.ibm.com/]

Architecture overview

The architecture of a common Web search engine contains a front-end process and a back-end process, as shown in Figure 1. In the front-end process, the user enters the search words into the search engine interface, which is usually a Web page with an input box. The application then parses the search request into a form that the search engine can understand, and the search engine executes the search operation on the index files. After ranking, the search engine interface returns the search results to the user. In the back-end process, a spider or robot fetches the Web pages from the Internet, and then the indexing subsystem parses the Web pages and stores them into the index files. If you want to use Lucene to build a Web search application, the final architecture will be similar to that shown in Figure 1.

[Figure 1: Web search engine architecture]
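To make the back-end indexing step concrete, the following is a minimal sketch of indexing one fetched page with Lucene. The field names and on-disk path are illustrative, and the exact API differs between Lucene releases (this sketch assumes a recent version):

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class Indexer {
        public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.open(Paths.get("webindex"));   // index files on disk
            IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));

            // One fetched page becomes one Lucene Document with a few fields.
            Document doc = new Document();
            doc.add(new StringField("url", "http://example.com/", Field.Store.YES)); // stored, not tokenized
            doc.add(new TextField("title", "Example page", Field.Store.YES));        // tokenized and stored
            doc.add(new TextField("content", "Text extracted from the fetched page", Field.Store.NO));
            writer.addDocument(doc);

            writer.close();
        }
    }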
Implement advanced search with Lucene

Lucene supports several kinds of advanced searches, which I'll discuss in this section. I'll then demonstrate how to implement these searches with Lucene's Application Programming Interfaces (APIs).

Boolean operators

Most search engines provide Boolean operators so users can compose queries. Typical Boolean operators are AND, OR, and NOT. Lucene provides five Boolean operators: AND, OR, NOT, plus (+), and minus (-). I'll describe each of these operators.

• OR: If you want to search for documents that contain the words "A" or "B," use the OR operator. Keep in mind that if you don't put any Boolean operator between two search words, the OR operator will be added between them automatically. For example, "Java OR Lucene" and "Java Lucene" both search for the terms "Java" or "Lucene."
• AND: If you want to search for documents that contain more than one word, use the AND operator. For example, "Java AND Lucene" returns all documents that contain both "Java" and "Lucene."
• NOT: Documents that contain the search word immediately after the NOT operator won't be retrieved. For example, if you want to search for documents that contain "Java" but not "Lucene," you may use the query "Java NOT Lucene." You cannot use this operator with only one term; for example, the query "NOT Java" returns no results.
• +: The function of this operator is similar to the AND operator, but it only applies to the word immediately following it. For example, if you want to search for documents that must contain "Java" and may contain "Lucene," you can use the query "+Java Lucene."
• -: The function of this operator is the same as the NOT operator. The query "Java -Lucene" returns all of the documents that contain "Java" but not "Lucene."

Now look at how to implement a query with Boolean operators using Lucene's API. Listing 1 shows the process of doing searches with Boolean operators.

Field search

Lucene supports field search. You can specify the fields that a query will be executed on. For example, if your document contains two fields, Title and Content, you can use the query "Title: Lucene AND Content: Java" to search for documents that contain the term "Lucene" in the Title field and "Java" in the Content field. Listing 2 shows how to use Lucene's API to do a field search.

Wildcard search

Lucene supports two wildcard symbols: the question mark (?) and the asterisk (*). You can use ? to perform a single-character wildcard search, and you can use * to perform a multiple-character wildcard search. For example, if you want to search for "tiny" or "tony," you can use the query "t?ny," and if you want to search for "Teach," "Teacher," and "Teaching," you can use the query "Teach*." Listing 3 demonstrates the process of doing a wildcard search.

Fuzzy search

Lucene provides a fuzzy search that's based on an edit distance algorithm. You can use the tilde character (~) at the end of a single search word to do a fuzzy search. For example, the query "think~" searches for terms similar in spelling to the term "think." Listing 4 features sample code that conducts a fuzzy search with Lucene's API.

Range search

A range search matches the documents whose field values are in a range. For example, the query "age:[18 TO 35]" returns all of the documents with the value of the "age" field between 18 and 35. Listing 5 shows the process of doing a range search with Lucene's API.
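The original article's Listings 1 through 5 are not reproduced in this transcript. The following is a hedged sketch of how these query types can be built programmatically with Lucene's query classes; the field names are illustrative and exact constructors differ between Lucene versions:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause.Occur;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.FuzzyQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TermRangeQuery;
    import org.apache.lucene.search.WildcardQuery;

    public class QueryExamples {
        public static void main(String[] args) {
            // Boolean operators: the equivalent of "+Java -Lucene"
            // (must contain Java, must not contain Lucene).
            Query bool = new BooleanQuery.Builder()
                    .add(new TermQuery(new Term("content", "java")), Occur.MUST)
                    .add(new TermQuery(new Term("content", "lucene")), Occur.MUST_NOT)
                    .build();

            // Field search: restrict each term to a named field.
            Query field = new BooleanQuery.Builder()
                    .add(new TermQuery(new Term("title", "lucene")), Occur.MUST)
                    .add(new TermQuery(new Term("content", "java")), Occur.MUST)
                    .build();

            // Wildcard search: "t?ny" matches tiny or tony.
            Query wildcard = new WildcardQuery(new Term("content", "t?ny"));

            // Fuzzy search: terms within a small edit distance of "think".
            Query fuzzy = new FuzzyQuery(new Term("content", "think"));

            // Range search over string-valued terms, both ends inclusive.
            Query range = TermRangeQuery.newStringRange("age", "18", "35", true, true);

            System.out.println(bool + "\n" + field + "\n" + wildcard + "\n" + fuzzy + "\n" + range);
        }
    }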
2. Searching a small national domain - Preliminary report

András A. Benczúr, Károly Csalogány, Dániel Fogaras, Eszter Friedman, Tamás Sarlós, Máté Uher, Eszter Windhager
Computer and Automation Research Institute, Hungarian Academy of Sciences (MTA SZTAKI), 11 Lagymanyosi u., H-1111 Budapest, Hungary
Eötvös University, Budapest, and Budapest University of Technology and Economics
{benczur,cskaresz,fd,feszter,stamas,umate,hexapoda}@ilab.sztaki.hu
http://www.ilab.sztaki.hu/websearch

ABSTRACT

Small languages represent a non-negligible portion of the Web, with interest for a large population with less literacy in English. Existing search engine solutions, however, vary in quality, mostly because a few of these languages have a particularly complicated syntax that requires communication between linguistic tools and "classical" Web search techniques. In this paper we present development-stage experiments of a search engine for the .hu or other similar national domains. Such an engine differs in several design issues from large-scale engines; as an example, we apply efficient crawling and indexing policies that may enable breaking-news search.

Keywords

Web crawling, search engines, database freshness, refresh policies.
1. INTRODUCTION

While search engines face the challenge of billions of Web pages with content in English or some other major language, small languages such as Hungarian represent a quantity orders of magnitude smaller. A few of the small languages have a particularly complicated syntax; most notably, Hungarian is one of the languages that troubles a search engine designer the most by requiring complex interaction between linguistic tools and ranking methods. Searching such a small national domain is of commercial interest for a non-negligible population; existing solutions vary in quality. Despite its importance, designs, experiments or benchmark tests concerning small-range national search engine development appear only exceptionally in the literature.

A national domain such as .hu covers a moderate size of not much more than ten million HTML pages and concentrates most of the national-language documents. The size, concentration and network locality allow an engine to run on an inexpensive architecture while it may, by orders of magnitude, outperform large-scale engines in keeping the index up to date. In our experiments we use a refresh policy that may fit the purpose of a news agent. Our architecture will hence necessarily differ at several points from those appearing in published experiments or benchmark tests.

In this paper we present development-stage experiments of a search engine for the .hu or other similar domain that may form the base of a regional or a distributed engine. Peculiar to search over small domains are our refresh policy and the extensive use of clustering methods. We revisit documents in a more sophisticated way than in [13]; the procedure is based on our experiments, which show behavior over the Hungarian web different from [7]. Our method for detecting domain boundaries improves on [8] by using hyperlink graph clustering to aid boundary detection. We give a few examples that show the difficulty of Web information retrieval in agglutinative and multi-language environments; natural language processing in more detail is, however, not addressed in this work.

2. ARCHITECTURE

The high-level architecture of a search engine is described in various surveys [2, 4, 13]. Since we only crawl among a few tens of millions of pages, our engine also contains a mixture of a news agent and a focused crawler. Similar to the extended crawl model of [5], we include long-term and short-term crawl managers. The manager determines the long-term schedule refresh policy based on PageRank [2] information. Our harvester (a modified Larbin 2.6.2 [1]) provides the short-term schedule and also serves as a seeder by identifying new URLs. Eventually all new URLs are visited or banned by site-specific rules. We apply a two-level indexer (BerkeleyDB 3.3.11 [12]) based on [11] that updates a temporary index in frequent batches; temporary index items are merged into the permanent index at off-peak times, and at certain intervals the entire index is recompiled.

Due to space limitations, a number of implementation issues are described in the full paper, such as the modifications of the crawler, its interface to the long-term refresh scheduler, as well as the indexing architecture and measurements.
[Figure 1: Search engine architecture]

3. INFORMATION RETRIEVAL IN AN AGGLUTINATIVE AND MULTI-LANGUAGE ENVIRONMENT

Hungarian is one of the languages that troubles a search engine designer with lost hits or severe topic drifts when the natural-language issues are not properly handled and integrated into all levels of the engine itself. We found single words with over 300 different word-form occurrences in Web documents (Figure 2); in theory this number may be as large as ten thousand, while compound rules are nearly as permissive as in German. Word forms of stop words may or may not be stop words themselves. Technical documents are often filled with English words, and documents may be bi- or multilingual. Stemming is expensive and thus, unlike in classical IR, we stem in large batches instead of at each word occurrence.
[Figure 2: Number of word stems with a given number of word-form occurrences in 5 million Hungarian-language documents]

Polysemy often confuses search results and makes top hits irrelevant to the query. "Java", the most frequently cited example of a word with several meanings, in Hungarian also means "majority of", "its belongings", "its goods", "its best portion", a type of pork, and may also be incorrectly identified as an agglutination of a frequent abbreviation in mailing lists. The synonyms "dog" and "hound" ("kutya" and "eb") occur non-exchangeably in compounds such as "hound breeders" and "dog shows", while one of them also stands for the abbreviation of European Championship. Or the name of a widely used data mining software, "sas", translates back to English as "eagle", which in addition frequently occurs in Hungarian name compounds.

In order to achieve acceptable precision in Web information retrieval for multilingual environments or agglutinative languages with a very large number of word forms, we need to consider a large number of mixed linguistic, ranking and IR issues.

• We index stems at appropriate levels; for example, in (((áll)am)((((ad)ó)s)ság)) = ((stand)state)((((give)tax)debtor)debt) we might not even want to stem at all. Too deep stemming causes topic drift; weak stemming suffices, since relevant hits are expected to contain the query word several times, and among its forms the stem is likely to occur.
• Efficient phrase search is nontrivial on its own [3, 13]; we also need the original word form in the index for this task.
• In order to rank the relevance of an index entry, we may need to know the syntax of the sentence that contains the word, or use document word-frequency clustering results.
• The document's language(s) and possible missing accents (typical in mailing list archives) must also be taken into consideration, or else the index term easily changes meaning (beer, for example, becomes queue with no accents: "sör" and "sor").
• Translations of the stems and forms between Hungarian and English help ranking algorithms that use anchor text information.
• All of the above issues tend to increase index granularity and the amount of additional information stored, and thus the space requirement must be carefully optimized.

4. RANKING AND DOMAIN CLUSTERING

A unique possibility in searching a moderate-size domain is the extensive and relatively inexpensive application of clustering methods. As a key application, we determine coherent domains or multi-part documents. As noticed by Davison [8], hyperlink analysis yields accurate quality measures only if applied to a higher-level link structure of inter-domain links. The URL text analysis method [8], however, often fails for the .hu domain, resulting in unfair PageRank values for a large number of sites unless our clustering method is used.

5. REFRESH POLICY

Our refresh policy is based on the observed refresh time of the page and its PageRank [2]. We extend the argument of Cho et al. [6] by weighting freshness measures by functions of the PageRank pr: given the refresh time refr, we are looking for sync, the synchronization time function over pages, that maximizes PageRank times expected freshness. The optimum solution of the system can be obtained by solving

    pr * (refr - (sync + refr) * exp(-sync / refr)) = u

where we let the Lagrange multiplier u be the maximum value such that the download capacity constraint is not exceeded. We compute an approximate solution by increasing the refresh rate in discrete steps for documents with the minimum current u value. The number of equations is reduced by discretizing the PageRank and frequency values.

We propose another fine-tuning for efficiently crawling breaking news. Documents that need not be revisited every day may be safely scheduled for off-peak hours; thus during daytime we concentrate on news sites and on sites with high PageRank and quick changes. At off-peak hours, however, frequent visits to news portals are of little use and priorities may be modified accordingly.
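A loose reading of this scheme (an illustrative sketch only; the paper's discretization and capacity handling are not fully specified in this excerpt, and all numbers below are made up) is a greedy loop that repeatedly gives the next unit of crawl capacity to the page whose current value u = pr * (refr - (sync + refr) * exp(-sync / refr)) is smallest:

    import java.util.Arrays;

    public class RefreshScheduler {
        // Value u for a page with PageRank pr, observed refresh time refr and
        // current synchronization interval sync (all times in the same unit).
        static double u(double pr, double refr, double sync) {
            return pr * (refr - (sync + refr) * Math.exp(-sync / refr));
        }

        public static void main(String[] args) {
            double[] pr   = {0.5, 0.3, 0.2};     // discretized PageRank values (illustrative)
            double[] refr = {60, 240, 1440};     // observed refresh times in minutes (illustrative)
            double[] sync = {1440, 1440, 1440};  // start with one visit per day for every page

            // Greedily spend a fixed crawl budget: each step shortens the sync
            // interval of the page whose current u value is smallest.
            for (int budget = 0; budget < 10; budget++) {
                int best = 0;
                for (int i = 1; i < pr.length; i++) {
                    if (u(pr[i], refr[i], sync[i]) < u(pr[best], refr[best], sync[best])) best = i;
                }
                sync[best] /= 2;                 // one discrete step of extra refresh rate
            }
            System.out.println(Arrays.toString(sync));
        }
    }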
6. EXPERIMENTS

We estimate not much more than ten million "interesting" pages on the .hu domain, residing at approximately 300,000 Web sites. In the current experiment we crawled five million; in order to obtain a larger collection of valuable documents, we need refined site-specific control over in-site depth and over following links that pass arguments.

Among the 4.7 million files of size over 30 bytes, depending on its settings the language guesser finds Hungarian 70-90% and English 27-34% of the time (pages may be multilingual, or different incorrect guesses may be made at different settings). Outside .hu we found an additional 280,000 pages, mostly in Hungarian, by simple stop-word heuristics; more refined language guessing turned out to be too expensive for this task.

We conducted a preliminary measurement of the lifetime of HTML documents. Unlike suggested by [7], our results in Figure 3 do not confirm a relation of any kind between refresh rates and PageRank. Hence we use the computed optimal schedule based on the observed refresh rate and the PageRank; in this schedule, documents of a given rank are reloaded more frequently as their lifetime decreases, until a certain point where freshness updates are quickly given up.

[Figure 3: Scatterplot of PageRank and average lifetime (in minutes) of 3 million pages of the .hu domain. In order to amplify low-rank pages we show 1/PageRank on the horizontal axis.]

We found that pre-parsed document text, as in [10], provides a good indication of a content update. Portals and news sites should in fact be revisited by more sophisticated rules: for example, only a few index.htmls need to be recrawled, and links must be followed in order to keep the entire database fresh.
Methods for indexing only the content, and not the automatically generated navigational and advertisement blocks, are given in [9].

7. ACKNOWLEDGMENTS

To MorphoLogic Inc. for providing us with their language guesser (LangWitch) and their Hungarian stemming module (HelyesLem). Research was supported by the NKFP-2/0017/2002 project Data Riddle and various ETIK, OTKA and AKP grants.

3. Architecture of a Search Engine - Components and Process [http://www.beatgoogleusa.com/]

Any internet search engine consists of two major parts: a back-end database (server side) and a GUI (client side) that lets the user type the search term.

[Figure: Basic search engine architecture]

On the server side, the process involves the creation of a database and its periodic updating, done by a piece of software called a spider. The spider "crawls" the URL pages periodically and indexes the crawled pages in the database. The hyperlinked nature of the Internet makes it possible for the spider to traverse the web. The interface between the client and server sides consists of matching the posted query with the entries in the database and retrieving the matched URLs to the user's machine.

The spider crawls the web pages through their hyperlinks. In this process it extracts the title, keywords, and any other related information needed for the database from the HTML document.
Sometimes the entire content of the HTML document (except for the stop words, very common words such as "for", "is", etc.) is extracted and indexed in the database. This is based on the idea that a page dealing with a particular issue will have relevant words throughout the page; thus indexing all the words in a document increases the probability of getting the relevant URLs for a query. One point is worth noting here: before the query words are processed, their morphological inflections are removed and only then are they searched for in the database. The spider is also referred to by the names "robot", "crawler", "indexer", etc. The database consists of a number of tables arranged to aid quick retrieval of the data. With the number of sites increasing, it is common for search engines to maintain more than one database server. When the user queries for term(s), those term(s) are searched for in the database and the sites in which they are present are identified. These sites are then ranked on the basis of their relevancy to the user query. The ranked sites are displayed, with links to the sites and a small description taken from each site itself, so as to give the user an idea about the site.

Five key building blocks of a crawling search engine:

CRAWLER (or ROBOT) - a specialised automated program that follows links found on web pages and directs the spider where to go next by finding new sites for it to visit. When you add your URL to a search engine, it is the crawler you are requesting to visit your site.

SPIDER (or ROBOT) - an automatic browser-like program that downloads documents found on the web by the crawler. It works very much as a browser does when it connects to a website and downloads pages. Most spiders aren't interested in images, though, and don't ask for them to be sent.

INDEXER - a program that "reads" the pages that are downloaded by spiders. This does most of the work of deciding what your site is about. The words in the site are "read"; some are thrown away, as they are so common ("and", "it", "the", etc.). It will also examine the HTML code which makes up your site, looking for other clues as to which words you consider to be important. Words in bold, italic or header tags will be given more weight. This is also where the meta information (the keywords and description tags) for your site will be analysed.

DATABASE - an index for storage of the pages downloaded and processed. It is where the information gathered by the indexer is stored.

RESULTS ENGINE - generates search results out of the database, according to your query. This is the most important part of any search engine. The results engine is the customer-facing (UI) portion of a search engine, and as such is the focus of most optimisation efforts. It is the results engine's function to return the pages most relevant to a user's query. When a user types in a keyword or phrase, the results engine must decide which pages are most likely to be useful to the user. The method it uses to decide that is called its "algorithm". You may hear Search Engine Optimisation (SEO) experts discuss "algos" or "breaking the algo" for a particular search engine. After all, if you know what the criteria being used are, you can write pages to take advantage of them.
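As an illustration of the kind of weighting a results engine might apply (a minimal sketch with made-up field weights, not a description of any real engine's algorithm), a toy scorer can count query-term matches per field and weight title matches more heavily than body matches, echoing the idea above that words in titles and headers carry more weight:

    import java.util.Map;

    public class ToyResultsEngine {
        // Weighted match count: title hits count more than body hits. The weights are arbitrary.
        static double score(String query, Map<String, String> fields) {
            double total = 0;
            for (String term : query.toLowerCase().split("\\W+")) {
                if (term.isEmpty()) continue;
                total += 3.0 * countOccurrences(fields.getOrDefault("title", ""), term);
                total += 1.0 * countOccurrences(fields.getOrDefault("body", ""), term);
            }
            return total;
        }

        static int countOccurrences(String text, String term) {
            int count = 0;
            for (String word : text.toLowerCase().split("\\W+")) {
                if (word.equals(term)) count++;
            }
            return count;
        }

        public static void main(String[] args) {
            Map<String, String> pageA = Map.of("title", "Search engine basics", "body", "How a crawler works");
            Map<String, String> pageB = Map.of("title", "Gardening tips", "body", "Nothing about search here");
            System.out.println("A: " + score("search engine", pageA)); // title matches weigh more
            System.out.println("B: " + score("search engine", pageB));
        }
    }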