Transcript

  • 1. DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING, SYNERGY INSTITUTE OF ENGINEERING & TECHNOLOGY, DHENKANAL. SEMINAR ON: How a search engine works?
  • 2. Seminar Report ’10: How a search engine works
    Guided by: XxX
    Submitted by: SOVAN MISRA, CS-O7-42, 0701230147
    Dept. of CSE, S.I.E.T, Dhenkanal
  • 3. DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING, SYNERGY INSTITUTE OF ENGINEERING AND TECHNOLOGY, DHENKANAL
    CERTIFICATE
    Certified that this is a bonafide record of the seminar entitled “HOW A SEARCH ENGINE WORKS” done by the following student, “SOVAN MISRA”, of the 7th semester, Computer Science and Engineering, in the year 2010 in partial fulfilment of the requirements of the award of Degree of Bachelor of Technology in Computer Science and Engineering of Synergy Institute Of Engineering And Technology, Dhenkanal.
    XxX (Seminar Guide), XxX (Head of the Department)
  • 4. ACKNOWLEDGEMENT
    I thank my seminar guide XxX, Lecturer, SIET, for her proper guidance and valuable suggestions. I am indebted to XxX, the HOD, Computer Science department, and other faculty members for giving me an opportunity to learn and do this seminar. If not for the above mentioned people, my seminar would never have been completed successfully. I once again extend my sincere thanks to all of them.
    SOVAN MISRA
  • 5. HOW SEARCH ENGINE WORKS?
    INTRODUCTION
    What is a search engine?
    A search engine is a software program that searches a database and gathers and reports information that contains, or is related to, specified terms. Put another way, it is a website whose primary function is to search for, gather and report information available on the Internet or a portion of the Internet.
    Why search engines?
    Today, millions and billions of pieces of information are available on the vast World Wide Web (WWW). Searching for a particular piece of information by hand would cost the user a great deal of time, so tools are needed that make searching automatic, quick and effortless. Web search engines were introduced in the 1990s to reduce this problem to a more or less manageable one.
  • 6. History of search engines:
    In 1990, the first search engine, Archie, was released. At that time there was no World Wide Web. Data resided on defence contractor, university and government computers, and techies were the only people accessing it. The computers were interconnected by Telnet, and the File Transfer Protocol (FTP) was used for transferring files from computer to computer. There was no such thing as a browser, so data was transferred in its native format and viewed using the software associated with the file type. Archie searched FTP servers and indexed their files into a searchable directory.
    In 1991, Gopherspace came into existence with the advent of Gopher. It catalogued FTP sites, and the resulting catalogue became known as Gopherspace.
    In 1994, WebCrawler, a new type of search engine that indexed the entire contents of a web page, was introduced.
    Between 1995 and 1998, many changes and developments occurred in the world of search engines. Meta tags in the web page were first used by some search engines to determine relevancy. Search engine rank-checking software was introduced; it provided an automated tool to determine a website's position and ranking within the major search engines.
    Around 1998, search engine algorithms were introduced to optimize searching.
    In 2000, marketers determined that pay-per-click campaigns were an easy yet expensive approach for gaining top search rankings. To elevate their sites in the search engine rankings, websites started adding useful and relevant content while optimizing their web pages for each specific search engine. Search engine optimization (SEO) continues today as the algorithms are improved.
    TYPES OF SEARCH ENGINES:
  • 7. On the basis of how they work, search engines are categorised into the following groups:
      * Crawler-based search engines
      * Directories
      * Hybrid search engines
      * Meta search engines
    CRAWLER-BASED SEARCH ENGINE:
    It uses automated software programs, known as “spiders”, “crawlers”, “robots” or “bots”, to survey and categorise web pages. A spider finds a web page, downloads it and analyses the information presented on it. The web page is then added to the search engine's database. When a user performs a search, the search engine checks its database of web pages for the keywords the user searched for. The results are ordered according to the search engine's algorithm and presented in the search engine result pages (SERPs).
    Examples: Google (www.google.com), Ask (www.ask.com)
  • 8. SPIDER'S ALGORITHM:
    All spiders use the following algorithm for retrieving documents from the web:
      * The algorithm uses a list of known URLs. This list contains at least one URL to start with.
      * A URL is taken from the list and the corresponding document is retrieved from the web.
      * The document is parsed to retrieve information for the index database and to extract the embedded links to other documents.
  • 9. The URLs of the links found in the document are added to the list of known URLs.
    If the list is empty or some limit is exceeded (number of documents retrieved, size of the index database, etc.), the algorithm stops; otherwise it continues with the next URL from the list.
    A crawler program treats the World Wide Web as a big graph with pages as nodes and hyperlinks as arcs. The crawler works with a simple goal: indexing all keywords in the web page titles.
    Three data structures are needed for the crawler (spider) algorithm:
      • A large linear array, url_table
      • Heap
      • Hash table
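    The retrieval loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `crawl`, `fetch`, `max_documents` and `FAKE_WEB` are hypothetical, the "web" is a small in-memory dictionary rather than real HTTP, and parsing is reduced to reading a title and a list of links.

```python
from collections import deque


def crawl(seed_urls, fetch, max_documents=1000):
    """Visit pages starting from seed_urls until the list is empty or a limit is hit.

    fetch(url) is a hypothetical callback that returns (title, links) for a
    page, or None if the page cannot be retrieved.
    """
    frontier = deque(seed_urls)   # known URLs still waiting to be visited
    known = set(seed_urls)        # every URL ever added to the list
    index = {}                    # url -> title, a stand-in for the index database

    while frontier and len(index) < max_documents:
        url = frontier.popleft()  # take a URL from the list
        page = fetch(url)         # retrieve the document
        if page is None:
            continue
        title, links = page       # "parsing" is delegated to fetch in this sketch
        index[url] = title
        for link in links:        # add newly found URLs to the list of known URLs
            if link not in known:
                known.add(link)
                frontier.append(link)
    return index


# A tiny in-memory "web" (assumed data) so the sketch runs without a network.
FAKE_WEB = {
    "a.example": ("Page A", ["b.example", "c.example"]),
    "b.example": ("Page B", ["a.example"]),
    "c.example": ("Page C", ["d.example"]),
}

if __name__ == "__main__":
    print(crawl(["a.example"], FAKE_WEB.get))
```

    The `known` set keeps a page from being queued twice, which is the role the hash table plays in the report.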
  • 10. url_table:
    It is a large linear array that contains millions of entries. Each entry contains two pointers:
      • Pointer to URL
      • Pointer to title
    These are variable-length strings and are kept on the heap.
    Heap:
    It is a large unstructured chunk of virtual memory to which strings can be appended.
    Hash table:
    It is the third data structure, with ‘n’ entries. Any URL can be run through a hash function to produce a non-negative integer less than ‘n’. All URLs that hash to the value ‘k’ are hooked together on a linked list starting at entry ‘k’ of the hash table. Every entry in url_table is also entered into the hash table. The main use of the hash table is to start with a URL and quickly determine whether it is already present in url_table.
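    A minimal sketch of how these three structures could fit together is shown below. It makes several assumptions that are not in the report: the class name `UrlStore` and its methods are hypothetical, "pointers" are modelled as byte offsets into a `bytearray`, strings are NUL-terminated, Python lists stand in for the linked lists, and CRC-32 is used as the hash function purely for convenience.

```python
import zlib


class UrlStore:
    """url_table + heap + hash table, modelled with plain Python containers."""

    def __init__(self, n_buckets=1024):
        self.heap = bytearray()   # unstructured chunk; strings are only appended
        self.url_table = []       # each entry: (offset of URL, offset of title)
        self.hash_table = [[] for _ in range(n_buckets)]  # bucket k -> entry numbers

    def _append(self, text):
        """Append a NUL-terminated string to the heap and return its offset."""
        offset = len(self.heap)
        self.heap += text.encode("utf-8") + b"\x00"
        return offset

    def _read(self, offset):
        """Read the string that starts at the given heap offset."""
        end = self.heap.index(b"\x00", offset)
        return self.heap[offset:end].decode("utf-8")

    def _bucket(self, url):
        """Hash a URL to a non-negative integer less than n."""
        return zlib.crc32(url.encode("utf-8")) % len(self.hash_table)

    def find(self, url):
        """Return the url_table entry number for url, or None if unknown."""
        for entry in self.hash_table[self._bucket(url)]:
            url_offset, _ = self.url_table[entry]
            if self._read(url_offset) == url:
                return entry
        return None

    def add(self, url, title):
        """Store url and title unless the URL is already present."""
        existing = self.find(url)
        if existing is not None:
            return existing
        entry = len(self.url_table)
        self.url_table.append((self._append(url), self._append(title)))
        self.hash_table[self._bucket(url)].append(entry)
        return entry

    def get(self, entry):
        """Return (url, title) for a url_table entry number."""
        url_offset, title_offset = self.url_table[entry]
        return self._read(url_offset), self._read(title_offset)
```

    With only a couple of buckets, several URLs share a chain, which shows why `find` has to compare the stored string and not just the hash value.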
  • 11. DATA STRUCTURE FOR CRAWLER
    Building the index requires two phases:
      • Searching (URL processing)
      • Indexing
    The heart of the search engine is a recursive procedure, process_url, which takes a URL string as input. Searching is done by process_url as follows:
      • It hashes the URL to see if it is already present in url_table. If so, it is done and returns immediately.
      • If the URL is not already known, its page is fetched.
      • The URL and title are then copied to the heap, and pointers to these two strings are entered in url_table.
      • The URL is also entered into the hash table.
      • Finally, process_url extracts all the hyperlinks from the page and calls process_url once per hyperlink, passing the hyperlink's URL as the input parameter.
    For each entry in url_table, the indexing procedure examines the title and selects all words not on the stop list.
  • 12. Each selected word is written to a file as a line consisting of the word followed by the current url_table entry number. When the whole table has been scanned, the file is sorted by word.
    Formulating queries:
      • Submitting a keyword causes a request to be sent to the machine where the index is located (the web server).
      • The keyword is then looked up in the index database to find the set of URL indices for each keyword.
      • These indices are used to index into url_table to find all the titles and URLs, which are then stored in the document server.
      • These are then combined to form a web page, which is sent back to the user as the response.
    Determining Relevance
    The classic “TF-IDF” algorithm is used for determining relevance.
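    The searching phase, the indexing phase and the keyword lookup described above can be sketched together as follows. Everything beyond the name process_url is an assumption made for illustration: `fetch`, `build_index`, `query`, `STOP_LIST` and `FAKE_WEB` are hypothetical, a Python dict replaces the hash table, an in-memory list replaces the sorted file, and combining several keywords by intersection is a choice of this sketch, not something the report specifies.

```python
STOP_LIST = {"a", "an", "and", "of", "the", "to"}  # assumed stop words


def process_url(url, fetch, url_table, seen):
    """Searching phase: record the page, then recurse into its hyperlinks."""
    if url in seen:              # already present in url_table: return immediately
        return
    page = fetch(url)            # hypothetical fetch: (title, links) or None
    if page is None:
        return
    title, links = page
    seen[url] = len(url_table)   # the dict plays the role of the hash table
    url_table.append((url, title))
    for link in links:           # one recursive call per hyperlink
        process_url(link, fetch, url_table, seen)


def build_index(url_table):
    """Indexing phase: emit (word, entry number) lines, then sort them by word."""
    lines = []
    for entry, (_url, title) in enumerate(url_table):
        for word in title.lower().split():
            if word not in STOP_LIST:
                lines.append((word, entry))
    lines.sort()
    index = {}
    for word, entry in lines:
        index.setdefault(word, []).append(entry)
    return index


def query(keywords, index, url_table):
    """Return (url, title) for pages whose title contains every keyword."""
    entry_sets = [set(index.get(word.lower(), [])) for word in keywords]
    hits = set.intersection(*entry_sets) if entry_sets else set()
    return [url_table[entry] for entry in sorted(hits)]


# Assumed sample data so the sketch runs without a network.
FAKE_WEB = {
    "a.example": ("The History of Search Engines", ["b.example", "c.example"]),
    "b.example": ("Search Engine Algorithms", ["a.example"]),
    "c.example": ("Guide to Web Directories", []),
}

if __name__ == "__main__":
    table, seen = [], {}
    process_url("a.example", FAKE_WEB.get, table, seen)
    print(query(["search"], build_index(table), table))
```

    A production crawler would avoid deep recursion (Python limits the call depth), but the recursive form mirrors the procedure as the report describes it.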
  • 13. TF-IDF is a weight often used in information retrieval and text mining:
      • The weight is a statistical measure used to evaluate how important a word is to a document in a collection.
      • A high TF-IDF weight is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents.
    Term Frequency
      • “Term frequency” is the number of times a given term appears in a document, divided by the total number of terms in that document.
      • It gives a measure of the importance of the term $t_i$ within the particular document.
    $tf_i = \frac{n_i}{\sum_k n_k}$
    where $n_i$ is the number of occurrences of the considered term, and the denominator is the number of occurrences of all terms.
    E.g. if a document contains 100 total words and the word computer appears 3 times, then the term frequency of the word computer in the document is 0.03 (3/100).
    Inverse Document Frequency
    The “inverse document frequency” is a measure of the general importance of the term (obtained by dividing the number of all documents by the number of documents containing the term, and then taking the logarithm of that quotient).
    $idf_i = \log \frac{|D|}{|\{d : t_i \in d\}|}$
    where
      • $|D|$: total number of documents in the corpus
  • 14. $|\{d : t_i \in d\}|$: number of documents in which the term $t_i$ appears (that is, $n_i \neq 0$)
    There are different ways of calculating the inverse document frequency:
      1) The “document frequency” (DF) is the number of documents that contain the word divided by the total number of documents in the collection. E.g. if the word computer appears in 1,000 documents out of a total of 10,000,000, then the DF is 0.0001 (1,000/10,000,000).
      2) Alternatively, take the logarithm of the inverse of the document frequency. The natural logarithm is commonly used. In this example we would have IDF = ln(10,000,000 / 1,000) = ln(10,000) ≈ 9.21.
    The final TF-IDF score is then calculated either by dividing the “term frequency” by the “document frequency” (first formula) or by multiplying the “term frequency” by the logarithmic IDF (second formula).
    E.g. the TF-IDF score for computer in the collection would be:
      1) TF-IDF = 0.03 / 0.0001 = 300, using the first formula.
      2) TF-IDF = 0.03 × 9.21 ≈ 0.276, using the alternate formula.
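    The worked example above can be checked with a short Python sketch. The function names are hypothetical, the 100-word document is synthetic (three occurrences of "computer" padded with a filler word), and the natural logarithm is used as in the example.

```python
import math


def term_frequency(term, document_words):
    """Occurrences of term divided by the total number of words in the document."""
    return document_words.count(term) / len(document_words)


def document_frequency(documents_with_term, total_documents):
    """Fraction of documents in the collection that contain the term."""
    return documents_with_term / total_documents


def inverse_document_frequency(documents_with_term, total_documents):
    """Natural log of (all documents / documents containing the term).

    Assumes documents_with_term is greater than zero.
    """
    return math.log(total_documents / documents_with_term)


# Synthetic document: 100 words, "computer" appears 3 times.
words = ["computer"] * 3 + ["filler"] * 97

tf = term_frequency("computer", words)                 # 3 / 100
df = document_frequency(1_000, 10_000_000)             # 1,000 / 10,000,000
idf = inverse_document_frequency(1_000, 10_000_000)    # ln(10,000)

score_ratio = tf / df    # first formula: TF divided by DF
score_log = tf * idf     # second formula: TF multiplied by logarithmic IDF

if __name__ == "__main__":
    print(tf, df, idf, score_ratio, score_log)
```

    The two scores live on very different scales, so they should not be mixed when ranking a single result list.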
  • 15. OTHER TYPES OF SEARCHING TECHNIQUES:
    Directories
    Human editors comprehensively check each website and rank it, based on the information they find, using a pre-defined set of rules. There are two major directories:
      • Yahoo Directory (www.yahoo.com)
      • Open Directory (www.dmoz.org)
    Hybrid Search Engines
    Hybrid search engines use a combination of both crawler-based results and directory results.
  • 16. Examples of hybrid search engines are:
      • Yahoo (www.yahoo.com)
      • Google (www.google.com)
    Meta Search Engines
    Also known as multiple search engines or metacrawlers. Meta search engines query several other web search engine databases in parallel and then combine the results in one list. Examples of meta search engines include:
      • Metacrawler (www.metacrawler.com)
      • Dogpile (www.dogpile.com)
    References:
      • http://computer.howstuffworks.com/internet/basics/search-engine.htm
      • http://searchenginewatch.com/2168031
      • http://www.infotoday.com/searcher/may01/liddy.htm
      • http://www.slideshare.net/jsuleiman/how-search-engines-work-presentation
    Hey! This is Sovan. Please send your feedback to sovan107@gmail.com