  1. A Novel and Efficient Approach for Near Duplicate Page Detection in Web Crawling
     VIPIN KP (08103066), S7 CSE A, Department of CSE, MESCE
     Guided by: Mr. Aneesh M Haneef, Asst. Professor
     1/2/2012
  2. Presentation Outline
     • Introduction
     • What are near duplicates
     • Drawbacks of near duplicate pages
     • What is a Web crawler
     • Simplified Crawl Architecture
     • Near duplicate detection
     • Advantages
     • Conclusion
     • Reference
  3. Introduction
     • Search engines are the main gateways for accessing information on the web.
     • A search engine operates in the following order: web crawling, indexing, searching.
     • Web crawling is the process that builds the indexed repository used by the search engine.
     • The enormous number of documents on the web poses huge challenges to search engines, making their results less relevant to the user.
  4. Introduction cont'd…
     • Web search engines face additional problems due to near duplicate web pages.
     • It is an important requirement for search engines to provide users with relevant results without duplication.
     • Near duplicate page detection is a challenging problem.
  5. What are near duplicates?
     • Near duplicates are not "exact duplicates", but files with minute differences.
     • They differ slightly in advertisements, counters, timestamps, etc.
     • Most web sites share boilerplate code.
  6. What are near duplicates?
     [Screenshot] http://shop.asus.co.uk/shop/gb/en-gb/home.aspx
  7. What are near duplicates?
     [Screenshot] http://shop.asus.es/shop/gb/en-gb/home.aspx
  8. Drawbacks of Near Duplicate Web Pages
     • Waste network bandwidth
     • Increase storage cost
     • Affect the quality of search indexes
     • Increase the load on the remote host serving such pages
     • Affect customer satisfaction
  9. Web Crawler
     • A Web crawler is a computer program that browses the World Wide Web in an orderly fashion. Other terms for Web crawlers are ants, automatic indexers, bots, Web spiders, and Web robots.
     • Search engines use web crawlers to create a copy of all visited pages for later processing by an indexer, so that the downloaded pages support fast searches. This indexed database is used for the searching process.
     • A crawler may examine a URL to check whether it ends with certain characters such as .html, .htm, .asp, .aspx, .php, .jsp, .jspx or a slash. Some crawlers also avoid requesting any resources that have a "?" in them.
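The URL checks on this slide can be sketched as a small filter. The extension list follows the slide; the function name and the example URLs are illustrative:

```python
# Accept URLs ending in common page extensions or a slash; skip URLs
# containing "?" (dynamic resources), as described on the slide above.
PAGE_SUFFIXES = (".html", ".htm", ".asp", ".aspx", ".php", ".jsp", ".jspx", "/")

def should_crawl(url: str) -> bool:
    if "?" in url:  # some crawlers avoid dynamic resources entirely
        return False
    return url.lower().endswith(PAGE_SUFFIXES)

print(should_crawl("http://example.com/index.html"))  # True
print(should_crawl("http://example.com/search?q=x"))  # False
```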
  10. Simplified Crawl Architecture
      [Diagram] A newly crawled HTML document has its links traversed for further crawling and is checked against the entire index: if it is a near duplicate it is discarded (trash); otherwise it is inserted into the Web index.
  11. Near Duplicate Detection
      The steps involved in this approach are:
      • Web document parsing
      • Stemming algorithm
      • Keyword representation
      • Similarity score calculation
  12. Near Duplicate Detection cont'd…
      Web Document Parsing:
      • Parsing may be as simple as URL extraction or as complex as removing the HTML tags and JavaScript from a web page.
      • Stop Word Removal: removes commonly used words such as "an", "and", "the", "to", "with", "by", "for", etc. This helps reduce the size of the index file.
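A minimal sketch of this parsing step, assuming regex-based tag stripping and an illustrative stop-word set (a production parser would use a real HTML parser rather than regexes):

```python
import re

# Strip scripts/styles and HTML tags, tokenize, then drop stop words.
STOP_WORDS = {"an", "and", "the", "to", "with", "by", "for", "a", "of", "in"}

def parse_document(html: str) -> list[str]:
    text = re.sub(r"(?is)<(script|style).*?</\1>", " ", html)  # drop scripts/styles
    text = re.sub(r"(?s)<[^>]+>", " ", text)                   # drop remaining tags
    words = re.findall(r"[a-z]+", text.lower())                # tokenize
    return [w for w in words if w not in STOP_WORDS]

print(parse_document("<html><body>The <b>crawler</b> parses a page</body></html>"))
# → ['crawler', 'parses', 'page']
```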
  13. Near Duplicate Detection cont'd…
      Stemming Algorithm:
      • Stemming is the process of reducing derived words to their stem, base, or root form (generally a written word form).
      • The relation between a query and a document is determined by the number and frequency of the terms they have in common.
      • Affix removal algorithms remove suffixes and/or prefixes from terms, leaving a stem.
      • E.g., "connect", "connected", and "connecting" are all condensed to "connect".
  14. Near Duplicate Detection cont'd…
      Stemming Algorithm cont'd…
      • The prefix removal algorithm removes: anti, bi, co, contra, de, di, des, en, inter, intra, mini, multi, pre, pro
      • The suffix removal algorithm removes: ly, ness, ioc, iez, able, ance, ary, ce, y, dom, ee, eer, ence, ory, o
      • The derivations are converted to their stems, which are related to the original in both form and semantics.
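A toy suffix-removal stemmer in the spirit of these slides: strip the longest matching suffix while keeping a minimum stem length. The suffix list is abridged from the slide with "ing"/"ed" added for the connect example; note that a naive prefix pass would over-strip (e.g. "co" from "connect"), which is why real stemmers such as Lovins or Porter apply context-sensitive rules:

```python
# Suffixes ordered longest-first so the longest match is stripped.
SUFFIXES = sorted(
    ["ly", "ness", "able", "ance", "ary", "dom", "ee", "eer",
     "ence", "ory", "ing", "ed"], key=len, reverse=True)

def stem(word: str) -> str:
    for s in SUFFIXES:
        # Only strip if at least a 4-letter stem remains.
        if word.endswith(s) and len(word) - len(s) >= 4:
            return word[: -len(s)]
    return word

print(stem("connected"), stem("connecting"), stem("connect"))
# → connect connect connect
```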
  15. Near Duplicate Detection cont'd…
      Keyword Representation:
      • Stemming yields the keywords and their counts in each crawled page.
      • Keywords are sorted in descending order based on their counts.
      • Keywords with the highest counts, called prime keywords, are stored in one table; the remaining keywords are indexed and stored in another table.
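The keyword-representation step above can be sketched as follows; the cutoff `k` for how many top keywords count as "prime" is an assumption, since the slide only says "highest counts":

```python
from collections import Counter

def keyword_tables(stems: list[str], k: int = 3):
    counts = Counter(stems).most_common()  # sorted descending by count
    prime = dict(counts[:k])               # prime-keyword table
    rest = dict(counts[k:])                # secondary table for the remainder
    return prime, rest

prime, rest = keyword_tables(
    ["web", "crawl", "web", "index", "web", "crawl", "page"], k=2)
print(prime)  # {'web': 3, 'crawl': 2}
print(rest)   # {'index': 1, 'page': 1}
```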
  16. Near Duplicate Detection cont'd…
      Similarity Score Calculation:
      • If the prime keywords of the new web page do not match the prime keywords of any page in the table, the new page is added to the repository.
      • If all keywords of both pages are the same, the new page is a duplicate.
      • If the prime keywords of both pages are the same, the similarity score (SSM) is calculated as follows.
  17. Near Duplicate Detection cont'd…
      Table T1 (a web page in the repository): keywords K1, K2, …, Kn with counts C1, C2, …, Cn.
      Table T2 (the new web page): keywords K1, K2, …, Kn with counts C1, C2, …, Cn.
      If a keyword ki is present in both tables, let a = Δ[ki]T1 and b = Δ[ki]T2, and use the formula:
      SDc = log(count(a)/count(b)) * Abs(1 + (a - b))
  18. Near Duplicate Detection cont'd…
      • If keywords are present in T1 but not in T2, and the number of such keywords is NT1, then
        SDT1 = log(count(a)) * Abs(1 + |T2|)
      • If keywords are present in T2 but not in T1, and the number of such keywords is NT2, then
        SDT2 = log(count(b)) * Abs(1 + |T1|)
      • The similarity score of one page against another is calculated by
        SSM = (Σ(i=1..NC) SDc + Σ(i=1..NT1) SDT1 + Σ(i=1..NT2) SDT2) / N, where N = (|T1| + |T2|) / 2
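A literal transcription of the score formulas above, under the assumption that T1 and T2 are keyword-to-count tables and that Δ[ki] denotes a keyword's count. Note that under these formulas two identical tables score 0:

```python
import math

def similarity_score(t1: dict[str, int], t2: dict[str, int]) -> float:
    total = 0.0
    for k in t1.keys() & t2.keys():                  # keyword present in both
        a, b = t1[k], t2[k]
        total += math.log(a / b) * abs(1 + (a - b))  # SDc
    for k in t1.keys() - t2.keys():                  # present in T1 only
        total += math.log(t1[k]) * abs(1 + len(t2))  # SDT1
    for k in t2.keys() - t1.keys():                  # present in T2 only
        total += math.log(t2[k]) * abs(1 + len(t1))  # SDT2
    n = (len(t1) + len(t2)) / 2                      # N = (|T1| + |T2|) / 2
    return total / n

t1 = {"web": 3, "crawl": 2, "index": 2}  # page already in the repository
t2 = {"web": 3, "crawl": 2, "page": 2}   # newly crawled page
print(similarity_score(t1, t2))
```

The computed score is then compared against the predefined threshold described on the next slide.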
  19. Near Duplicate Detection cont'd…
      • Web documents with a similarity score greater than a predefined threshold are considered near duplicates.
      • These near-duplicate pages are not added to the search engine's repository.
  20. Advantages
      • Saves network bandwidth
      • Reduces the storage cost of search engines
      • Improves the quality of the search index
  21. Conclusion
      • The proposed method addresses the difficulties of information retrieval from the web.
      • The approach detects near duplicate web pages efficiently, based on the keywords extracted from the pages.
      • It reduces the memory space required for web repositories.
      • Near duplicate detection improves search engine quality.
  22. References
      • Brin, S., Davis, J. and Garcia-Molina, H. (1995) "Copy detection mechanisms for digital documents", In Proceedings of the Special Interest Group on Management of Data (SIGMOD 1995), ACM Press.
      • Pandey, S. and Olston, C. (2005) "User-centric Web crawling", In Proceedings of the 14th International Conference on World Wide Web, pp. 401–41.
      • Xiao, C., Wang, W., Lin, X. and Xu Yu, J. (2008) "Efficient Similarity Joins for Near Duplicate Detection", In Proceedings of the 17th International Conference on World Wide Web, pp. 131–140.
      • Lovins, J.B. (1968) "Development of a stemming algorithm", Mechanical Translation and Computational Linguistics.
  23. Questions
  24. Thank you