Analyzing Web Crawler as Feed Forward Engine for Efficient Solution to Search Problem in the Minimum Amount of Time through a Distributed Framework

Presentation slides for the paper presented at the International Conference on Information Science and Applications (ICISA), Seoul, 2010.

Paper link: http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=5480411&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D5480411

  • Previous contributions have addressed web server performance analysis
  • Problems
  • Analyzing Web Crawler as Feed Forward Engine for Efficient Solution to Search Problem in the Minimum Amount of Time through a Distributed Framework

    1. Authors: Muhammad Atif Qureshi, Arjumand Younus, Francisco Rojas. International Conference on Information Science and Applications 2010
    2. Outline
        • Introduction
        • Implementation Alternatives
        • Crawler Architecture
        • Implications
        • Conclusion
    3. Analyzing Web Crawler as Feed Forward Engine for Efficient Solution to Search Problem in the Minimum Amount of Time through a Distributed Framework
    4. Introduction
        • Background
        • Motivation
        • Problem Statement
        • Contributions
    5. [ Introduction ]
        • Web crawler
            • Description
                • Program that downloads web pages recursively by fetching links from a seed of web pages
                • Backbone of search engine’s data repository
        • Competing factors among search engines
            • Coverage of the Internet
            • Throughput of complete download
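The description on this slide (download pages recursively by following links from a seed set) can be illustrated with a minimal breadth-first sketch. Everything named here is an assumption for illustration: the `crawl` function, the `fetch` callback, and the `FAKE_WEB` dictionary that stands in for real HTTP downloads so the example runs without network access. It is not the authors' crawler.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl; `fetch(url)` returns HTML text or None."""
    frontier = deque(seeds)      # URLs waiting to be downloaded
    seen = set(seeds)            # URLs already queued, to avoid repeats
    downloaded = []
    while frontier and len(downloaded) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:         # unknown or failed page: skip it
            continue
        downloaded.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)   # resolve relative links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return downloaded


# Hypothetical in-memory "web" standing in for real HTTP downloads.
FAKE_WEB = {
    "http://a.example/": '<a href="/x">x</a> <a href="http://b.example/">b</a>',
    "http://a.example/x": '<a href="/">home</a>',
    "http://b.example/": "no links here",
}

pages = crawl(["http://a.example/"], FAKE_WEB.get)
```

Passing the downloader in as `fetch` keeps the traversal logic separate from the I/O model, which is the part the later slides compare (threads vs. events).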
    6. [ Introduction ]
        • A web crawler needs a highly optimized system architecture with the ability to
            • Download a large number of web pages per second
            • Be robust against crashes
            • Be manageable and considerate of resources and web servers
        • Most prior work focuses on “improving strategy for web crawlers” [LLWL08] [SS02]
            • Our focus is to provide a convincing analysis of the web crawler from a systems viewpoint
    7. [ Introduction ]
        • Description: analysis of web crawling from a systems perspective
        • Issues
            • Threads vs. events
            • Distributed implementation
            • Prevention of DDoS attack
            • Web crawler as feed forward engine for next phases of search engine
    8. [ Introduction ]
        • First ever threads vs. events debate from a web crawler’s perspective
        • MapReduce architecture for distributed web crawler implementation
        • Implications for the birth of an operating system for Internet-based applications, e.g. web crawlers
    9. Implementation Alternatives
        • Threads vs. Events
        • Performance Evaluation for Threads vs. Events
    10. [ Implementation Alternatives ]
        • Problems in Threads
            • Large memory footprint
            • Context switch overhead
            • Cache and TLB misses
            • Expensive synchronization mechanisms
        • Problems in Events
            • Add to programmers’ difficulty
            • Debugging is troublesome
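To make the two alternatives concrete, here is a small sketch that runs the same simulated download workload once with a thread pool and once with an event loop. The slides do not say which language or libraries the authors used; Python's `concurrent.futures` and `asyncio` are stand-ins, and `DELAY`, the function names, and the example URLs are all assumptions. The download itself is faked with a sleep so that only the concurrency model differs.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

DELAY = 0.01  # simulated network latency per page, in seconds (assumed)


def fetch_blocking(url):
    time.sleep(DELAY)            # the thread blocks while "waiting on the network"
    return url, len(url)


async def fetch_async(url):
    await asyncio.sleep(DELAY)   # the coroutine yields to the event loop instead
    return url, len(url)


def crawl_with_threads(urls, pool_size):
    # One OS thread per in-flight download, up to pool_size.
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        return list(pool.map(fetch_blocking, urls))


async def _crawl_events(urls, pool_size):
    limit = asyncio.Semaphore(pool_size)   # cap on in-flight downloads

    async def bounded(url):
        async with limit:
            return await fetch_async(url)

    return await asyncio.gather(*(bounded(u) for u in urls))


def crawl_with_events(urls, pool_size):
    # A single thread multiplexes all downloads through the event loop.
    return asyncio.run(_crawl_events(urls, pool_size))


urls = [f"http://site{i}.example/" for i in range(50)]
thread_results = crawl_with_threads(urls, pool_size=10)
event_results = crawl_with_events(urls, pool_size=10)
```

Both variants return the same results in the same order. Wrapping each call in `time.perf_counter()` while varying `pool_size` or the number of URLs would give a rough analogue of the two experiments described on the following slide, though numbers from a simulated delay say nothing about the authors' measurements.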
    11. [ Implementation Alternatives ]
        • Environment
            • CPU: Intel Pentium 4 Core 2 Duo 3GHz
            • RAM: 3.2 GB
            • OS: Linux 2.6.28-11-generic
        • Experiments
            • 1st experiment:
                • Comparison of crawler throughput with varying pool size
            • 2nd experiment:
                • Comparison of crawler throughput with varying seed URL size
    12. [ Implementation Alternatives ] The number of seed URLs was kept constant at 1000
    13. [ Implementation Alternatives ] Pool size was kept constant at 200
    14. Crawler Architecture
        • High Level View of MapReduce Usage
        • High Level Distributed Design with MapReduce
        • Prevention of DDoS Attack
    15. [ Crawler Architecture ]
    16. [ Crawler Architecture ] The distributed implementation was done with our own version of the MapReduce [DG04] library.
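The slides do not show how the authors' MapReduce library is structured, so the following is only a generic, single-process sketch of how a crawl step can be phrased as map, shuffle, and reduce. The function names, the choice of host as the grouping key, and the sample `fetched` data are assumptions made for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def map_page(url, links):
    """Map: emit one (host, link) pair per outgoing link of a fetched page."""
    for link in links:
        yield urlsplit(link).hostname, link


def reduce_host(host, links):
    """Reduce: collapse one host's links into a sorted, duplicate-free list."""
    return host, sorted(set(links))


def map_reduce(records, mapper, reducer):
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)   # shuffle: group values by key
    return dict(reducer(k, v) for k, v in groups.items())


# Hypothetical crawl output: (page URL, links found on that page).
fetched = [
    ("http://a.example/", ["http://b.example/1", "http://a.example/x"]),
    ("http://b.example/1", ["http://a.example/x", "http://b.example/2"]),
]
next_frontier = map_reduce(fetched, map_page, reduce_host)
```

Grouping by host means each reducer sees all pending URLs for one server, which is one plausible place to apply per-server rate limits in a distributed setting.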
    17. [ Crawler Architecture ]
        • Target server: yahoo.com
        • Same crawling machines
        • Simultaneous and continuing connections
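This slide appears to describe the situation a crawler must avoid: the same crawling machines holding simultaneous, continuing connections to one target server, which can look like a DDoS attack. One common way to avoid it, sketched below, is a per-host minimum delay between requests. The `PolitenessGate` class, its `min_delay` parameter, and the injectable `clock` are assumptions for illustration, and this is not presented as the authors' mechanism.

```python
import time
from urllib.parse import urlsplit


class PolitenessGate:
    """Allows a request to a host only if `min_delay` seconds have passed."""

    def __init__(self, min_delay, clock=time.monotonic):
        self.min_delay = min_delay
        self.clock = clock        # injectable so the example needs no real waiting
        self.last_hit = {}        # host -> time of the last permitted request

    def try_acquire(self, url):
        host = urlsplit(url).hostname
        now = self.clock()
        last = self.last_hit.get(host)
        if last is not None and now - last < self.min_delay:
            return False          # too soon: the caller should requeue the URL
        self.last_hit[host] = now
        return True


# Usage with a fake clock so the behaviour is deterministic.
fake_now = [0.0]
gate = PolitenessGate(min_delay=2.0, clock=lambda: fake_now[0])
first = gate.try_acquire("http://yahoo.com/")        # permitted
second = gate.try_acquire("http://yahoo.com/news")   # refused, same host too soon
other = gate.try_acquire("http://example.org/")      # permitted, different host
fake_now[0] = 2.5
third = gate.try_acquire("http://yahoo.com/news")    # permitted after the delay
```

This version keeps its state in one process. With several crawling machines, the per-host timestamps would have to be shared or each host assigned to a single machine.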
    18. [ Crawler Architecture ] URL queue: push on the right side, pop from the left side
        Order  URL      Priority
        1      a.com    1
        2      a.com/a  7
        3      1.a.com  5
        4      b.com    2
        5      c.net    3
        6      1.b.com  6
        7      c.com    4
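The order and priority values on this slide are consistent with a simple spreading rule: URLs from a domain not seen before are served first, then URLs on a new host of an already seen domain, then URLs on a host that was already queued, with push order breaking ties. That rule is inferred from the seven rows and is not stated in the slides. The sketch below implements it with hypothetical helpers (`host_of`, `domain_of`, `assign_priorities`), using a naive "last two labels" notion of domain.

```python
from urllib.parse import urlsplit


def host_of(url):
    # The slide's URLs have no scheme; prefix "//" so urlsplit finds the host.
    return urlsplit(url if "//" in url else "//" + url).hostname or ""


def domain_of(host):
    # Naive assumption: the domain is the last two dot-separated labels.
    return ".".join(host.split(".")[-2:])


def assign_priorities(urls):
    """Return {url: priority}, where 1 is popped first."""
    seen_domains, seen_hosts = set(), set()
    keyed = []
    for order, url in enumerate(urls):
        host = host_of(url)
        domain = domain_of(host)
        if domain not in seen_domains:
            tier = 0              # first URL of a new domain
        elif host not in seen_hosts:
            tier = 1              # new host on a known domain
        else:
            tier = 2              # host already queued
        seen_domains.add(domain)
        seen_hosts.add(host)
        keyed.append((tier, order, url))
    ranked = sorted(keyed)        # by tier, then by push order
    return {url: rank for rank, (_, _, url) in enumerate(ranked, start=1)}


pushed = ["a.com", "a.com/a", "1.a.com", "b.com", "c.net", "1.b.com", "c.com"]
priorities = assign_priorities(pushed)
```

Popping in this priority order keeps consecutive requests on different servers where possible, which fits the slide's place in the DDoS-prevention part of the talk.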
    19. IMPLICATIONS
    20. SEARCH ENGINE OPERATING SYSTEM
        • Observations during implementation of feed forward mechanisms in web crawler
            • Exokernel-based approach favorable for web crawler
                • Priority queue control
                • File system should not provide consistency guarantees
                • Indexing and dictionary concept should be supported by file system
    21. References
        • [DG04] Dean, J., and Ghemawat, S., “MapReduce: simplified data processing on large clusters,” In Proc. 6th Int’l Symposium on Operating Systems Design and Implementation, pp. 137-150, San Francisco, CA, 2004.
        • [LLWL08] Lee, H.T., Leonard, D., Wang, X., and Loguinov, D., “IRLbot: scaling to 6 billion pages and beyond,” In Proc. 17th Int’l Conf. on World Wide Web, Beijing, China, April 21-25, 2008.
        • [SS02] Shkapenyuk, V. and Suel, T., “Design and Implementation of a High-Performance Distributed Web Crawler,” In Proc. 18th Int’l Conf. on Data Engineering, pp. 357-368, San Jose, California, USA, Feb. 2002.
