Analyzing Web Crawler as Feed Forward Engine for Efficient Solution to Search Problem in the Minimum Amount of Time through a Distributed Framework

Presentation slides for the paper presented at the International Conference on Information Science and Applications (ICISA), Seoul, 2010.

Paper link: http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=5480411&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D5480411


  1. Authors: Muhammad Atif Qureshi, Arjumand Younus, Francisco Rojas. International Conference on Information Science and Applications 2010
  2. Outline
     • Introduction
     • Implementation Alternatives
     • Crawler Architecture
     • Implications
     • Conclusion
  3. Analyzing Web Crawler as Feed Forward Engine for Efficient Solution to Search Problem in the Minimum Amount of Time through a Distributed Framework
  4. Introduction
     • Background
     • Motivation
     • Problem Statement
     • Contributions
  5. [ Introduction ]
     • Web crawler
         • Description
             • Program that downloads web pages recursively by fetching links from a seed of web pages
             • Backbone of search engine’s data repository
     • Competing factors among search engines
         • Coverage of internet
         • Throughput of complete download
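The description above (download pages recursively, starting from seed pages and following their links) can be sketched as a breadth-first loop. This is a minimal illustration, not the authors' crawler: `crawl`, `LinkExtractor`, and the `TOY_WEB` pages are hypothetical names, and the `fetch` callable is injected so the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl: fetch(url) returns HTML text, or None on failure."""
    frontier = deque(seeds)   # URLs waiting to be downloaded
    seen = set(seeds)         # URLs already queued, to avoid repeats
    pages = {}                # url -> downloaded HTML
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages


# Hypothetical three-page "web" standing in for real HTTP downloads.
TOY_WEB = {
    "http://a.example/": '<a href="/x">x</a> <a href="http://b.example/">b</a>',
    "http://a.example/x": '<a href="/">home</a>',
    "http://b.example/": "",
}

pages = crawl(["http://a.example/"], TOY_WEB.get)
```

Swapping `TOY_WEB.get` for a real HTTP download function would turn the same loop into a single-machine crawler; the frontier and the seen-set are the parts a distributed design has to partition.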
  6. [ Introduction ]
     • A web crawler needs a highly optimized system architecture with the ability to
         • Download a large number of web pages per second
         • Be robust against crashes
         • Be manageable and considerate of resources and web servers
     • Most prior work focuses on “improving strategy for web crawlers” [LLWL08] [SS02]
         • Our focus is to provide a convincing analysis of the web crawler from a systems viewpoint
  7. [ Introduction ]
     • Description: analysis of web crawling from a systems perspective
     • Issues
         • Threads vs. events
         • Distributed implementation
         • Prevention of DDoS attack
         • Web crawler as feed forward engine for the next phases of the search engine
  8. [ Introduction ]
     • First ever threads vs. events debate from the web crawler's perspective
     • MapReduce architecture for distributed web crawler implementation
     • Implications towards the birth of an operating system for Internet-based applications, e.g. web crawlers
  9. Implementation Alternatives
     • Threads vs. Events
     • Performance Evaluation for Threads vs. Events
  10. [ Implementation Alternatives ]
     • Problems in Threads
         • Large memory footprint
         • Context switch overhead
         • Cache and TLB misses
         • Expensive synchronization mechanisms
     • Problems in Events
         • Add to programmers’ difficulty
         • Debugging is troublesome
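To make the two alternatives concrete, the sketch below runs the same simulated download workload once on a thread pool and once on an event loop, each limited to `pool_size` concurrent fetches. Everything here is an assumption for illustration: the fetch functions only sleep, the function names are hypothetical, and the timings say nothing about the authors' measurements.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch_blocking(url, delay=0.01):
    """Stand-in for a blocking HTTP download (assumption: just sleeps)."""
    time.sleep(delay)
    return len(url)


async def fake_fetch_async(url, delay=0.01):
    """Stand-in for a non-blocking HTTP download (assumption: just sleeps)."""
    await asyncio.sleep(delay)
    return len(url)


def crawl_with_threads(urls, pool_size):
    """Thread-based variant: one OS thread per in-flight download."""
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        return list(pool.map(fake_fetch_blocking, urls))


def crawl_with_events(urls, pool_size):
    """Event-based variant: one thread, a semaphore caps in-flight downloads."""
    async def run():
        limit = asyncio.Semaphore(pool_size)

        async def bounded(url):
            async with limit:
                return await fake_fetch_async(url)

        return await asyncio.gather(*(bounded(u) for u in urls))

    return asyncio.run(run())


def throughput(crawl_fn, urls, pool_size):
    """Pages per second for one run of crawl_fn."""
    start = time.perf_counter()
    results = crawl_fn(urls, pool_size)
    elapsed = time.perf_counter() - start
    return len(results) / elapsed


urls = [f"http://h{i}.example/" for i in range(20)]
thread_results = crawl_with_threads(urls, pool_size=5)
event_results = crawl_with_events(urls, pool_size=5)
```

Calling `throughput(crawl_with_threads, urls, n)` and `throughput(crawl_with_events, urls, n)` for several values of `n` gives a toy comparison across pool sizes; real results depend on actual network I/O and on the costs listed above.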
  11. [ Implementation Alternatives ]
     • Environment
         • CPU: Intel Pentium 4 Core 2 Duo 3GHz
         • RAM: 3.2 GB
         • OS: Linux 2.6.28-11-generic
     • Experiments
         • 1st experiment: comparison of crawler throughput with varying pool size
         • 2nd experiment: comparison of crawler throughput with varying seed URL size
  12. [ Implementation Alternatives ] The number of seed URLs was kept constant at 1000.
  13. [ Implementation Alternatives ] The pool size was kept constant at 200.
  14. Crawler Architecture
     • High Level View of MapReduce Usage
     • High Level Distributed Design with MapReduce
     • Prevention of DDoS Attack
  15. [ Crawler Architecture ]
  16. [ Crawler Architecture ] The distributed implementation was done with our own version of the MapReduce [DG04] library.
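The slides do not say what the map and reduce steps compute, so the following is only one plausible arrangement, assumed for illustration: the map step emits each outgoing link keyed by its host, and the reduce step removes duplicates per host. `run_mapreduce` is a hypothetical in-process stand-in, not the authors' library.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def map_links(url, outlinks):
    """Map step (assumed): emit (host, link) for every link found on a page."""
    for link in outlinks:
        yield urlsplit(link).netloc, link


def reduce_links(host, links):
    """Reduce step (assumed): deduplicate the links collected for one host."""
    return host, sorted(set(links))


def run_mapreduce(records, mapper, reducer):
    """Single-process driver: map every record, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return dict(reducer(k, v) for k, v in sorted(groups.items()))


# Hypothetical crawl output: (page URL, links extracted from that page).
records = [
    ("http://a.example/", ["http://b.example/1", "http://c.example/"]),
    ("http://d.example/", ["http://b.example/1", "http://b.example/2"]),
]
frontier_by_host = run_mapreduce(records, map_links, reduce_links)
```

Keying by host means all URLs for one server end up in a single reduce group, which is a convenient place to apply per-server limits.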
  17. [ Crawler Architecture ]
     • Target server: yahoo.com
     • Same crawling machines
     • Simultaneous and continuing connections
  18. [ Crawler Architecture ] URL queue: push on the right side, pop from the left side.

      Order | URL     | Priority
      1     | a.com   | 1
      2     | a.com/a | 7
      3     | 1.a.com | 5
      4     | b.com   | 2
      5     | c.net   | 3
      6     | 1.b.com | 6
      7     | c.com   | 4
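The slide gives arrival orders and priorities but not the rule that maps one to the other. One rule that reproduces the listed priorities is a two-level round-robin: interleave hosts within each domain, then interleave domains, so that URLs on the same server are spread as far apart as possible. The sketch below is a guess that is consistent with those figures, not the authors' stated algorithm; `domain_of` naively takes the last two labels of the host name.

```python
from collections import deque
from urllib.parse import urlsplit


def host_of(url):
    """Host part of a scheme-less URL such as 'a.com/a'."""
    return urlsplit("//" + url).netloc


def domain_of(host):
    """Naive registered domain (assumption): the last two labels of the host."""
    return ".".join(host.split(".")[-2:])


def interleave(queues):
    """Round-robin merge: take one item from each queue per round."""
    queues = [deque(q) for q in queues if q]
    merged = []
    while queues:
        for q in queues:
            merged.append(q.popleft())
        queues = [q for q in queues if q]
    return merged


def assign_priorities(urls):
    """Map each URL to a 1-based fetch priority that spreads out same-server hits."""
    by_domain = {}  # domain -> host -> URLs, all in arrival order
    for url in urls:
        host = host_of(url)
        by_domain.setdefault(domain_of(host), {}).setdefault(host, []).append(url)
    per_domain = [interleave(hosts.values()) for hosts in by_domain.values()]
    ordered = interleave(per_domain)
    return {url: rank for rank, url in enumerate(ordered, start=1)}


arrival_order = ["a.com", "a.com/a", "1.a.com", "b.com", "c.net", "1.b.com", "c.com"]
priorities = assign_priorities(arrival_order)
```

With the seven URLs from the slide, `priorities` matches the Priority values shown there; for example `a.com/a` is fetched last because `a.com` has already been visited once.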
  19. IMPLICATIONS
  20. SEARCH ENGINE OPERATING SYSTEM
     • Observations during implementation of feed forward mechanisms in web crawler
         • Exokernel based approach favorable for web crawler
             • Priority queue control
             • Filesystem should not provide consistency guarantees
             • Indexing and dictionary concept should be supported by file system
  21. References
     • [DG04] Dean, J., and Ghemawat, S., “MapReduce: simplified data processing on large clusters,” In Proc. 6th Int’l Symposium on Operating Systems Design and Implementation, San Francisco, CA, 2004: 137-150.
     • [LLWL08] Lee, H.T., Leonard, D., Wang, X., and Loguinov, D., “IRLbot: scaling to 6 billion pages and beyond,” In Proc. 17th Int’l Conf. on World Wide Web, April 21-25, 2008, Beijing, China.
     • [SS02] Shkapenyuk, V. and Suel, T., “Design and Implementation of a High-Performance Distributed Web Crawler,” In Proc. 18th Int’l Conf. on Data Engineering, pp. 3-57, San Jose, California, USA, Feb. 2002.
