WIRE, an open-source information retrieval environment (OSWIR 2005 Compiegne)


  1. WIRE: an Open Source Web Information Retrieval Environment. Carlos Castillo and Ricardo Baeza-Yates, Center for Web Research (http://www.cwr.cl/), CS Dept., University of Chile. OSWIR 2005, Compiegne, France, September 19, 2005.
  2. Outline: 1 WIRE Project; 2 Web Crawler; 3 Conclusions.
  3. Motivation: study subsets of the Web (1-50 million pages). Goals: high performance; keep as much data as possible; study scheduling algorithms. Limitations of existing tools: wget is not enough; large-scale crawlers were not publicly available.
  9. General Architecture (diagram). Labels shown: XML Index, XML Search, Focused Crawling, Text Search, Text Index, Crawling, Collection, Statistics, Importing, Extracting, Clustering, Classification.
  10. Characteristics: roughly 25,000 lines of open-source C/C++ code; asynchronous DNS and HTTP requests, with small memory and processing requirements (except during the analysis); highly configurable: rate of download, parser parameters, scheduling policy, etc.
  13. Web Crawler (diagram). Four programs work around the Collection: Manager (page score calculations, long-term scheduling); Harvester (short-term scheduling, network transfers); Gatherer (parsing, link extraction); Seeder (link resolving, robots exclusions).
  14. Scheduling: Future value - Current value = Profit.
      P1: quality 0.4, freshness 0.1, visited 1; future value 0.4, current value 0.04, profit 0.36.
      P2: quality 0.7, freshness 0.9, visited 1; future value 0.7, current value 0.63, profit 0.07.
      P3: quality 0.6, freshness not available, visited 0; future value 0.6, current value 0, profit 0.6.
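The three example pages on the scheduling slide can be reproduced with a few lines of code. This is a minimal sketch, not WIRE's implementation: it assumes that the current value of a visited page is quality times freshness and that the future value is simply the quality, which is what the slide's numbers suggest. The `Page` class and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    name: str
    quality: float
    freshness: Optional[float]  # None when the page has never been downloaded
    visited: bool


def current_value(page: Page) -> float:
    # Assumption: an unvisited page contributes nothing to the collection yet.
    if not page.visited or page.freshness is None:
        return 0.0
    return page.quality * page.freshness


def future_value(page: Page) -> float:
    # Assumption: right after a download the copy is fully fresh.
    return page.quality


def profit(page: Page) -> float:
    return future_value(page) - current_value(page)


pages = [
    Page("P1", 0.4, 0.1, True),
    Page("P2", 0.7, 0.9, True),
    Page("P3", 0.6, None, False),
]

# Download the most profitable pages first.
ranked = sorted(pages, key=profit, reverse=True)
for page in ranked:
    print(f"{page.name}: profit {profit(page):.2f}")
```

Under these assumptions the unvisited page P3 is scheduled first, followed by the stale P1, and the still-fresh P2 comes last.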
  15. Downloading pages (diagram). The World Wide Web is shown as Web sites S1 to S7, each with its own column of Web pages Pi,j (page j of site i); the sites hold between 2 and 6 pages each.
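The slide only shows pages grouped by site and does not state the download policy. One common way to use such per-site queues, sketched here purely as an assumption, is to take one page from each site in turn so that consecutive requests go to different servers. The function name `interleave_by_site` is hypothetical.

```python
from collections import OrderedDict, deque


def interleave_by_site(pages):
    """Return (site, page) pairs in round-robin order over the sites.

    `pages` is an iterable of (site, page) pairs in any order.
    """
    queues = OrderedDict()
    for site, page in pages:
        queues.setdefault(site, deque()).append(page)

    order = []
    while queues:
        # Copy the keys because exhausted sites are removed while looping.
        for site in list(queues):
            order.append((site, queues[site].popleft()))
            if not queues[site]:
                del queues[site]
    return order


batch = [
    ("S1", "P1,1"), ("S1", "P1,2"),
    ("S2", "P2,1"),
    ("S3", "P3,1"), ("S3", "P3,2"),
]
schedule = interleave_by_site(batch)
```

A real harvester would also wait between requests to the same site; that delay is left out of this sketch.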
  16. Storing contents (diagram). Numbered steps 1 to 3 connect the Document, hash( ), a "Content seen?" check, the Disk Storage and a Free space list.
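The elements of the storage slide (a hash, a "content seen?" check, disk storage and a free space list) can be illustrated with a toy in-memory store. This is a sketch under stated assumptions, not WIRE's storage code: SHA-1 as the hash, a `bytearray` standing in for the disk file, and first-fit allocation from the free list are all choices made for the example, and `ContentStore` is a hypothetical name.

```python
import hashlib


class ContentStore:
    def __init__(self):
        self.disk = bytearray()  # stands in for the on-disk storage file
        self.seen = {}           # content hash -> doc_id that first stored it
        self.location = {}       # doc_id -> (offset, length)
        self.free = []           # free space list: (offset, length) holes

    def _allocate(self, size):
        # First fit: reuse the first hole that is large enough.
        for i, (offset, length) in enumerate(self.free):
            if length >= size:
                if length == size:
                    del self.free[i]
                else:
                    self.free[i] = (offset + size, length - size)
                return offset
        # No hole fits: grow the storage at the end.
        offset = len(self.disk)
        self.disk.extend(b"\0" * size)
        return offset

    def store(self, doc_id, content):
        """Store content; return the doc_id that holds it (an earlier one if duplicate)."""
        digest = hashlib.sha1(content).hexdigest()
        if digest in self.seen:  # content seen before: do not store it twice
            return self.seen[digest]
        offset = self._allocate(len(content))
        self.disk[offset:offset + len(content)] = content
        self.seen[digest] = doc_id
        self.location[doc_id] = (offset, len(content))
        return doc_id

    def delete(self, doc_id):
        # Return the document's space to the free list.
        self.free.append(self.location.pop(doc_id))
        self.seen = {h: d for h, d in self.seen.items() if d != doc_id}

    def read(self, doc_id):
        offset, length = self.location[doc_id]
        return bytes(self.disk[offset:offset + length])
```

Storing the same bytes under a second document ID returns the first ID instead of writing again, and space released by `delete` is reused by later documents that fit.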
  17. URL parsing, for http://host.domain.com/dir/file.html: (1) hash the host name, h1('host.domain.com'); (2) the entry for host.domain.com gives site ID 235; (3) hash the site ID together with the path, h2('235 dir/file.html'); (4) the entry for '235 dir/file.html' gives document ID 9421. Result: SITE-ID = 235; DOC-ID = 9421.
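The two-step lookup on the URL parsing slide can be sketched with two hash tables. This is an illustration only: Python dictionaries play the role of the hash functions h1 and h2, IDs are handed out sequentially (an assumption, so the values differ from the slide's 235 and 9421), and the class name `UrlIndex` is hypothetical.

```python
from urllib.parse import urlsplit


class UrlIndex:
    def __init__(self):
        self.site_ids = {}  # host name -> SITE-ID (dict hashing stands in for h1)
        self.doc_ids = {}   # "SITE-ID path" -> DOC-ID (dict hashing stands in for h2)

    def resolve(self, url):
        """Return (site_id, doc_id) for a URL, assigning new IDs when needed."""
        parts = urlsplit(url)
        host = parts.hostname or ""  # urlsplit lower-cases the host name
        path = parts.path.lstrip("/")
        if parts.query:
            path += "?" + parts.query

        # Steps 1 and 2: host name -> site ID.
        site_id = self.site_ids.setdefault(host, len(self.site_ids) + 1)

        # Steps 3 and 4: site ID plus path -> document ID.
        key = f"{site_id} {path}"
        doc_id = self.doc_ids.setdefault(key, len(self.doc_ids) + 1)
        return site_id, doc_id


index = UrlIndex()
print(index.resolve("http://host.domain.com/dir/file.html"))
```

Keying the second table on the site ID rather than the host name keeps the keys short and means a path only has to be compared within its own site.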
  18. Practical problems (the devil is in the details): varying quality of service; wrong DNS records and temporary DNS failures; HTTP responses without headers, or with wrong headers or dates; HTML parsing has to be very tolerant; duplicate pages, session-ids, etc.
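As an illustration of the session-id problem, the sketch below removes session identifiers from a URL so that the same page reached through different sessions maps to one URL. The slides do not say how WIRE handles this; the parameter names in `SESSION_PARAMS` are an assumed, incomplete list, and `strip_session_ids` is a hypothetical helper.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of common session parameter names; extend as needed.
SESSION_PARAMS = {"phpsessid", "jsessionid", "sid", "sessionid"}


def strip_session_ids(url):
    parts = urlsplit(url)

    # Drop a ";jsessionid=..." path parameter if present.
    path = parts.path
    if ";jsessionid=" in path.lower():
        path = path.split(";", 1)[0]

    # Keep every query parameter except the session ones.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]

    # The fragment is dropped because it is never sent to the server.
    return urlunsplit((parts.scheme, parts.netloc, path, urlencode(query), ""))


print(strip_session_ids("http://example.com/a.php?id=3&PHPSESSID=abc123"))
```

Normalizing URLs this way reduces duplicates before download; pages that are still identical afterwards can be caught by comparing content hashes.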
  24. Data analysis: includes link analysis and extraction of statistics (data is exported as .csv files); reports are generated using LaTeX and gnuplot. Report about documents: histograms of size, in- and out-degree, link scores, page depth, HTTP responses, age, media types, etc. Report about sites: degree distribution in the hostgraph, maximum depth, pages per site, link structure, etc.
  28. Conclusions: a tool for Web characterization studies; can be extended for other purposes; code and documentation available at http://www.cwr.cl/projects/. Thank you.
