Outline                           WIRE Project                   Web Crawler               Conclusions




            WIRE: an Open Source Web Information
                    Retrieval Environment

                           Carlos Castillo and Ricardo Baeza-Yates
                                            Center for Web Research
                                             http://www.cwr.cl/
                                          CS Dept., University of Chile


                                              OSWIR 2005
                                           Compiegne, France
                                           September 19, 2005

Carlos Castillo and Ricardo Baeza-Yates                                        Center for Web Research
WIRE: an Open Source Web Information Retrieval Environment                        http://www.cwr.cl/




          1 WIRE Project



          2 Web Crawler



          3 Conclusions







Motivation


          Study subsets of the Web (1-50 million pages)
            ✓ We want high performance
            ✓ We want to keep as much data as possible
            ✓ We want to study scheduling algorithms
            ✗ wget is not enough
            ✗ Large-scale crawlers were not publicly available







General Architecture

          [Figure: general architecture diagram. Components shown: Crawling,
          Focused Crawling, Importing, Collection, Text Index, Text Search,
          XML Index, XML Search, Statistics, Extracting, Clustering,
          Classification]






Characteristics



          - Roughly 25,000 lines of open-source C/C++ code
          - Asynchronous DNS and HTTP requests; small memory
            and processing requirements (except during the analysis)
          - Highly configurable: rate of download, parser parameters,
            scheduling policy, etc.







Web Crawler
          [Figure: four modules arranged around a central Collection]
          - Manager: page score calculations, long-term scheduling
          - Harvester: short-term scheduling, network transfers
          - Gatherer: parsing, link extraction
          - Seeder: link resolving, robots exclusions





Scheduling


          Future Value - Current Value = Profit

          Page   quality   freshness   visited?   Future Value   Current Value   Profit
          P1     0.4       0.1         1          0.4            0.04            0.36
          P2     0.7       0.9         1          0.7            0.63            0.07
          P3     0.6       -           0          0.6            0               0.6
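
The three example pages can be reproduced with a short sketch. The slide does not state the formulas explicitly; the rules used here (future value equals quality, current value equals quality times freshness for a visited page and 0 for an unvisited one) are assumptions inferred from the numbers shown. The names `Page`, `profit`, and `schedule` are hypothetical and are not taken from the WIRE code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Page:
    name: str
    quality: float
    freshness: Optional[float]  # None when the page was never downloaded
    visited: bool


def future_value(page: Page) -> float:
    # Assumption: a freshly downloaded copy is worth the page's quality.
    return page.quality


def current_value(page: Page) -> float:
    # Assumption: the stored copy is worth quality * freshness,
    # and a page that was never visited contributes nothing.
    if not page.visited or page.freshness is None:
        return 0.0
    return page.quality * page.freshness


def profit(page: Page) -> float:
    # Profit = Future Value - Current Value, as on the slide.
    return future_value(page) - current_value(page)


def schedule(pages: List[Page]) -> List[Page]:
    # Download the most profitable pages first.
    return sorted(pages, key=profit, reverse=True)


PAGES = [
    Page("P1", quality=0.4, freshness=0.1, visited=True),
    Page("P2", quality=0.7, freshness=0.9, visited=True),
    Page("P3", quality=0.6, freshness=None, visited=False),
]

if __name__ == "__main__":
    for page in schedule(PAGES):
        print(page.name, round(profit(page), 2))
```

Under these assumptions the unvisited page P3 is scheduled first, and the high-quality but still fresh page P2 comes last.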




Downloading pages


          [Figure: the World Wide Web as a set of Web sites, each holding several Web pages]

          Web sites   S1     S2     S3     S4     S5     S6     S7
          Web pages   P1,1   P2,1   P3,1   P4,1   P5,1   P6,1   P7,1
                      P1,2   P2,2   P3,2   P4,2   P5,2   P6,2   P7,2
                      P1,3   P2,3          P4,3   P5,3   P6,3   P7,3
                      P1,4   P2,4          P4,4   P5,4          P7,4
                             P2,5          P4,5                 P7,5
                             P2,6



Storing contents
          [Figure: a Document passes through three numbered steps:
          (1) hash(document) for the "Content seen?" check,
          (2) the Free space list,
          (3) Disk Storage]
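
The three steps in the figure can be illustrated with a minimal in-memory sketch. It is not the WIRE storage code: the class `ContentStore`, the first-fit allocation policy, the use of SHA-1, and the `bytearray` standing in for the disk are all assumptions chosen for brevity.

```python
import hashlib


class ContentStore:
    """Toy document store: duplicate check, free space list, then storage."""

    def __init__(self):
        self.seen = {}           # content digest -> doc_id of the first copy
        self.free = []           # free space list: (offset, length) holes
        self.index = {}          # doc_id -> (offset, length)
        self.disk = bytearray()  # stands in for the on-disk file

    def _allocate(self, size):
        # Step 2: first-fit search in the free space list (assumed policy).
        for i, (offset, length) in enumerate(self.free):
            if length >= size:
                if length == size:
                    self.free.pop(i)
                else:
                    self.free[i] = (offset + size, length - size)
                return offset
        # No hole is large enough: grow the storage at the end.
        offset = len(self.disk)
        self.disk.extend(b"\0" * size)
        return offset

    def store(self, doc_id, content):
        # Step 1: "content seen?" test on a hash of the document.
        digest = hashlib.sha1(content).digest()
        if digest in self.seen:
            return self.seen[digest]  # duplicate: report the original
        offset = self._allocate(len(content))
        # Step 3: write the content into the allocated region.
        self.disk[offset:offset + len(content)] = content
        self.index[doc_id] = (offset, len(content))
        self.seen[digest] = doc_id
        return doc_id

    def read(self, doc_id):
        offset, length = self.index[doc_id]
        return bytes(self.disk[offset:offset + length])

    def delete(self, doc_id):
        content = self.read(doc_id)
        offset, length = self.index.pop(doc_id)
        del self.seen[hashlib.sha1(content).digest()]
        if length > 0:
            self.free.append((offset, length))  # the hole can be reused
```

Storing the same bytes twice returns the identifier of the first copy, and space released by `delete` is reused by later documents that fit into it.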




URL parsing

          URL: http://host.domain.com/dir/file.html

          1. Compute h1('host.domain.com')
          2. Look up the site entry: host.domain.com -> 235
          3. Compute h2('235 dir/file.html')
          4. Look up the document entry: 235 dir/file.html -> 9421

          Result: SITE-ID = 235; DOC-ID = 9421
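
The two-level lookup can be sketched as follows. This is an illustration only: Python dictionaries stand in for the hash tables behind h1 and h2, identifiers are handed out sequentially from 1 (so the values 235 and 9421 from the slide are not reproduced), and the class name `UrlIndex` is hypothetical.

```python
from urllib.parse import urlsplit


class UrlIndex:
    """Maps a URL to a (site_id, doc_id) pair with two hash tables."""

    def __init__(self):
        self.site_ids = {}  # host name -> site id (the h1 table)
        self.doc_ids = {}   # "site_id path" -> doc id (the h2 table)

    def resolve(self, url):
        parts = urlsplit(url)
        host = parts.hostname or ""   # urlsplit lower-cases the host name
        path = parts.path.lstrip("/")
        if parts.query:
            path += "?" + parts.query

        # Steps 1-2: hash the host name and fetch (or assign) the site id.
        site_id = self.site_ids.setdefault(host, len(self.site_ids) + 1)

        # Steps 3-4: hash "site_id path" and fetch (or assign) the doc id.
        key = f"{site_id} {path}"
        doc_id = self.doc_ids.setdefault(key, len(self.doc_ids) + 1)
        return site_id, doc_id


if __name__ == "__main__":
    index = UrlIndex()
    print(index.resolve("http://host.domain.com/dir/file.html"))
```

Prefixing the path with the site identifier keeps identical paths on different hosts apart, which is the role of the `235` prefix in the slide.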




Practical problems


          - The devil is in the details
          - Varying quality of service
          - Wrong DNS records, temporary DNS failures
          - HTTP responses without headers, or with wrong headers
            or dates
          - HTML parsing has to be very tolerant
          - Duplicate pages, session-ids, etc.
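
As an illustration of the session-id problem, one common mitigation is to remove session parameters from a URL before checking whether it is already known. The sketch below is not taken from WIRE; the parameter names in `SESSION_PARAMS` are an assumed, incomplete list of frequently seen names, and the function name is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of query parameters that typically carry a session id.
SESSION_PARAMS = {"phpsessid", "jsessionid", "sessionid", "sid"}


def strip_session_id(url):
    """Return the URL without session-id parameters and without fragment."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in SESSION_PARAMS
    ]
    # Rebuild the URL; the empty last element drops any fragment.
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), "")
    )


if __name__ == "__main__":
    print(strip_session_id("http://example.com/a.html?PHPSESSID=abc123&page=2"))
```

Two URLs that differ only in their session id then normalize to the same string, so the page is fetched once. Session ids embedded in the path are not handled by this sketch.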







Data analysis


          - Includes link analysis and extraction of statistics (data is
            exported as .csv files)
          - Reports are generated using LaTeX and gnuplot
          - Report about documents: histograms of size, in- and
            out-degree, link scores, page depth, HTTP responses,
            age, media types, etc.
          - Report about sites: degree distribution in the hostgraph,
            maximum depth, pages per site, link structure, etc.
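
A small sketch of one such statistic, the in- and out-degree histograms, exported as .csv so that a tool like gnuplot could plot it. The input format (a list of source/destination pairs), the column names, and the function names are assumptions for illustration and do not describe WIRE's actual file formats.

```python
import csv
import io
from collections import Counter


def degree_histograms(links):
    """links: iterable of (source, destination) page pairs."""
    links = list(links)
    out_degree = Counter(src for src, _ in links)
    in_degree = Counter(dst for _, dst in links)
    pages = set(out_degree) | set(in_degree)
    # Histogram: degree -> number of pages having that degree.
    in_hist = Counter(in_degree.get(page, 0) for page in pages)
    out_hist = Counter(out_degree.get(page, 0) for page in pages)
    return in_hist, out_hist


def write_histogram_csv(hist, fileobj):
    writer = csv.writer(fileobj, lineterminator="\n")
    writer.writerow(["degree", "pages"])
    for degree in sorted(hist):
        writer.writerow([degree, hist[degree]])


if __name__ == "__main__":
    example_links = [("a", "b"), ("a", "c"), ("b", "c")]
    in_hist, _ = degree_histograms(example_links)
    buffer = io.StringIO()
    write_histogram_csv(in_hist, buffer)
    print(buffer.getvalue())
```

Only pages that appear in at least one link are counted; pages with no links at all would have to be supplied separately.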






Conclusions


           ✓ A tool for Web characterization studies
           ✓ Can be extended for other purposes
           ✓ Code and documentation available at
             http://www.cwr.cl/projects/

                                                 Thank you.




Carlos Castillo and Ricardo Baeza-Yates                                     Center for Web Research
WIRE: an Open Source Web Information Retrieval Environment                     http://www.cwr.cl/
Outline                           WIRE Project                Web Crawler               Conclusions



Conclusions


           V A tool for Web characterization studies
           V Can be extended for other purposes
           V Code and documentation available at
             http://www.cwr.cl/projects/

                                                 Thank you.




Carlos Castillo and Ricardo Baeza-Yates                                     Center for Web Research
WIRE: an Open Source Web Information Retrieval Environment                     http://www.cwr.cl/
Outline                           WIRE Project                Web Crawler               Conclusions



Conclusions


           V A tool for Web characterization studies
           V Can be extended for other purposes
           V Code and documentation available at
             http://www.cwr.cl/projects/

                                                 Thank you.




Carlos Castillo and Ricardo Baeza-Yates                                     Center for Web Research
WIRE: an Open Source Web Information Retrieval Environment                     http://www.cwr.cl/
Outline                           WIRE Project                Web Crawler               Conclusions



Conclusions


           V A tool for Web characterization studies
           V Can be extended for other purposes
           V Code and documentation available at
             http://www.cwr.cl/projects/

                                                 Thank you.




Carlos Castillo and Ricardo Baeza-Yates                                     Center for Web Research
WIRE: an Open Source Web Information Retrieval Environment                     http://www.cwr.cl/
Outline                           WIRE Project                Web Crawler               Conclusions



Conclusions


           V A tool for Web characterization studies
           V Can be extended for other purposes
           V Code and documentation available at
             http://www.cwr.cl/projects/

                                                 Thank you.




Carlos Castillo and Ricardo Baeza-Yates                                     Center for Web Research
WIRE: an Open Source Web Information Retrieval Environment                     http://www.cwr.cl/

More Related Content

Similar to WIRE, an open-source information retrieval environment (OSWIR 2005 Compiegne)

Ceph used in Cancer Research at OICR
Ceph used in Cancer Research at OICRCeph used in Cancer Research at OICR
Ceph used in Cancer Research at OICRCeph Community
 
Time -Travel on the Internet
Time -Travel on the InternetTime -Travel on the Internet
Time -Travel on the InternetIRJET Journal
 
Invincea: Reasoning in Incident Response in Tapio
Invincea: Reasoning in Incident Response in TapioInvincea: Reasoning in Incident Response in Tapio
Invincea: Reasoning in Incident Response in TapioInvincea, Inc.
 
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedInDataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedInHakka Labs
 
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...Artefactual Systems - AtoM
 
Accelerating your Research with Microsoft Azure (June 2015)
Accelerating your Research with Microsoft Azure (June 2015)Accelerating your Research with Microsoft Azure (June 2015)
Accelerating your Research with Microsoft Azure (June 2015)Microsoft Azure for Research
 
GENI Engineering Conference -- Ian Foster
GENI Engineering Conference -- Ian FosterGENI Engineering Conference -- Ian Foster
GENI Engineering Conference -- Ian FosterIan Foster
 
The Web of data and web data commons
The Web of data and web data commonsThe Web of data and web data commons
The Web of data and web data commonsJesse Wang
 
Devoxx 2010 | LAB : ReST in Java
Devoxx 2010 | LAB : ReST in JavaDevoxx 2010 | LAB : ReST in Java
Devoxx 2010 | LAB : ReST in JavaNGDATA
 
Web archiving challenges and opportunities
Web archiving challenges and opportunitiesWeb archiving challenges and opportunities
Web archiving challenges and opportunitiesAhmed AlSum
 
Intro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWSIntro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWSSri Ambati
 
Browser-Based Digital Preservation
Browser-Based Digital PreservationBrowser-Based Digital Preservation
Browser-Based Digital PreservationMat Kelly
 
Building Satori: Web Data Extraction On Hadoop
Building Satori: Web Data Extraction On HadoopBuilding Satori: Web Data Extraction On Hadoop
Building Satori: Web Data Extraction On HadoopNikolai Avteniev
 
Case Study for Ego-centric Citation Network
Case Study for Ego-centric Citation NetworkCase Study for Ego-centric Citation Network
Case Study for Ego-centric Citation NetworkMike Taylor
 
Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)
Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)
Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)Blue BRIDGE
 
Descriptive Standards and Applications in Memory Institutions
Descriptive Standards and Applications in Memory InstitutionsDescriptive Standards and Applications in Memory Institutions
Descriptive Standards and Applications in Memory InstitutionsE. Murphy
 
Architectures for Data Commons (XLDB 15 Lightning Talk)
Architectures for Data Commons (XLDB 15 Lightning Talk)Architectures for Data Commons (XLDB 15 Lightning Talk)
Architectures for Data Commons (XLDB 15 Lightning Talk)Robert Grossman
 
2012 09 aos-workshop-johanneskeizer
2012 09 aos-workshop-johanneskeizer2012 09 aos-workshop-johanneskeizer
2012 09 aos-workshop-johanneskeizerJohannes Keizer
 

Similar to WIRE, an open-source information retrieval environment (OSWIR 2005 Compiegne) (20)

Ceph used in Cancer Research at OICR
Ceph used in Cancer Research at OICRCeph used in Cancer Research at OICR
Ceph used in Cancer Research at OICR
 
Time -Travel on the Internet
Time -Travel on the InternetTime -Travel on the Internet
Time -Travel on the Internet
 
Invincea: Reasoning in Incident Response in Tapio
Invincea: Reasoning in Incident Response in TapioInvincea: Reasoning in Incident Response in Tapio
Invincea: Reasoning in Incident Response in Tapio
 
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedInDataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
 
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...
 
Accelerating your Research with Microsoft Azure (June 2015)
Accelerating your Research with Microsoft Azure (June 2015)Accelerating your Research with Microsoft Azure (June 2015)
Accelerating your Research with Microsoft Azure (June 2015)
 
GENI Engineering Conference -- Ian Foster
GENI Engineering Conference -- Ian FosterGENI Engineering Conference -- Ian Foster
GENI Engineering Conference -- Ian Foster
 
The Web of data and web data commons
The Web of data and web data commonsThe Web of data and web data commons
The Web of data and web data commons
 
Devoxx 2010 | LAB : ReST in Java
Devoxx 2010 | LAB : ReST in JavaDevoxx 2010 | LAB : ReST in Java
Devoxx 2010 | LAB : ReST in Java
 
Web archiving challenges and opportunities
Web archiving challenges and opportunitiesWeb archiving challenges and opportunities
Web archiving challenges and opportunities
 
Intro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWSIntro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWS
 
Towards a Web of Services
Towards a Web of ServicesTowards a Web of Services
Towards a Web of Services
 
Browser-Based Digital Preservation
Browser-Based Digital PreservationBrowser-Based Digital Preservation
Browser-Based Digital Preservation
 
Building Satori: Web Data Extraction On Hadoop
Building Satori: Web Data Extraction On HadoopBuilding Satori: Web Data Extraction On Hadoop
Building Satori: Web Data Extraction On Hadoop
 
Case Study for Ego-centric Citation Network
Case Study for Ego-centric Citation NetworkCase Study for Ego-centric Citation Network
Case Study for Ego-centric Citation Network
 
Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)
Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)
Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)
 
Publishing Linked Data from RDB
Publishing Linked Data from RDBPublishing Linked Data from RDB
Publishing Linked Data from RDB
 
Descriptive Standards and Applications in Memory Institutions
Descriptive Standards and Applications in Memory InstitutionsDescriptive Standards and Applications in Memory Institutions
Descriptive Standards and Applications in Memory Institutions
 
Architectures for Data Commons (XLDB 15 Lightning Talk)
Architectures for Data Commons (XLDB 15 Lightning Talk)Architectures for Data Commons (XLDB 15 Lightning Talk)
Architectures for Data Commons (XLDB 15 Lightning Talk)
 
2012 09 aos-workshop-johanneskeizer
2012 09 aos-workshop-johanneskeizer2012 09 aos-workshop-johanneskeizer
2012 09 aos-workshop-johanneskeizer
 

More from Carlos Castillo (ChaTo)

Finding High Quality Content in Social Media
Finding High Quality Content in Social MediaFinding High Quality Content in Social Media
Finding High Quality Content in Social MediaCarlos Castillo (ChaTo)
 
Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017
Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017
Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017Carlos Castillo (ChaTo)
 
Detecting Algorithmic Bias (keynote at DIR 2016)
Detecting Algorithmic Bias (keynote at DIR 2016)Detecting Algorithmic Bias (keynote at DIR 2016)
Detecting Algorithmic Bias (keynote at DIR 2016)Carlos Castillo (ChaTo)
 

WIRE, an open-source information retrieval environment (OSWIR 2005 Compiegne)

  • 1. WIRE: an Open Source Web Information Retrieval Environment. Carlos Castillo and Ricardo Baeza-Yates, Center for Web Research (http://www.cwr.cl/), CS Dept., University of Chile. OSWIR 2005, Compiegne, France, September 19, 2005.
  • 2. Outline: (1) WIRE Project; (2) Web Crawler; (3) Conclusions.
  • 3. Motivation: study subsets of the Web (1-50 million pages). We want high performance; we want to keep as much data as possible; we want to study scheduling algorithms. However, wget is not enough, and large-scale crawlers were not publicly available.
  • 9. General Architecture [diagram; component labels: Crawling, Focused Crawling, Importing, Collection, Text Index, Text Search, XML Index, XML Search, Extracting, Statistics, Clustering, Classification]
  • 10. Characteristics: roughly 25,000 lines of open-source C/C++ code; asynchronous DNS and HTTP requests, with small memory and processing requirements (except during the analysis); highly configurable (rate of download, parser parameters, scheduling policy, etc.).
  • 13. Web Crawler [diagram; labels: Manager (page score calculations, long-term scheduling); Seeder; Harvester; Collection; link resolving; short-term scheduling; robots exclusions; network transfers; Gatherer (parsing, link extraction)]
  • 14. Scheduling: Future value - Current value = Profit. P1 (quality 0.4, freshness 0.1, visited): 0.4 - 0.04 = 0.36. P2 (quality 0.7, freshness 0.9, visited): 0.7 - 0.63 = 0.07. P3 (quality 0.6, no freshness, not visited): 0.6 - 0 = 0.6.
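The three profit figures on the scheduling slide are consistent with a simple rule: the future value of a page is its quality, its current value is quality times freshness (zero if the page was never visited), and the profit is the difference. The slide does not state this rule explicitly, so the following Python sketch is an assumption inferred from the numbers; the names `Page`, `profit`, and `schedule` are hypothetical and are not taken from the WIRE code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Page:
    name: str
    quality: float              # estimated quality score in [0, 1]
    freshness: Optional[float]  # None when the page was never downloaded
    visited: bool


def current_value(page: Page) -> float:
    # Assumption: a page that was never visited adds nothing to the collection.
    if not page.visited or page.freshness is None:
        return 0.0
    return page.quality * page.freshness


def future_value(page: Page) -> float:
    # Assumption: right after a download the local copy is fully fresh.
    return page.quality


def profit(page: Page) -> float:
    # Gain in collection value obtained by downloading the page now.
    return future_value(page) - current_value(page)


def schedule(pages: List[Page]) -> List[Page]:
    # Download the most profitable pages first.
    return sorted(pages, key=profit, reverse=True)


# The three pages shown on the slide.
pages = [
    Page("P1", quality=0.4, freshness=0.1, visited=True),
    Page("P2", quality=0.7, freshness=0.9, visited=True),
    Page("P3", quality=0.6, freshness=None, visited=False),
]
order = [p.name for p in schedule(pages)]
```

Under these assumptions the unvisited page P3 is scheduled first, then the stale page P1, and the still-fresh page P2 last.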
  • 15. Downloading pages [diagram: the World Wide Web as Web sites S1 to S7, each with its own column of Web pages Pi,j (page j of site i); the columns have different lengths]
  • 16. Storing contents [diagram, steps 1 to 3: Document, hash( ), "Content seen?" check, Disk Storage, Free space list]
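The "Content seen?" check on the storage slide can be illustrated with a hash of the document body. This is a minimal Python sketch, not the WIRE implementation: the choice of SHA-1, the class name `ContentStore`, and the in-memory dictionary are all assumptions, and the disk storage and free space list from the diagram are not modelled.

```python
import hashlib
from typing import Dict, Optional


class ContentStore:
    """Remembers a digest per stored document to detect exact duplicates."""

    def __init__(self) -> None:
        # digest -> ID of the first document seen with that content
        self.seen: Dict[str, int] = {}

    def add(self, doc_id: int, content: bytes) -> Optional[int]:
        """Return the ID of an earlier identical document, or None if new."""
        digest = hashlib.sha1(content).hexdigest()
        original = self.seen.get(digest)
        if original is not None:
            return original  # content seen before: do not store it again
        self.seen[digest] = doc_id
        return None


store = ContentStore()
first = store.add(1, b"<html>same body</html>")
second = store.add(2, b"<html>same body</html>")
third = store.add(3, b"<html>different body</html>")
```

Here `second` refers back to document 1, while `first` and `third` are `None` because their content had not been seen before.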
  • 17. URL parsing: http://host.domain.com/dir/file.html. (1) h1('host.domain.com') looks up the host name; (2) the entry for host.domain.com gives 235; (3) h2('235 dir/file.html') looks up the site number together with the path; (4) the matching entry gives 9421. Result: SITE-ID = 235; DOC-ID = 9421.
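The two-step lookup on the URL parsing slide (host name to site ID, then site ID plus path to document ID) can be sketched in Python with two dictionaries standing in for the hash tables h1 and h2. The class name `UrlIndex`, the sequential assignment of IDs, and the handling of query strings are assumptions made for illustration; the slide does not say how WIRE assigns IDs.

```python
from typing import Dict, Tuple
from urllib.parse import urlsplit


class UrlIndex:
    """Maps a URL to a (site ID, document ID) pair in two lookups."""

    def __init__(self) -> None:
        self.site_ids: Dict[str, int] = {}              # role of h1
        self.doc_ids: Dict[Tuple[int, str], int] = {}   # role of h2

    def resolve(self, url: str) -> Tuple[int, int]:
        parts = urlsplit(url)
        host = parts.hostname or ""   # urlsplit lowercases the host name
        path = parts.path.lstrip("/")
        if parts.query:
            path += "?" + parts.query
        # Steps 1-2: host name -> site ID (a new ID is assigned if unseen).
        site_id = self.site_ids.setdefault(host, len(self.site_ids) + 1)
        # Steps 3-4: (site ID, path) -> document ID.
        key = (site_id, path)
        doc_id = self.doc_ids.setdefault(key, len(self.doc_ids) + 1)
        return site_id, doc_id


index = UrlIndex()
a = index.resolve("http://host.domain.com/dir/file.html")
b = index.resolve("http://host.domain.com/dir/other.html")
c = index.resolve("http://HOST.domain.com/dir/file.html")
d = index.resolve("http://other.example/dir/file.html")
```

Pages on the same host share a site ID, and resolving the same URL twice returns the same pair, so the IDs can serve as compact keys elsewhere in a crawler.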
  • 18. Practical problems: the devil is in the details. Varying quality of service; wrong DNS records, temporary DNS failures; HTTP responses without headers, with wrong headers, dates; HTML parsing has to be very tolerant; duplicate pages, session-ids, etc.
  • 24. Data analysis: includes link analysis and extraction of statistics (data is exported as .csv files). Reports are generated using LaTeX and gnuplot. Report about documents: histograms of size, in- and out-degree, link scores, page depth, HTTP responses, age, media types, etc. Report about sites: degree distribution in the hostgraph, maximum depth, pages per site, link structure, etc.
  • 28. Conclusions: a tool for Web characterization studies; can be extended for other purposes; code and documentation available at http://www.cwr.cl/projects/. Thank you.