Outline                    Introduction           Models             Experiments                 Summary




                                    Crawling the Infinite Web:
                                     Five Levels are Enough

                                 Ricardo Baeza-Yates and Carlos Castillo

                                           Center for Web Research
                                                 www.cwr.cl


                                               WAW 2004


R. Baeza-Yates and C. Castillo                                                     Center for Web Research
Crawling the Infinite Web




          1 Introduction


          2 Models


          3 Experiments


          4 Summary








Introduction



               Dynamic page: “a page which is created on request”
               Dynamic pages with links to other dynamic pages
               Malicious: loops and/or near-duplicates
               Legitimate: recommendation systems, calendars, iterative
               algorithms, etc.
               The number of pages on the Web can be considered infinite








Conflicting interests



               Web site administrator: would like to have the whole Web
               site indexed
               Search engine administrator: would like to use the available
               network and storage capacity efficiently
               Search engine user: would like to find what they are looking for








Our approach



               Users do not go very deep inside Web sites
               If something is important, it has to be easily reachable
               We will download only a few levels of each Web site
               How many levels?
               How much do you lose?
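
The idea of downloading only a few levels of each Web site can be sketched as a breadth-first crawl that stops at a fixed click distance from the home page. This is a minimal illustration, not the authors' crawler: `crawl_to_depth`, `calendar_links`, and the integer page identifiers are hypothetical names, and `get_links` stands in for fetching and parsing a real page.

```python
from collections import deque


def crawl_to_depth(get_links, home, max_depth):
    """Breadth-first crawl of one site, limited to max_depth clicks from home.

    get_links: function mapping a page to the pages it links to
               (hypothetical stand-in for downloading and parsing a page).
    Returns a dict page -> depth, the shortest click distance from home.
    """
    depth = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        if depth[page] == max_depth:
            continue  # deepest allowed level: do not follow its links
        for target in get_links(page):
            if target not in depth:  # first visit gives the shortest distance
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth


def calendar_links(page):
    """A calendar-like dynamic site: every page links to a 'next' page, forever."""
    return [page + 1]


# Even though the site is infinite, the crawl stops after levels 0..5.
pages = crawl_to_depth(calendar_links, home=0, max_depth=5)
```

Because the depth limit is checked before links are expanded, the crawl terminates even on a site with an unbounded number of dynamic pages.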








Models
Navigating a tree ≈ Moving through levels








Actions
Possible actions at a given level








Type of models we study



               There is a set of atomic actions
               $A = \{next, start/jump, back, stay, prev, fwd\}$
               $\Pr(action \mid \ell)$ is the probability of taking an action at level $\ell$
               $\sum_{action \in A} \Pr(action \mid \ell) = 1$
               The probability $\Pr(next \mid \ell)$ is constant
               Stationary distribution → how much time users spend at each
               level
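
To make the notion of a stationary distribution concrete, the sketch below computes it by power iteration for a small chain over levels. The three-level transition matrix is an invented toy example (move to the next level with probability 0.5, otherwise return to level 0, and always return to level 0 from the last level); it is not one of the fitted models, and `stationary_distribution` is a hypothetical helper name.

```python
def stationary_distribution(P, tol=1e-12, max_iter=100_000):
    """Approximate the stationary distribution of a row-stochastic matrix P.

    Repeatedly applies x <- x P starting from the uniform distribution and
    stops when successive iterates differ by less than tol in every entry.
    Assumes the chain is irreducible and aperiodic so the iteration converges.
    """
    n = len(P)
    x = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, x)) < tol:
            return nxt
        x = nxt
    return x


# Toy chain over levels 0, 1, 2 (illustrative values only).
TOY_P = [
    [0.5, 0.5, 0.0],  # level 0: restart at level 0, or go to level 1
    [0.5, 0.0, 0.5],  # level 1: back to level 0, or go to level 2
    [1.0, 0.0, 0.0],  # level 2: truncation, always back to level 0
]

time_per_level = stationary_distribution(TOY_P)
```

For this toy matrix the balance equations give the fractions 4/7, 2/7 and 1/7 for levels 0, 1 and 2, which is what the iteration converges to.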








Model A
Forwards and backwards one level at a time




                                          Birth and death process
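
One possible reading of Model A, stated here as an assumption rather than taken from the slides: from level $i$ the user moves to level $i+1$ with constant probability $q$ and back to level $i-1$ with probability $1-q$, staying at level 0 when there is no previous level. Writing $x_i$ for the stationary probability of level $i$, the balance between adjacent levels would then give:

```latex
\begin{aligned}
q\,x_i &= (1-q)\,x_{i+1}
  &&\Rightarrow\quad x_i = x_0 \left(\frac{q}{1-q}\right)^{i}, \\
\sum_{i \ge 0} x_i &= 1
  &&\Rightarrow\quad x_0 = 1 - \frac{q}{1-q} = \frac{1-2q}{1-q}
  \qquad \text{(requires } q < \tfrac{1}{2}\text{)}.
\end{aligned}
```

Under this reading the geometric series only converges for $q < 1/2$, so a stationary distribution exists only when going back is more likely than going forward.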








Model B
Back to first level




                                 Birth and death process with extinction
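
As a hedged sketch of how such a model yields a crawling depth, assume that in Model B the user moves to the next level with constant probability q and otherwise returns to level 0. Under that assumption the stationary probabilities satisfy x_i = q * x_(i-1) and sum to 1, so x_i = (1 - q) * q^i and the mass on levels 0..k is 1 - q^(k+1). The function names and the value q = 0.5 below are illustrative only.

```python
def model_b_level_probability(q, i):
    """Stationary probability of level i under the assumed Model B reading."""
    return (1 - q) * q ** i


def model_b_cumulative(q, k):
    """Probability mass on levels 0..k, i.e. 1 - q**(k + 1)."""
    return 1 - q ** (k + 1)


def levels_needed(q, coverage):
    """Smallest k such that levels 0..k hold at least `coverage` of the mass.

    Requires 0 < q < 1 and 0 <= coverage < 1, otherwise the loop never ends.
    """
    k = 0
    while model_b_cumulative(q, k) < coverage:
        k += 1
    return k


# Illustrative value only: with q = 0.5, how deep must a crawl go to
# cover 90% of the time users spend on the site?
depth_for_90_percent = levels_needed(0.5, 0.9)
```

With q = 0.5, levels 0..2 hold 0.875 of the mass and levels 0..3 hold 0.9375, so the function returns 3.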







Model C
Back to any previous level




                      Birth and death process with extinction and disaster?








Cumulative probability of levels 0 . . . k
Based on solutions given in the paper








Experiments




               Anonymized access logs for 13 Web sites
               Educational - Commercial - Reference - Organization - Blogs
               Analysis of access logs to extract ≈ 250,000 user sessions








Distribution of visits per level








Model fitting
          Code                Type             Country   Model      q         Error
           E1              Educational          Chile     B        0.51      0.88%
           E2              Educational          Spain     B        0.51      2.29%
           E3              Educational           US       B        0.64      0.72%
           C1              Commercial           Chile     B        0.55      0.39%
           C2              Commercial           Chile     B        0.62      5.17%
           R1               Reference           Chile     B        0.54      2.96%
           R2               Reference           Chile     B        0.59      2.75%
           O1             Organization          Italy     C        0.35      2.27%
           O2             Organization           US       B        0.62      2.31%
          OB1          Organization + Blog      Chile     B        0.65      2.07%
          OB2          Organization + Blog      Chile     B        0.72      0.35%
           B1                 Blog              Chile     C        0.79      0.88%
           B2                 Blog              Chile     C        0.63      1.01%




Observed distribution of transitions

          Level         Obs.          Next    Start      Jump    Back      Stay        Prev
            0         247985          0.457     –        0.527     –       0.008        –
            1         120482          0.459     –        0.332   0.185     0.017        –
            2          70911          0.462   0.111      0.235   0.171     0.014        –
            3          42311          0.497   0.065      0.186   0.159     0.017      0.069
            4          27129          0.514   0.057      0.157   0.171     0.009      0.088
            5          17544          0.549   0.048      0.138   0.143     0.009      0.108
            6         10296           0.555   0.037      0.133   0.155     0.009      0.106
            7          6326           0.596   0.033      0.135   0.113     0.006      0.113
            8          4200           0.637   0.024      0.104   0.127     0.006      0.096
            9          2782           0.663   0.015      0.108   0.113     0.006      0.089
           10           2089          0.662   0.037      0.084   0.120     0.005      0.086


          Pr(next) is not constant: if you have already spent some time in the Web site,
                              you are likely to spend some more
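
The "Obs." column of the table above can be turned into a cumulative share of observations per level. The sketch below uses only the eleven counts listed for levels 0 to 10, so the shares are relative to those listed levels (deeper levels are not shown in the table); the helper names are hypothetical.

```python
# Observations per level, copied from the "Obs." column (levels 0..10).
OBSERVED = [247985, 120482, 70911, 42311, 27129, 17544,
            10296, 6326, 4200, 2782, 2089]


def cumulative_share(counts):
    """Fraction of all listed observations that fall in levels 0..k, for each k."""
    total = sum(counts)
    shares = []
    running = 0
    for count in counts:
        running += count
        shares.append(running / total)
    return shares


def first_level_covering(counts, target):
    """Smallest level k whose cumulative share reaches `target`, or None."""
    for level, share in enumerate(cumulative_share(counts)):
        if share >= target:
            return level
    return None


level_for_90_percent = first_level_covering(OBSERVED, 0.9)
```

With these counts, levels 0..3 hold about 87% of the listed observations and levels 0..4 about 92%, so `level_for_90_percent` is 4.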





Pagerank and depth
Cumulative Pagerank by levels in the Chilean Web








Pagerank and depth
Correlation of Pagerank and depth is low at deeper levels








Summary



               90% of the visits are within 4-5 clicks of the home page,
               except in blogs
               Simple models try to explain this behavior
               In the paper: explicit methodology, closed-form solutions to the
               models, references








Open problems


               A model which better fits empirical data
               Analyzing blogs
               Analyzing the textual content of pages to decide when to stop
               Relationship of this with the spam detection problem
               Try adaptive strategies: which are the factors that affect the
               desired crawling depth in a Web site?
               There are other ways of defining which pages to download
               from an infinite set








          Questions and comments . . .




R. Baeza-Yates and C. Castillo                                   Center for Web Research
Crawling the Infinite Web

More Related Content

Viewers also liked

Political economy of Lisbon strategy
Political economy of Lisbon strategyPolitical economy of Lisbon strategy
Political economy of Lisbon strategy
gogrowth
 
At what Speed are EU-27 Member States Approaching the Lisbon Targets?
At what Speed are EU-27 Member States Approaching the Lisbon Targets?At what Speed are EU-27 Member States Approaching the Lisbon Targets?
At what Speed are EU-27 Member States Approaching the Lisbon Targets?
gogrowth
 
Mia's Sweeties 2nd try
Mia's Sweeties 2nd tryMia's Sweeties 2nd try
Mia's Sweeties 2nd try
j4cap
 
“It’s About Brains…”
“It’s About Brains…”“It’s About Brains…”
“It’s About Brains…”
gogrowth
 
Rummetomkroppen2
Rummetomkroppen2Rummetomkroppen2
Rummetomkroppen2drz
 
Tracking the Timetable to Lisbon
Tracking the Timetable to LisbonTracking the Timetable to Lisbon
Tracking the Timetable to Lisbon
gogrowth
 
Worldinside
WorldinsideWorldinside
Worldinside
drz
 
Mia's Sweeties
Mia's SweetiesMia's Sweeties
Mia's Sweetiesj4cap
 
Boca New High School Graphics
Boca  New  High  School  GraphicsBoca  New  High  School  Graphics
Boca New High School Graphics
bmahoney
 
Sambuichi Workshop Karch 2008
Sambuichi Workshop Karch 2008Sambuichi Workshop Karch 2008
Sambuichi Workshop Karch 2008
drz
 
Read 180
Read 180Read 180
Web Information Retrieval and Mining
Web Information Retrieval and MiningWeb Information Retrieval and Mining
Web Information Retrieval and Mining
Carlos Castillo (ChaTo)
 

Viewers also liked (13)

Political economy of Lisbon strategy
Political economy of Lisbon strategyPolitical economy of Lisbon strategy
Political economy of Lisbon strategy
 
Generalizing PageRank (Pisa)
Generalizing PageRank (Pisa)Generalizing PageRank (Pisa)
Generalizing PageRank (Pisa)
 
At what Speed are EU-27 Member States Approaching the Lisbon Targets?
At what Speed are EU-27 Member States Approaching the Lisbon Targets?At what Speed are EU-27 Member States Approaching the Lisbon Targets?
At what Speed are EU-27 Member States Approaching the Lisbon Targets?
 
Mia's Sweeties 2nd try
Mia's Sweeties 2nd tryMia's Sweeties 2nd try
Mia's Sweeties 2nd try
 
“It’s About Brains…”
“It’s About Brains…”“It’s About Brains…”
“It’s About Brains…”
 
Rummetomkroppen2
Rummetomkroppen2Rummetomkroppen2
Rummetomkroppen2
 
Tracking the Timetable to Lisbon
Tracking the Timetable to LisbonTracking the Timetable to Lisbon
Tracking the Timetable to Lisbon
 
Worldinside
WorldinsideWorldinside
Worldinside
 
Mia's Sweeties
Mia's SweetiesMia's Sweeties
Mia's Sweeties
 
Boca New High School Graphics
Boca  New  High  School  GraphicsBoca  New  High  School  Graphics
Boca New High School Graphics
 
Sambuichi Workshop Karch 2008
Sambuichi Workshop Karch 2008Sambuichi Workshop Karch 2008
Sambuichi Workshop Karch 2008
 
Read 180
Read 180Read 180
Read 180
 
Web Information Retrieval and Mining
Web Information Retrieval and MiningWeb Information Retrieval and Mining
Web Information Retrieval and Mining
 

Similar to Crawling the Infinite Web (WAW 2004 Rome)

IRJET-Multi -Stage Smart Deep Web Crawling Systems: A Review
IRJET-Multi -Stage Smart Deep Web Crawling Systems: A ReviewIRJET-Multi -Stage Smart Deep Web Crawling Systems: A Review
IRJET-Multi -Stage Smart Deep Web Crawling Systems: A Review
IRJET Journal
 
Chasing web-based malware
Chasing web-based malwareChasing web-based malware
Chasing web-based malwareFACE
 
Towards a Web of Services
Towards a Web of ServicesTowards a Web of Services
Towards a Web of Services
Carlos Pedrinaci
 
Oct 2014 Siteimprove Stockholm Accessibility Conference

Crawling the Infinite Web (WAW 2004 Rome)

  • 1. Outline Introduction Models Experiments Summary Crawling the Infinite Web: Five Levels are Enough Ricardo Baeza-Yates and Carlos Castillo Center for Web Research www.cwr.cl WAW 2004 R. Baeza-Yates and C. Castillo Center for Web Research Crawling the Infinite Web
  • 2. Outline: 1 Introduction, 2 Models, 3 Experiments, 4 Summary
  • 3. Introduction
      Dynamic page: “a page which is created on request”
      Dynamic pages with links to other dynamic pages
      Malicious: loops and/or near-duplicates
      Legitimate: recommendation systems, calendars, iterative algorithms, etc.
      The number of pages on the Web can be considered infinite
  • 8. Conflicting interests
      Web site administrator: would like to have the whole Web site indexed
      Search engine administrator: would like to use the available network and storage capacity efficiently
      Search engine user: would like to find what they are looking for
  • 11. Our approach
      Users do not go very deep inside Web sites
      If something is important, it has to be easily reachable
      We will download only a few levels of each Web site
      How many levels? How much do you lose?
  • 16. Models: navigating a tree ≈ moving through levels
  • 17. Actions: possible actions at a given level
  • 18. Type of models we study
      There is a set of atomic actions A = {next, start/jump, back, stay, prev, fwd}
      $\Pr(\text{action} \mid \ell)$ is the probability of taking an action at level $\ell$
      $\sum_{\text{action} \in A} \Pr(\text{action} \mid \ell) = 1$
      The probability $\Pr(\text{next} \mid \ell)$ is constant
      Stationary distribution → how much time users spend at each level
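The link between per-level action probabilities and the stationary distribution can be sketched numerically. The following is only an illustration under stated assumptions: the helper names are hypothetical, only the actions next, back, stay and start are handled (the slides do not spell out where prev, fwd or jump lead), the chain is truncated at a maximum level, and the probabilities used are made-up numbers, not values from the paper.

```python
from typing import Callable, Dict, List

# Hypothetical type alias: maps an action name to Pr(action | level).
ActionProbs = Dict[str, float]


def transition_row(level: int, probs: ActionProbs, max_level: int) -> List[float]:
    """Turn Pr(action | level) into one row of a level-to-level transition matrix."""
    if abs(sum(probs.values()) - 1.0) > 1e-9:
        raise ValueError("action probabilities must sum to 1")
    row = [0.0] * (max_level + 1)
    targets = {
        "next": min(level + 1, max_level),  # truncation: 'next' at the deepest level stays there
        "back": max(level - 1, 0),          # assumption: 'back' at level 0 stays at level 0
        "stay": level,
        "start": 0,
    }
    for action, p in probs.items():
        row[targets[action]] += p
    return row


def stationary(matrix: List[List[float]], iterations: int = 2000) -> List[float]:
    """Power iteration: repeatedly apply x <- x P, starting with all mass on level 0."""
    n = len(matrix)
    x = [1.0] + [0.0] * (n - 1)
    for _ in range(iterations):
        x = [sum(x[i] * matrix[i][j] for i in range(n)) for j in range(n)]
    return x


def level_distribution(probs_at: Callable[[int], ActionProbs], max_level: int) -> List[float]:
    """Long-run fraction of time spent at each level 0..max_level."""
    matrix = [transition_row(l, probs_at(l), max_level) for l in range(max_level + 1)]
    return stationary(matrix)


# Illustrative probabilities only, identical at every level.
example = level_distribution(
    lambda level: {"next": 0.5, "back": 0.2, "stay": 0.1, "start": 0.2},
    max_level=20,
)
```

Each entry of `example` approximates the share of time spent at that level; summing the first few entries gives the share of visits a crawler would cover by stopping at that depth.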
  • 24. Model A
      Forwards and backwards one level at a time
      Birth and death process
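For a birth and death process of this kind, the stationary distribution has a simple geometric form. The sketch below assumes a reading that the slide does not state explicitly: 'next' has constant probability q, 'back' has probability 1 - q, and a user at level 0 who goes 'back' remains at level 0. Under that assumption the balance condition x_l * q = x_{l+1} * (1 - q) gives x_l = (1 - rho) * rho^l with rho = q / (1 - q), which only converges for q < 1/2. Function names are hypothetical.

```python
from typing import List


def model_a_distribution(q: float, max_level: int) -> List[float]:
    """Stationary probabilities x_0..x_max_level of the assumed birth-death chain."""
    if not 0.0 < q < 0.5:
        raise ValueError("a stationary distribution only exists for 0 < q < 1/2")
    rho = q / (1.0 - q)  # ratio between consecutive levels
    return [(1.0 - rho) * rho**level for level in range(max_level + 1)]


def cumulative(dist: List[float], k: int) -> float:
    """Probability mass on levels 0..k."""
    return sum(dist[: k + 1])


dist_a = model_a_distribution(q=0.4, max_level=50)
coverage_a = cumulative(dist_a, 4)  # mass on levels 0..4, equal to 1 - (2/3)**5
```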
  • 26. Model B
      Back to first level
      Birth and death process with extinction
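A minimal sketch of this model, assuming that from any level the user goes one level deeper with constant probability q and otherwise returns to level 0 (the slide only says "back to first level"; the exact transition rule is an assumption here). The balance equations then give x_0 = 1 - q and x_{l+1} = q * x_l, so x_l = (1 - q) * q^l. The function names are hypothetical, and `one_step` is included only to check that the formula is a fixed point of the assumed chain.

```python
from typing import List


def model_b_distribution(q: float, max_level: int) -> List[float]:
    """Stationary probabilities x_l = (1 - q) * q**l for levels 0..max_level."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must be strictly between 0 and 1")
    return [(1.0 - q) * q**level for level in range(max_level + 1)]


def one_step(dist: List[float], q: float) -> List[float]:
    """Apply one transition of the assumed chain to a truncated distribution.

    Mass that would move beyond the deepest stored level is dropped, so the
    result is only approximate when the tail is not negligible.
    """
    nxt = [0.0] * len(dist)
    nxt[0] = (1.0 - q) * sum(dist)        # everyone who does not go deeper restarts
    for level in range(len(dist) - 1):
        nxt[level + 1] = q * dist[level]  # everyone who goes deeper moves one level
    return nxt


dist_b = model_b_distribution(q=0.6, max_level=60)
drift = max(abs(a - b) for a, b in zip(dist_b, one_step(dist_b, 0.6)))
```

A `drift` close to zero indicates that the closed form is stationary for the chain as assumed above.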
  • 28. Model C
      Back to any previous level
      Birth and death process with extinction and disaster?
  • 29. Cumulative probability of levels 0, ..., k
      Based on solutions given in the paper
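To connect a cumulative probability to a crawling depth, one can ask for the smallest k whose levels 0, ..., k cover a target share of visits. The sketch below assumes a geometric stationary distribution, so that the cumulative probability is 1 - ratio^(k + 1); the parameter name `ratio` and the function name are hypothetical, and the closed forms in the paper may differ for some models.

```python
def levels_needed(ratio: float, target: float) -> int:
    """Smallest k such that levels 0..k hold at least `target` of the mass,
    assuming the cumulative mass is 1 - ratio**(k + 1)."""
    if not 0.0 < ratio < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("ratio and target must lie strictly between 0 and 1")
    k = 0
    while 1.0 - ratio ** (k + 1) < target:
        k += 1
    return k


# Depth required for 90% coverage at a few illustrative ratios.
depth_for_90 = {ratio: levels_needed(ratio, 0.90) for ratio in (0.5, 0.6, 0.7)}
```

With a ratio of 0.6 the function returns 4, meaning levels 0 through 4, which is five levels in total.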
  • 30. Experiments
      Anonymized access logs for 13 Web sites
      Educational - Commercial - Reference - Organization - Blogs
      Analysis of access logs to extract ≈ 250,000 user sessions
  • 33. Distribution of visits per level
  • 34. Model fitting

      Code  Type                 Country  Model  q     Error
      E1    Educational          Chile    B      0.51  0.88%
      E2    Educational          Spain    B      0.51  2.29%
      E3    Educational          US       B      0.64  0.72%
      C1    Commercial           Chile    B      0.55  0.39%
      C2    Commercial           Chile    B      0.62  5.17%
      R1    Reference            Chile    B      0.54  2.96%
      R2    Reference            Chile    B      0.59  2.75%
      O1    Organization         Italy    C      0.35  2.27%
      O2    Organization         US       B      0.62  2.31%
      OB1   Organization + Blog  Chile    B      0.65  2.07%
      OB2   Organization + Blog  Chile    B      0.72  0.35%
      B1    Blog                 Chile    C      0.79  0.88%
      B2    Blog                 Chile    C      0.63  1.01%
  • 36. Observed distribution of transitions

      Level  Obs.    Next   Start  Jump   Back   Stay   Prev
      0      247985  0.457  –      0.527  –      0.008  –
      1      120482  0.459  –      0.332  0.185  0.017  –
      2      70911   0.462  0.111  0.235  0.171  0.014  –
      3      42311   0.497  0.065  0.186  0.159  0.017  0.069
      4      27129   0.514  0.057  0.157  0.171  0.009  0.088
      5      17544   0.549  0.048  0.138  0.143  0.009  0.108
      6      10296   0.555  0.037  0.133  0.155  0.009  0.106
      7      6326    0.596  0.033  0.135  0.113  0.006  0.113
      8      4200    0.637  0.024  0.104  0.127  0.006  0.096
      9      2782    0.663  0.015  0.108  0.113  0.006  0.089
      10     2089    0.662  0.037  0.084  0.120  0.005  0.086

      Pr(next) is not constant: if you have spent some time in the Web site, then you can spend some more
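The observation counts in this table can be turned into a cumulative share of visits per level, which is one way to see how much a depth-limited crawl would cover. This is a sketch using only the "Obs." and "Next" columns as transcribed from the slide; variable names are hypothetical. Because the table stops at level 10, the shares are relative to levels 0 to 10 only and slightly overstate the true coverage.

```python
from itertools import accumulate

# Columns "Obs." and "Next" from the table of observed transitions, levels 0..10.
observations = [247985, 120482, 70911, 42311, 27129, 17544, 10296, 6326, 4200, 2782, 2089]
pr_next = [0.457, 0.459, 0.462, 0.497, 0.514, 0.549, 0.555, 0.596, 0.637, 0.663, 0.662]

total = sum(observations)
# cumulative_share[k] is the fraction of observed page views at levels 0..k.
cumulative_share = [running / total for running in accumulate(observations)]

for level, (share, p) in enumerate(zip(cumulative_share, pr_next)):
    print(f"level {level:2d}: cumulative share {share:.3f}, Pr(next) {p:.3f}")
```

On these numbers, levels 0 to 4 account for roughly 92% of the observations, and the observed Pr(next) is higher at level 10 than at level 0.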
  • 37. PageRank and depth
        [Figure: cumulative PageRank by levels in the Chilean Web]
  • 38. PageRank and depth
        Correlation of PageRank and depth is low at deeper levels
  • 39. Summary
        90% of the visits are within 4-5 clicks of the home page, except in blogs
        Simple models try to explain this behavior
        In the paper: explicit methodology, closed-form solutions to the models, references
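        One way to act on the depth observation above is to bound a crawl by click distance from the start page. The following is a minimal sketch of a breadth-first, depth-limited crawl over an in-memory link function; it is not the authors' crawler. The names `crawl_to_depth`, `links_of` and `calendar_links`, the `/calendar/N` URL scheme, and the default limit of 5 are all assumptions made for illustration.

```python
from collections import deque
from typing import Callable, Dict, Iterable


def crawl_to_depth(
    start: str,
    links_of: Callable[[str], Iterable[str]],
    max_depth: int = 5,
) -> Dict[str, int]:
    """Breadth-first crawl; returns {page: click distance from start}.

    Pages at distance max_depth are recorded but their links are not followed.
    """
    depth = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if depth[page] >= max_depth:
            continue  # depth limit reached: do not expand this page
        for target in links_of(page):
            if target not in depth:  # first visit gives the shortest distance
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth


def calendar_links(page: str):
    """Hypothetical infinite dynamic site: every month links to the next one."""
    month = int(page.rsplit("/", 1)[1])
    return [f"/calendar/{month + 1}"]


if __name__ == "__main__":
    pages = crawl_to_depth("/calendar/0", calendar_links, max_depth=5)
    for url, d in pages.items():
        print(d, url)
```

        Without the depth limit this crawl would never terminate on the calendar site; with it, the crawl stops after a fixed number of levels regardless of how many pages the site can generate.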
  • 42. Open problems
        A model which better fits empirical data
        Analyzing blogs
        Analyzing the textual content of pages to decide when to stop
        Relationship of this with the spam detection problem
        Try adaptive strategies: which are the factors that affect the desired crawling depth in a Web site?
        There are other ways of defining which pages to download from an infinite set
  • 48. Questions and comments