This talk discusses crawling the infinite Web by analyzing how users navigate Web sites. It models user navigation as a set of actions taken at each level of a site and studies the probability of each action. The goal is to determine how many levels of a site a search engine needs to crawl to index the most important content without wasting resources on deeper levels that users are unlikely to reach. Access logs from 13 Web sites (about 250,000 user sessions) are used to fit the models and to estimate how deep a crawler needs to go.
Crawling the Infinite Web (WAW 2004 Rome)
1. Outline Introduction Models Experiments Summary
Crawling the Infinite Web:
Five Levels are Enough
Ricardo Baeza-Yates and Carlos Castillo
Center for Web Research
www.cwr.cl
WAW 2004
R. Baeza-Yates and C. Castillo Center for Web Research
Crawling the Infinite Web
2. Outline
1 Introduction
2 Models
3 Experiments
4 Summary
3. Introduction
Dynamic page: “a page which is created on request”
Dynamic pages with links to other dynamic pages
Malicious: loops and/or near-duplicates
Legitimate: recommendation systems, calendars, iterative algorithms, etc.
The number of pages on the Web can be considered infinite
8. Conflicting interests
Web site administrator: would like to have all of the Web site indexed
Search engine administrator: would like to use the available network and storage capacity efficiently
Search engine user: would like to find what he is looking for
11. Our approach
Users do not go very deep inside Web sites
If something is important, it has to be easily reachable
We will download only a few levels of each Web site
How many levels?
How much do you lose?
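The idea of downloading only a few levels of each site can be sketched as a depth-limited breadth-first crawl. This is a minimal illustration, not the crawler used by the authors: the function `crawl`, the toy site `calendar_links`, and its URLs are hypothetical, and the links come from an in-memory function rather than from real HTTP requests.

```python
from collections import deque

def crawl(start, get_links, max_depth):
    """Breadth-first crawl that never goes more than max_depth clicks from start."""
    depth = {start: 0}          # page -> number of clicks from the start page
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if depth[page] == max_depth:
            continue            # pages at the depth limit are kept but not expanded
        for link in get_links(page):
            if link not in depth:
                depth[link] = depth[page] + 1
                queue.append(link)
    return depth

def calendar_links(page):
    """Hypothetical dynamic site: every calendar page links to a 'next' page, forever."""
    if page == "/":
        return ["/about", "/calendar/0"]
    if page.startswith("/calendar/"):
        n = int(page.rsplit("/", 1)[1])
        return [f"/calendar/{n + 1}", "/"]
    return ["/"]

# The site has infinitely many pages, but the crawl stops after five levels.
pages = crawl("/", calendar_links, max_depth=5)
```

Without the depth limit this crawl would never terminate on the toy calendar site, which is the situation the talk describes as an infinite Web.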
16. Models
Navigating a tree ≈ Moving through levels
17. Actions
Possible actions at a given level
18. Type of models we study
There is a set of atomic actions
$A = \{\text{next}, \text{start/jump}, \text{back}, \text{stay}, \text{prev}, \text{fwd}\}$
$\Pr(\text{action} \mid \ell)$ is the probability of taking an action at level $\ell$
$\sum_{\text{action} \in A} \Pr(\text{action} \mid \ell) = 1$
The probability $\Pr(\text{next} \mid \ell)$ is constant
Stationary distribution → how much time users spend at each level
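To make the link between per-level action probabilities and the stationary distribution concrete, here is a small sketch. The probabilities below are invented for illustration, only three levels are kept, and the mapping of actions to target levels (next goes one level deeper, start returns to level 0, stay remains in place) is an assumption of this example rather than a definition taken from the slides.

```python
def stationary(P, tol=1e-12, max_iter=100_000):
    """Power iteration: apply x <- x P until the vector stops changing."""
    n = len(P)
    x = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, x)) < tol:
            return nxt
        x = nxt
    return x

# Hypothetical action probabilities per level; each dictionary sums to 1.
actions = [
    {"next": 0.5, "stay": 0.5},                # level 0
    {"next": 0.5, "start": 0.4, "stay": 0.1},  # level 1
    {"start": 0.9, "stay": 0.1},               # level 2, the deepest level kept
]

def to_matrix(actions):
    """Build the level-to-level transition matrix from the action probabilities."""
    n = len(actions)
    P = [[0.0] * n for _ in range(n)]
    for level, probs in enumerate(actions):
        assert abs(sum(probs.values()) - 1.0) < 1e-9  # probabilities must sum to 1
        for action, p in probs.items():
            target = {"next": level + 1, "start": 0, "stay": level}[action]
            P[level][target] += p
    return P

P = to_matrix(actions)
x = stationary(P)  # x[l] estimates the share of time spent at level l
```

For these made-up numbers the shares decrease with depth, which is the qualitative behavior the models are meant to capture.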
24. Model A
Forwards and backwards one level at a time
Birth and death process
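One way to make the birth-and-death reading concrete is the worked derivation below. It is a sketch under assumptions the slide does not state, not a formula quoted from the paper: the user moves forward with constant probability $q$, moves back one level with probability $1 - q$, and a backward move at level 0 leaves the user at level 0. The symbol $x_\ell$ (stationary probability of level $\ell$) is introduced here for illustration.

```latex
\begin{align*}
q\,x_\ell &= (1-q)\,x_{\ell+1}
  && \text{(balance of flow between levels $\ell$ and $\ell+1$)} \\
x_\ell &= x_0 \left(\frac{q}{1-q}\right)^{\ell} \\
\sum_{\ell \ge 0} x_\ell = 1 \;&\Rightarrow\; x_0 = \frac{1-2q}{1-q}
  && \text{(requires $q < \tfrac{1}{2}$)} \\
\sum_{\ell=0}^{k} x_\ell &= 1 - \left(\frac{q}{1-q}\right)^{k+1}
\end{align*}
```

Under these assumptions a stationary distribution exists only when users are more likely to go back than to go forward.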
26. Model B
Back to first level
Birth and death process with extinction
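A worked version of this model, again as a sketch under assumptions that are not stated on the slide: the user moves forward with constant probability $q$ and otherwise returns to level 0 with probability $1 - q$ (from level 0, this means staying at level 0). The symbol $x_\ell$ denotes the stationary probability of level $\ell$ and is introduced here for illustration.

```latex
\begin{align*}
x_{\ell+1} &= q\,x_\ell
  && \text{(level $\ell+1$ is reached only from level $\ell$)} \\
x_0 &= (1-q) \sum_{\ell \ge 0} x_\ell = 1 - q \\
x_\ell &= (1-q)\,q^{\ell} \\
\sum_{\ell=0}^{k} x_\ell &= 1 - q^{k+1}
\end{align*}
```

For example, with $q = 0.5$ the levels $0, \ldots, 4$ hold $1 - 0.5^{5} \approx 0.97$ of the stationary mass.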
28. Model C
Back to any previous level
Birth and death process with extinction and disaster?
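For this model a candidate stationary distribution can be checked numerically against the balance equations. The sketch below assumes one particular reading of the slide that is not confirmed by the source: from level $\ell$ the user goes forward with probability $q$, and otherwise lands uniformly on one of the levels $0, \ldots, \ell$ (current level included). The candidate $x_\ell = (1-q)^2 (\ell+1) q^{\ell}$ and both function names are part of this illustration, not quoted from the paper.

```python
def model_c_candidate(q, max_level):
    """Candidate stationary mass x_l = (1-q)^2 (l+1) q^l for l = 0..max_level."""
    return [(1 - q) ** 2 * (l + 1) * q ** l for l in range(max_level + 1)]

def balance_residual(x, q):
    """Largest violation of x_i = q x_{i-1} + sum_{j>=i} x_j (1-q)/(j+1).

    The sum over j is truncated at the last level in x, so x should be long
    enough for the neglected tail to be negligible.
    """
    n = len(x)
    # suffix[i] = sum over j >= i of x_j * (1 - q) / (j + 1)
    suffix = [0.0] * (n + 1)
    for j in range(n - 1, -1, -1):
        suffix[j] = suffix[j + 1] + x[j] * (1 - q) / (j + 1)
    worst = 0.0
    for i in range(n):
        inflow = suffix[i] + (q * x[i - 1] if i > 0 else 0.0)
        worst = max(worst, abs(inflow - x[i]))
    return worst

q = 0.35                                      # illustrative value
x = model_c_candidate(q, max_level=80)
residual = balance_residual(x, q)             # close to zero if the candidate fits

# A purely geometric candidate does not satisfy the same equations.
geometric = [(1 - q) * q ** l for l in range(81)]
geometric_residual = balance_residual(geometric, q)
```

Checking a residual rather than solving the chain keeps the example short; a different reading of "back to any previous level" would need different balance equations.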
29. Cumulative probability of levels 0, ..., k
Based on solutions given in the paper
30. Experiments
Anonymized access logs for 13 Web sites
Educational - Commercial - Reference - Organization - Blogs
Analysis of access logs to extract ≈ 250,000 user sessions
33. Distribution of visits per level
34. Model fitting
Code  Type                 Country  Model  q     Error
E1    Educational          Chile    B      0.51  0.88%
E2    Educational          Spain    B      0.51  2.29%
E3    Educational          US       B      0.64  0.72%
C1    Commercial           Chile    B      0.55  0.39%
C2    Commercial           Chile    B      0.62  5.17%
R1    Reference            Chile    B      0.54  2.96%
R2    Reference            Chile    B      0.59  2.75%
O1    Organization         Italy    C      0.35  2.27%
O2    Organization         US       B      0.62  2.31%
OB1   Organization + Blog  Chile    B      0.65  2.07%
OB2   Organization + Blog  Chile    B      0.72  0.35%
B1    Blog                 Chile    C      0.79  0.88%
B2    Blog                 Chile    C      0.63  1.01%
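The fitted values can be turned into a rough crawling depth per site. The sketch below copies the Code, Model and q columns from the table and asks how many levels are needed to hold 90% of the stationary mass. It relies on two assumptions that are not stated in the slides: that q is the constant probability of the next action, and that the level masses are $(1-q)q^{\ell}$ for Model B and $(1-q)^2(\ell+1)q^{\ell}$ for Model C. Both forms come from one plausible reading of the models and are not quoted from the paper; the function names are hypothetical.

```python
# (code, model, q) copied from the model-fitting table
FITS = [
    ("E1", "B", 0.51), ("E2", "B", 0.51), ("E3", "B", 0.64),
    ("C1", "B", 0.55), ("C2", "B", 0.62),
    ("R1", "B", 0.54), ("R2", "B", 0.59),
    ("O1", "C", 0.35), ("O2", "B", 0.62),
    ("OB1", "B", 0.65), ("OB2", "B", 0.72),
    ("B1", "C", 0.79), ("B2", "C", 0.63),
]

def level_mass(model, q, level):
    """Assumed stationary mass of one level (see the caveats above)."""
    if model == "B":
        return (1 - q) * q ** level
    if model == "C":
        return (1 - q) ** 2 * (level + 1) * q ** level
    raise ValueError(f"unsupported model: {model}")

def deepest_level_needed(model, q, target=0.9):
    """Smallest k such that levels 0..k hold at least `target` of the mass."""
    if not 0 < q < 1:
        raise ValueError("q must lie strictly between 0 and 1")
    total, level = 0.0, 0
    while True:
        total += level_mass(model, q, level)
        if total >= target:
            return level
        level += 1

depths = {code: deepest_level_needed(model, q) for code, model, q in FITS}
```

Larger values of q push the required depth up, so under these assumptions the sites with the highest q need the deepest crawl.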
36. Observed distribution of transitions
Level  Obs.    Next   Start  Jump   Back   Stay   Prev
0      247985  0.457  –      0.527  –      0.008  –
1      120482  0.459  –      0.332  0.185  0.017  –
2      70911   0.462  0.111  0.235  0.171  0.014  –
3      42311   0.497  0.065  0.186  0.159  0.017  0.069
4      27129   0.514  0.057  0.157  0.171  0.009  0.088
5      17544   0.549  0.048  0.138  0.143  0.009  0.108
6      10296   0.555  0.037  0.133  0.155  0.009  0.106
7      6326    0.596  0.033  0.135  0.113  0.006  0.113
8      4200    0.637  0.024  0.104  0.127  0.006  0.096
9      2782    0.663  0.015  0.108  0.113  0.006  0.089
10     2089    0.662  0.037  0.084  0.120  0.005  0.086
Pr(next) is not constant: if you have already spent some time in the Web site, you are likely to spend some more
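The "Obs." column can be used to see how quickly the observations concentrate in the first few levels. The sketch below computes the cumulative share by level. It treats the number of observations at a level as a proxy for visits to that level, and it only covers levels 0 to 10 because those are the rows shown, so the shares are relative to this truncated table; both points are assumptions of the example.

```python
from itertools import accumulate

# "Obs." column of the transitions table, levels 0..10
OBSERVATIONS = [247985, 120482, 70911, 42311, 27129,
                17544, 10296, 6326, 4200, 2782, 2089]

total = sum(OBSERVATIONS)
# cumulative_share[k] = fraction of observations at levels 0..k
cumulative_share = [c / total for c in accumulate(OBSERVATIONS)]

def first_level_covering(shares, target):
    """Index of the first level whose cumulative share reaches `target`."""
    for level, share in enumerate(shares):
        if share >= target:
            return level
    return None

level_90 = first_level_covering(cumulative_share, 0.90)
```

On these rows the 90% mark is reached at level 4, with levels 0 to 3 holding about 87% of the observations.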
37. PageRank and depth
Cumulative PageRank by levels in the Chilean Web
38. PageRank and depth
Correlation of PageRank and depth is low at deeper levels
39. Summary
90% of the visits are within 4-5 clicks of the home page, except in blogs
Simple models try to explain this behavior
In the paper: explicit methodology, closed-form solutions to the models, references
42. Open problems
A model which better fits empirical data
Analyzing blogs
Analyzing the textual content of pages to decide when to stop
Relationship of this with the spam detection problem
Try adaptive strategies: which factors affect the desired crawling depth in a Web site?
There are other ways of defining which pages to download from an infinite set
48. Questions and comments...