This document introduces web crawling and extraction. It begins with definitions and motivating examples, such as bookmarking sites, extracting business hours, and enabling vertical search. It then discusses desired properties of crawlers, such as speed, alongside constraints: politeness, distributed operation, linear scalability, even partitioning, and minimum overlap. The basic crawling algorithm is presented, followed by the challenges of large-scale crawls, including DNS lookups, tracking the URLs crawled, and politeness.
24-34. CONSTRAINTS
• Politeness (it's easy to burden small servers)
• Distributed (for any significant crawl)
• Linear Scalability (n machines = n*m pages-per-second)
• Even partitioning (every machine should perform equal work)
• Minimum overlap (crawl each page exactly once)
Wednesday, September 14, 2011
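The even-partitioning and minimum-overlap constraints are commonly met by assigning each host to exactly one crawler machine, for example by hashing the hostname. The deck gives no implementation; the sketch below is an illustrative Python rendition (function name and machine count are assumptions):

```python
import hashlib
from urllib.parse import urlparse

def assign_machine(url: str, n_machines: int) -> int:
    """Map a URL's host to one of n_machines crawler nodes.

    Hashing the hostname (not the full URL) keeps every page of a
    site on the same machine, which also makes per-host politeness
    (rate limiting) a purely local decision.
    """
    host = urlparse(url).netloc.lower()
    digest = hashlib.md5(host.encode()).hexdigest()
    return int(digest, 16) % n_machines

# Every URL on the same host lands on the same machine, so no two
# machines ever crawl the same page (minimum overlap), and a uniform
# hash spreads hosts evenly across machines (even partitioning).
```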
36-48. Initialize:
UrlsDone = {}
UrlFrontier = {'google.com/index.html', ..}
Repeat
url = UrlFrontier.getNext()
ip = DNSlookup(url.getHostname())
html = DownloadPage(ip, url.getPath())
UrlsDone.insert(url)
newUrls = parseForLinks(html)
For each newUrl
If not UrlsDone.contains(newUrl)
then UrlFrontier.insert(newUrl)
49. architecture overview
(diagram: CRAWL PLANNER -> URL QUEUE -> FETCHER -> STORAGE; the fetcher pulls URLs from the queue, downloads web data from the internet, and writes it to storage, while the crawl planner feeds new URLs into the queue)
62. challenges:
DNS Lookup
URLs Crawled
Politeness
URL Frontier
Queueing URLs
Extracting URLs
63. challenges:
DNS LOOKUP
65. challenges:
DNS LOOKUP
can easily be a bottleneck
66. challenges:
DNS LOOKUP
• consider running your own DNS servers
• djbdns
• PowerDNS
• etc.
67. challenges:
DNS LOOKUP
• be aware of software limitations
• gethostbyaddr is synchronized
• same with many “default” DNS clients
68. challenges:
DNS LOOKUP
You’ll know when you need it
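One way around a serialized resolver is to overlap the lookups. This is a minimal sketch, not from the talk; the `resolve_all` helper and the hostnames are illustrative, and the resolver is injected so the example runs without a network:

```python
import socket
from concurrent.futures import ThreadPoolExecutor

def resolve_all(hostnames, resolver=socket.gethostbyname, workers=32):
    """Resolve many hostnames concurrently. A synchronized resolver
    (as in many "default" DNS clients) would serialize these calls;
    a thread pool at least overlaps the waiting on I/O."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs them correctly.
        return dict(zip(hostnames, pool.map(resolver, hostnames)))

# Usage with a stubbed resolver (no network needed):
ips = resolve_all(["a.example", "b.example"], resolver=lambda h: "192.0.2.1")
```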
69. challenges:
URLs CRAWLED
76. challenges:
URLs CRAWLED
1 machine, store in memory
NAPKIN CALCULATION
~50 bytes per URL
e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations
+ 8 bytes for time-last-crawled
stored as a long, e.g. System.currentTimeMillis() -> 1314392455712
x 100 million URLs
=~ 5.4 gigabytes
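The napkin calculation above, checked in a few lines of Python (gigabytes here meaning GiB):

```python
URL_BYTES = 50            # ~50 bytes per URL string
TIMESTAMP_BYTES = 8       # time-last-crawled as a 64-bit long
N_URLS = 100_000_000

total_gib = (URL_BYTES + TIMESTAMP_BYTES) * N_URLS / 2**30
# total_gib is about 5.4
```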
77. can we do better?
84. BLOOM FILTERS
Have we crawled: http://www.xcombinator.com?
answers either:
• yes, probably
• definitely not
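A minimal bloom filter sketch in Python; the hash scheme (salted SHA-256) and the sizes below are illustrative, not prescriptive:

```python
import hashlib

class BloomFilter:
    """k hash positions per item in an m-bit array. Membership
    answers are 'yes, probably' or 'definitely not'."""
    def __init__(self, m_bits, k_hashes):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item):
        # Derive k positions by salting one cryptographic hash.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # All k bits set -> probably seen; any bit clear -> definitely not.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

bf = BloomFilter(m_bits=1 << 20, k_hashes=7)
bf.add("http://www.xcombinator.com")
```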
87. challenges:
URLs CRAWLED
1 machine, bloom filter
NAPKIN CALCULATION
100 million URLs
1 in 100 million chance of false positive
=~ 457 megabytes
see: http://hur.st/bloomfilter?n=100000000&p=1.0E-8
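The ~457 megabytes figure follows from the standard bloom filter sizing formula, m = -n·ln(p) / (ln 2)², checked below:

```python
import math

N_URLS = 100_000_000          # n
P_FALSE_POSITIVE = 1e-8       # p: 1 in 100 million

m_bits = -N_URLS * math.log(P_FALSE_POSITIVE) / math.log(2) ** 2
mib = m_bits / 8 / 2**20      # about 457 MiB
```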
123. benefits:
~ 1/(n+1) of URLs move on node add/remove
virtual nodes help with skew
robust (no single point of failure)
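A sketch of consistent hashing with virtual nodes, assuming a simple sorted-ring lookup; the node names and vnode count are illustrative:

```python
import bisect
import hashlib

def _h(key):
    """Hash a string to a 64-bit point on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

class HashRing:
    """Each physical node owns several virtual points on the ring,
    which evens out skew; a URL maps to the next point clockwise."""
    def __init__(self, nodes, vnodes=64):
        self.ring = sorted((_h(f"{n}#{v}"), n)
                           for n in nodes for v in range(vnodes))
        self.keys = [k for k, _ in self.ring]

    def node_for(self, url):
        i = bisect.bisect(self.keys, _h(url)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["crawler-1", "crawler-2", "crawler-3"])
owner = ring.node_for("http://example.com")
# Adding or removing one node remaps only ~1/(n+1) of the URLs.
```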
124. drawbacks:
naive solution won’t work for large sites
125. further reading:
Chord: A Scalable Peer-to-Peer Lookup Protocol for Internet Applications, Stoica et al., 2001
Dynamo: Amazon's Highly Available Key-value Store, DeCandia et al., SOSP 2007
Tapestry: A Resilient Global-Scale Overlay for Service Deployment, Zhao et al., 2004
126. challenges:
QUEUEING URLS
181. challenges:
EXTRACTING URLS
be prepared:
use a streaming XML parser
use a library that handles bad markup
be aware that URLs aren't ASCII
use a URL normalizer
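One way to cover those points with the standard library alone is sketched below: `html.parser` streams events and tolerates a fair amount of bad markup, and `urllib.parse` handles relative resolution and percent-encoding of non-ASCII characters. The `normalize` helper is illustrative; a production normalizer does more (default ports, dot segments, duplicate query keys, etc.):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit, urlunsplit, quote

class LinkExtractor(HTMLParser):
    """Streaming extraction of href attributes from (possibly bad) HTML."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def normalize(base, href):
    """Sketch of a normalizer: resolve relative URLs, lowercase the
    scheme and host, percent-encode non-ASCII path bytes, drop fragments."""
    s = urlsplit(urljoin(base, href))
    return urlunsplit((s.scheme.lower(), s.netloc.lower(),
                       quote(s.path, safe="/%"), s.query, ""))

parser = LinkExtractor()
# Note the non-ASCII path and the unquoted, unclosed second tag.
parser.feed('<a href="/Caf\u00e9">menu</a><a href=bad.html>oops')
urls = [normalize("http://Example.COM/dir/", h) for h in parser.links]
```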