Intelligent web crawling
Denis Shestakov, Aalto University
Slides for tutorial given at WI-IAT'13 in Atlanta, USA on November 20th, 2013
Outline:
- overview of web crawling;
- intelligent web crawling;
- open challenges
Abstract:
Web crawling, a process of collecting web pages in an automated manner, is the primary and ubiquitous operation used by a large number of web systems and agents, from a simple website-backup program to a major web search engine. Due to the astronomical amount of data already published on the Web and the ongoing exponential growth of web content, any party that wants to take advantage of massive-scale web data faces a high barrier to entry. We start with background on web crawling and the structure of the Web, then discuss different crawling strategies and adaptive web crawling techniques that lead to better overall crawl performance, and finally overview open challenges such as collaborative web crawling, crawling the deep Web and crawling multimedia content.
Slides: http://www.slideshare.net/denshe/intelligent-crawling-shestakovwiiat13
To cite this tutorial:
D. Shestakov, "Intelligent Web Crawling," IEEE Intelligent Informatics Bulletin, 14(1), pp. 5-7, 2013.
A related tutorial was given at ICWE'13, Aalborg, Denmark on 08.07.2013; for that version, please refer to http://dx.doi.org/10.1007/978-3-642-39200-9_49
Transcript:
1. INTELLIGENT WEB CRAWLING
WI-IAT 2013 Tutorial, Atlanta, USA, 20.11.2013
ver 1.8: 10.04.2015
Denis Shestakov
denshe at gmail
Department of Media Technology, Aalto University, Finland
2. References to this tutorial
To cite, please use:
D. Shestakov, "Intelligent Web Crawling," IEEE Intelligent Informatics Bulletin, 14(1), pp. 5-7, 2013.
3. Speaker's Bio
- Postdoc (2009-2013) in Web Services Group, Aalto University, Finland
- PhD thesis (2008) on limited coverage of web crawlers
- Over ten years of experience in the area
- Tutorials on web crawling given at SAC'12 and ICWE'13
[Photo: Web Services Group in 2011]
4. Speaker's Info
As of 2013:
- http://www.linkedin.com/in/dshestakov
- http://www.mendeley.com/profiles/denis-shestakov/
- http://www.researchgate.net/profile/Denis_Shestakov
Current:
- https://mediatech.aalto.fi/~denis/
5. TUTORIAL OUTLINE
I. OVERVIEW
Web crawling in a nutshell
Web crawling applications
Web size and web link structure
II. INTELLIGENT WEB CRAWLING
Architecture of web crawler
Crawling strategies
Adaptive crawling approaches
III. OPEN CHALLENGES
Crawlers in Web ecosystem
Collaborative web crawling
Deep Web crawling
Crawling multimedia content
6. Links to Tutorial
Slides:
http://goo.gl/woVtQk
http://www.slideshare.net/denshe/presentations
Similar tutorials:
Tutorials on web crawling at ICWE’13 and SAC’12
Compared with this tutorial, they give a broader overview of the topic (Parts I and III) but do not cover crawling strategies (Part II)
Supporting materials:
http://www.mendeley.com/groups/531771/web-crawling/
7. PART I: OVERVIEW
Visualization of http://media.tkk.fi/webservices by aharef.info applet
8. Outline of Part I
Overview of Web Crawling
Web crawling in a nutshell
Web crawling applications
Web size and web link structure
9. Web Crawling in a Nutshell
Automatic harvesting of web content
Done by web crawlers (also known as robots, bots or spiders)
Follow a link from a set of links (the URL queue), download the page, extract all links, eliminate the already visited ones, add the rest to the queue
Then repeat
A set of policies is involved (like 'ignore links to images', etc.)
10. Web Crawling in a Nutshell
Example:
1. Follow http://media.tkk.fi/webservices (visualization of its HTML DOM tree below)
2. Extract URLs inside blue bubbles (designating <a> tags)
3. Remove already visited URLs
4. For each non-visited URL, start at Step 1
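(Not part of the original slides.) A minimal sketch of this naive crawl cycle in Python, assuming the third-party requests and beautifulsoup4 packages; politeness, robots.txt and most error handling, which later slides discuss, are deliberately ignored:

```python
import urllib.parse
from collections import deque

import requests
from bs4 import BeautifulSoup

def naive_crawl(seed_urls, max_pages=100):
    """Naive crawl cycle: fetch a URL, extract links, skip visited, repeat."""
    frontier = deque(seed_urls)              # URL queue
    visited = set(seed_urls)                 # already-visited elimination
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue                         # skip unreachable pages
        pages[url] = response.text
        soup = BeautifulSoup(response.text, "html.parser")
        for a in soup.find_all("a", href=True):          # links in <a> tags
            link = urllib.parse.urljoin(url, a["href"])  # resolve relative URLs
            if link.startswith("http") and link not in visited:
                visited.add(link)
                frontier.append(link)        # add the rest to the queue
    return pages
```

Because the frontier is a FIFO queue, this sketch is a breadth-first crawl, a strategy Part II returns to.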
11. Web Crawling in a Nutshell
In essence: a simple and naive process
However, a number of 'restrictions' imposed make it much more complicated
Most complexities are due to the operating environment (the Web)
For example, do not overload web servers (challenging, as the distribution of web pages across web servers is non-uniform)
Or avoiding web spam (not only useless but consumes resources and often spoils the collected content)
12. Web Crawling in a Nutshell
Crawler Agents
First in 1993: the Wanderer (written in Perl)
Over 1100 different crawler signatures (the User-Agent string in the HTTP request header) mentioned at http://www.crawltrack.net/crawlerlist.php
Educated guess on the overall number of different crawlers: at least several thousand
Write your own in a few dozen lines of code (using libraries for URL fetching and HTML parsing)
Or use an existing agent: e.g., the wget tool (developed since 1996; http://www.gnu.org/software/wget/)
13. Web Crawling in a Nutshell
Crawler Agents
For advanced things, you may modify the code of existing projects in your preferred programming language
Crawlers play a big role on the Web
Bring more traffic to certain web sites than human visitors
Generate sizeable portion of traffic to any (public) web site
Crawler traffic important for emerging web sites
14. Web Crawling in a Nutshell
Classification
- General/universal crawlers: not so many of them, lots of resources required; big web search engines
- Topical/focused crawlers: pages/sites on a certain topic (crawling all of one specific, i.e. national, web segment is rather general, though)
- Batch crawling: one or several (static) snapshots
- Incremental/continuous crawling: re-visiting; resources divided between fetching newly discovered pages and re-downloading previously crawled pages; used by search engines
15. Applications of Web Crawling
Web Search Engines
Google, Microsoft Bing, (Yahoo), Baidu, Naver, Yandex, Ask, ...
One of three underlying technology stacks
16. Applications of Web Crawling
Web Search Engines
One of three underlying technology stacks
BTW, what are the other two and which is the most 'crucial'?
17. Applications of Web Crawling
Web Search Engines
What are the other two and which is the most ’crucial’?
Query processor (particularly, ranking)
18. Applications of Web Crawling
Web Archiving
Digital preservation
“Librarian” look on the Web
The biggest: Internet Archive
Quite huge collections
Batch crawls
Primarily, collections of national web sites: web sites at country-specific TLDs or physically hosted in a country
There are quite many, and some are huge! See the list of Web Archiving Initiatives at Wikipedia
19. Applications of Web Crawling
Vertical Search Engines
Data aggregating from many sources on certain topic
E.g., apartment search, car search
20. Applications of Web Crawling
Web Data Mining
“To get data to be actually mined”
Usually using focused crawlers
For example, opinion mining
Or digests of current happenings on the Web (e.g., what music people listen to now)
21. Applications of Web Crawling
Web Monitoring
Monitoring sites/pages for changes and updates
22. Applications of Web Crawling
Detection of malicious web sites
Typically part of an anti-virus, firewall, search engine, etc. service
Building a list of such web sites and informing the user about the potential threat of visiting them
23. Applications of Web Crawling
Web site/application testing
Crawl a web site to check navigation through it, the validity of links, etc.
Regression/security/... testing of a rich internet application (RIA) via crawling
Checking different application states by simulating possible user interaction events (e.g., mouse click, time-out)
24. Applications of Web Crawling
Copyright violation detection
Crawl to find (media) items under copyright or links to them
Regularly re-visiting 'suspicious' web sites, forums, etc.
Tasks like finding terrorist chat rooms also go here
25. Applications of Web Crawling
Web Scraping
Extracting particular pieces of information from a group of typically similar pages
When an API to the data is not available
Interestingly, scraping might be preferable even with an API available, as scraped data is often cleaner and more up-to-date than data-via-API
26. Applications of Web Crawling
Web Mirroring
Copying of web sites
Hosting copies on different servers to ensure 24x7 accessibility
27. Industry vs. Academia Divide
In the web crawling domain
Huge lag between industrial and academic web crawlers
Research-wise and development-wise
Algorithms, techniques and strategies used in industrial crawlers (namely, those operated by search engines) are poorly known
Industrial crawlers operate on a web scale, that is, dozens of billions of pages
Only a few academic crawlers have dealt with more than one billion pages
Academic scale is rather hundreds of millions
28. Industry vs. Academia
Re-crawling
Batch crawls in academia
Regular re-crawls by industrial crawlers
Evaluation of crawled data
Crucial for corrections/improvements to crawlers
Direct evaluation by users of search engines
To some extent, artificial evaluation of academic crawls
29. Web Size and Structure
Some numbers
The number of pages per host is not uniform: most hosts contain only a few pages, others contain millions
Roughly 100 links on a page
According to Google statistics (over 4 billion pages, 2010): fetching a page takes 320KB (textual content plus all embeddings)
A page has 10-100KB of textual (HTML) content on average
One trillion URLs known by Google/Yahoo in 2008
30. Web Size and Structure
Some numbers
20 million web pages in 1995 (indexed by AltaVista)
One trillion (10^12) URLs known by Google/Yahoo in 2008
- The 'independent' search engine Majestic12 (P2P crawling) confirms one trillion items
Doesn't mean one trillion indexed pages
Supposedly, the index has dozens of times fewer pages
Cool crawler fact: the IRLbot crawler (running on one server) downloaded 6.4 billion pages over 2 months
Throughput: 1000-1500 pages per second
Over 30 billion discovered URLs
31. Web Size and Structure
Bow-tie model of the Web
Illustration taken from http://dx.doi.org/doi:10.1038/35012155
33. Outline of Part II
Intelligent Web Crawling
Architecture of web crawler
Crawling strategies
Adaptive crawling approaches
34. Architecture of Web Crawler
Crawler crawls the Web
[Diagram: starting from seed URLs, the crawler maintains a URL frontier and grows the set of crawled URLs into the uncrawled Web]
35. Architecture of Web Crawler
Typically in a distributed fashion
[Diagram: the same crawl carried out by multiple crawling threads sharing the seed URLs, the URL frontier and the set of crawled URLs]
36. Architecture of Web Crawler
URL Frontier
Include multiple pages from the same host
Must avoid trying to fetch them all at the same time
Must try to keep all crawling threads busy
Prioritization also helps
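(Illustrative sketch, not from the slides.) One way to meet these frontier requirements is a per-host FIFO queue plus a time-based politeness check: a host cools down between fetches while URLs from other hosts keep the crawling threads busy:

```python
import time
import urllib.parse
from collections import defaultdict, deque

class PoliteFrontier:
    """Toy URL frontier: one FIFO queue per host plus a politeness delay."""
    def __init__(self, delay_seconds=2.0):
        self.delay = delay_seconds
        self.queues = defaultdict(deque)   # host -> queue of URLs
        self.next_allowed = {}             # host -> earliest next fetch time

    def add(self, url):
        host = urllib.parse.urlsplit(url).netloc
        self.queues[host].append(url)

    def next_url(self):
        """Return a URL whose host is currently allowed to be fetched."""
        now = time.monotonic()
        for host, queue in self.queues.items():
            if queue and now >= self.next_allowed.get(host, 0.0):
                self.next_allowed[host] = now + self.delay
                return queue.popleft()
        return None                        # every host is cooling down (or empty)
```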
37. Architecture of Web Crawler
Crawler Architecture
Illustration taken from Introduction to Information Retrieval (Cambridge University Press, 2008) by Manning et al.
38. Architecture of Web Crawler
Content seen?
If page fetched is already in the base/index, don’t process it
Document fingerprints (shingles)
Filtering
Filter out URLs – due to ’politeness’, restrictions on crawl
Fetched robots.txt files are cached to avoid fetching them repeatedly
Duplicate URL Elimination
Check if an extracted+filtered URL has already been passed to the frontier (batch crawling)
More complicated in continuous crawling (different URL frontier implementation)
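(Illustrative sketch, not from the slides.) A toy content-seen test based on shingle fingerprints; real crawlers use more elaborate near-duplicate schemes, e.g. comparing min-hash overlap rather than testing exact equality as done here:

```python
import hashlib

def shingle_fingerprint(text, k=4, keep=8):
    """Fingerprint a document by the `keep` smallest hashes of its k-word shingles."""
    words = text.split()
    shingles = {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}
    hashes = sorted(int(hashlib.md5(s.encode()).hexdigest(), 16) for s in shingles)
    return frozenset(hashes[:keep])

seen_fingerprints = set()

def content_seen(text):
    """True if a document with the same min-hash fingerprint was already processed."""
    fp = shingle_fingerprint(text)
    if fp in seen_fingerprints:
        return True               # already in the base/index: don't process it
    seen_fingerprints.add(fp)
    return False
```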
39. Architecture of Web Crawler
Distributed Crawling
Run multiple crawl threads, under different processes (often at different nodes)
Nodes can be geographically distributed
Partition hosts being crawled across the nodes
40. Architecture of Web Crawler
Host Splitter
Illustration taken from Introduction to Information Retrieval (Cambridge University Press, 2008) by Manning et al.
41. Architecture of Web Crawler
Implementation (in Perl)
Other popular languages: Java, Python, C/C++
42. Architecture of Web Crawler
Crawling objectives
High web coverage
High page freshness
High content quality
High download rate
Internal and External factors
Amount of hardware (I)
Network bandwidth (I)
Rate of web growth (E)
Rate of web change (E)
Amount of malicious content (i.e., spam, duplicates) (E)
43. Crawling Strategies
Download prioritization
Given a period, only a subset of web pages can be downloaded
"Important" pages first
Hence, the need for prioritization
Ordering a queue of URLs to be visited
Strategies (ordering metrics)
Breadth-First, Depth-First
Backlink count
Best-First
PageRank
Shark-Search
44. Crawling Strategies
Breadth-First, Depth-First
Breadth-First search
Implemented with a QUEUE (FIFO)
Pages with the shortest paths first
Depth-First search
Implemented with a STACK (LIFO)
45. Crawling Strategies
Pseudocode for Breadth-First
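The pseudocode figure is not reproduced in the transcript; below is a minimal Python rendering of the breadth-first strategy, with fetch_page and extract_links as assumed helper functions (e.g. as in the earlier sketch):

```python
from collections import deque

def breadth_first_crawl(seeds, max_pages, fetch_page, extract_links):
    """Breadth-first crawl: the URL frontier is a FIFO queue."""
    frontier = deque(seeds)
    seen = set(seeds)            # duplicate URL elimination
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()             # FIFO: earliest-discovered first
        page = fetch_page(url)
        if page is None:
            continue
        fetched += 1
        for link in extract_links(page):
            if link not in seen:
                seen.add(link)
                frontier.append(link)        # enqueue at the back
    return seen
```

Swapping the deque for a stack (append/pop at the same end) turns this into the depth-first variant.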
46. Crawling Strategies
Backlink count
Use the link graph information
Count # of crawled pages that point to a page
Links with highest counts first
47. Crawling Strategies
Best-First
Best link selected based on some criterion
E.g., lexical similarity between the topic's keywords and the link's source page
Similarity score sim(topic, p) assigned to outgoing links of page p
Cosine similarity is often used:
sim(q, p) = Σ_k f_kq · f_kp / ( sqrt(Σ_k f_kq^2) · sqrt(Σ_k f_kp^2) )
where q is a topic, p is a crawled page, and f_kq, f_kp are the frequencies of term k in q and p
48. Crawling Strategies
Pseudocode for Best-First
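The pseudocode figure is again not reproduced; a sketch of best-first crawling with a priority queue, assuming a cosine_similarity(topic, page) helper that implements the formula above:

```python
import heapq

def best_first_crawl(seeds, topic, max_pages, fetch_page, extract_links,
                     cosine_similarity):
    """Best-first crawl: always expand the highest-scoring frontier URL."""
    frontier = [(0.0, url) for url in seeds]   # (negated score, URL) min-heap
    heapq.heapify(frontier)
    seen = set(seeds)
    fetched = 0
    while frontier and fetched < max_pages:
        _, url = heapq.heappop(frontier)       # smallest negated = best score
        page = fetch_page(url)
        if page is None:
            continue
        fetched += 1
        score = cosine_similarity(topic, page)  # sim(topic, p)
        for link in extract_links(page):
            if link not in seen:
                seen.add(link)
                # Outgoing links inherit the score of their source page p.
                heapq.heappush(frontier, (-score, link))
    return seen
```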
49. Crawling Strategies
PageRank
The PageRank of a page is the probability for a random surfer (who follows links randomly) to be on this page at any given time
A page's score (rank) is defined by the scores of the pages with links to this page:
PR(p) = (1 - γ) + γ · Σ_{d ∈ in(p)} PR(d) / |out(d)|
where p is a page, in(p) is the set of pages with links to p, out(d) is the set of links out of d, and γ is the damping factor
The PageRank of pages is periodically recalculated using a data structure with the crawled pages
50. Crawling Strategies
Pseudocode for PageRank
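The pseudocode figure is not reproduced; a sketch of the periodic PageRank recalculation over the crawled link graph, as plain power iteration of the formula above:

```python
def pagerank(link_graph, gamma=0.85, iterations=20):
    """Power iteration over a crawled link graph: {page: [pages it links to]}."""
    ranks = {page: 1.0 for page in link_graph}
    for _ in range(iterations):
        new_ranks = {page: 1.0 - gamma for page in link_graph}
        for d, out_links in link_graph.items():
            if not out_links:
                continue                       # dangling page: no outgoing share
            share = gamma * ranks[d] / len(out_links)
            for p in out_links:
                if p in new_ranks:             # ignore links leaving the crawl
                    new_ranks[p] += share
        ranks = new_ranks
    return ranks

# A PageRank-ordered crawler periodically recomputes these ranks over the
# pages crawled so far and re-prioritizes its URL frontier accordingly.
```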
51. Crawling Strategies
Shark-Search
More emphasis on web segments where relevant pages were found
Penalizing segments yielding few relevant pages
A link's score is defined by the link's anchor text, the text surrounding the link (link context) and the inherited score from ancestor pages (pages pointing to the page with this link)
Parameters:
d - depth bound
r - relative importance of inherited score versus link neighbourhood score
52. Crawling Strategies
Pseudocode for Shark-Search
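The pseudocode figure is not reproduced; below is a simplified sketch of the Shark-Search link-scoring step under the description above (sim is an assumed topic-similarity helper; β mixing anchor and context scores and the decay factor are assumed tunables alongside r):

```python
def shark_link_score(topic, anchor_text, context_text, inherited,
                     parent_relevance, sim, r=0.5, beta=0.8, decay=0.5):
    """Score one outgoing link, Shark-Search style (simplified sketch)."""
    # Inherited component: relevant parents pass on their own relevance;
    # irrelevant parents pass on a decayed version of what they inherited,
    # penalizing segments that yield few relevant pages.
    if parent_relevance > 0:
        new_inherited = decay * parent_relevance
    else:
        new_inherited = decay * inherited
    # Neighbourhood component: anchor text, backed up by the link context.
    anchor_score = sim(topic, anchor_text)
    context_score = anchor_score if anchor_score > 0 else sim(topic, context_text)
    neighbourhood = beta * anchor_score + (1 - beta) * context_score
    # r balances the inherited score against the link-neighbourhood score.
    return r * new_inherited + (1 - r) * neighbourhood
```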
53. Adaptive Crawling
Static vs. adaptive strategies
Strategies presented up to this point are static
They do not adjust in the course of the crawl
Adaptive (intelligent) crawling
InfoSpiders
Ant-based crawling
54. Adaptive Crawling
InfoSpiders
Independent agents crawling in parallel
[Diagram: InfoSpiders agent architecture – an HTML document passes through an HTML parser, noise-word remover and stemmer into a compact document representation; document relevance assessment feeds learning, link assessment and selection, and reproduction-or-death decisions; the agent representation comprises a keyword vector, term weights and neural net weights]
55. Adaptive Crawling
InfoSpiders
Independent agents crawling in parallel
Each agent uses a list of keywords (initialized with the topic keywords)
A neural network evaluates new links
Keywords in the vicinity of a link are used as input
More importance (weight) is given to keywords close to the link
Maximum to words in the anchor text
Output is a numerical quality estimate for the link
The link score is combined with a cosine similarity score (between the agent's keywords and the page containing the link)
56. Adaptive Crawling
InfoSpiders
Each agent has an energy level
The agent moves from the current page to a new page if a stochastic Boltzmann test succeeds, i.e. with a probability that grows with δ, where δ is the difference between the similarity of the new page and of the current page to the agent's keywords
If its energy level passes some threshold, the agent reproduces
The offspring gets half of the parent's frontier
The offspring's keywords are mutated (expanded) with the most frequent terms in the parent's current document
57. Adaptive Crawling
Pseudocode for InfoSpiders
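The slide’s pseudocode is an image; a condensed Python sketch of a
single agent’s step is below. The agent’s helper methods (pick_link,
clone, split_frontier, mutate_keywords), the fetch helper and the
energy update are assumptions, not the slides’ exact pseudocode:

import math, random

def infospider_step(agent, temperature=0.1):
    link = agent.pick_link()   # neural net + cosine scores (assumed helper)
    new_page = fetch(link)     # hypothetical fetch helper
    delta = (cosine_similarity(" ".join(agent.keywords), new_page)
             - cosine_similarity(" ".join(agent.keywords), agent.current_page))
    # Boltzmann acceptance: likelier to move toward more similar pages.
    if random.random() < 1.0 / (1.0 + math.exp(-delta / temperature)):
        agent.current_page = new_page
    agent.energy += delta      # simplified energy update
    if agent.energy > agent.reproduce_threshold:
        child = agent.clone()
        child.frontier = agent.split_frontier()  # half of parent's frontier
        child.mutate_keywords(new_page)          # add frequent parent terms
        return child           # offspring joins the agent population
    return None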
Adaptive Crawling
Ant-based crawling
Motivation: allow crawling agents to communicate with
each other
Follows a model of social insect collective behaviour
Ants leave pheromone along the paths they follow
Other ants then follow such pheromone trails
A crawler agent follows some path by visiting many URLs
At some moment, a certain amount of pheromone (weight)
can be assigned to the sequence of URLs on the followed path
The amount can depend on the similarity of the visited pages to a
given topic
Adaptive Crawling
Ant-based crawling
Ants (crawlers) operate in cycles
During each cycle, agents make a predefined number of
moves (page visits):
#moves = constant ∗ #cycle
At the end of each cycle, pheromone intensity values are
updated for the followed path
The agent-ants then return to their starting positions
Adaptive Crawling
Ant-based crawling
The next link is selected probabilistically, with probabilities defined
by the corresponding pheromone intensities
If there is no pheromone information, an agent-ant moves
randomly
Adaptive Crawling
Ant-based crawling
Probability of selecting a link:
P_ij(t) = τ_ij(t) / Σ_(i,l) τ_il(t)
where t is the cycle number, τ_ij(t) is the pheromone value between p_i and
p_j, and (i, l) designates the presence of a link from p_i to p_l
During a cycle, each ant stores the list of visited URLs
If p_j was already visited, P_ij(t) = 0
At the end of the cycle, the list of visited URLs is emptied
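A small Python sketch of this selection rule; the pheromone table keyed
by (i, j) pairs is an assumed representation:

import random

def select_next_link(page_i, links, pheromone, visited):
    # Already-visited pages get probability 0.
    candidates = [j for j in links if j not in visited]
    if not candidates:
        return None
    weights = [pheromone.get((page_i, j), 0.0) for j in candidates]
    total = sum(weights)
    if total == 0.0:
        return random.choice(candidates)   # no trail yet: move randomly
    r = random.uniform(0.0, total)
    acc = 0.0
    for j, w in zip(candidates, weights):  # roulette-wheel selection
        acc += w
        if r <= acc:
            return j
    return candidates[-1]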
Adaptive Crawling
Implications
Strategies evaluating links based on their context (the text
close by) are not directly applicable to large-scale crawling
E.g., consider crawling 10^9 pages within one month
Crawl rate: 10^9 pages / ~2.6·10^6 seconds, i.e., around 400
documents per second
With on the order of 100 links per page, around 40,000 links per second
Every second, 10,000-30,000 “new” links have to be evaluated
(scored) and added to the frontier
Too many even for evaluating a link’s anchor text only
Outline of Part III
Open Challenges
Crawlers in Web ecosystem
Collaborative web crawling
Deep Web crawling
Crawling multimedia content
Crawlers in Web ecosystem
Push vs. Pull model
Web pages are accessed via a pull model
- HTTP is a pull protocol
That is, a client requests a page from a server
With push, a server would send pages/info to a client
Why Pull?
Pull is just easier for both parties
No ‘agreement’ needed between provider and aggregator
No specific protocols needed for content providers – serving
content is enough
Perhaps the pull model is the reason why the Web
succeeded while earlier hypertext systems failed
Crawlers in Web ecosystem
Why not Push?
Still, the pull model has several disadvantages
What are these?
Crawlers in Web ecosystem
Why not Push?
Still, the pull model has several disadvantages
Publishing/updating content is easier with push: no need for
redundant requests from crawlers
Providers get better control over their content: no need for
crawler politeness
Crawlers in Web ecosystem
Crawler politeness
Content providers possess some control over crawlers
Via special protocols to define access to parts of a site
Via direct banning of agents hitting a site too often
Crawlers in Web ecosystem
Crawler politeness
Robots.txt says what can(not) be crawled
Sitemaps is a newer protocol that lists a site’s crawlable URLs
along with metadata (e.g., last modification time)
No agent should visit any URL whose path starts with
“/notcrawldir”, except an agent called
“goodsearcher”
Example
User-agent: *
Disallow: /notcrawldir
User-agent: goodsearcher
Disallow:
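A polite crawler can check such rules programmatically; Python’s
standard library ships a robots.txt parser (the site URL below is a
placeholder):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://yoursite/robots.txt")   # placeholder URL
rp.read()                                  # fetch and parse robots.txt
rp.can_fetch("*", "http://yoursite/notcrawldir/page.html")             # False
rp.can_fetch("goodsearcher", "http://yoursite/notcrawldir/page.html")  # True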
Collaborative Crawling
Main considerations
Lots of redundant crawling
To get data (often on a specific topic), one needs to crawl broadly
- Often a lack of expertise when a large crawl is required
- Often a lot is crawled, but only a small subset is used
Too many redundant requests for content providers
Idea: have one crawler do a very broad and intensive
crawl, with many parties accessing the crawled data via an API
- Parties specify filters to select the required pages
Crawler as a common service
Collaborative Crawling
Some requirements
Filter language for specifying conditions
Efficient filter processing (millions of filters to process)
Efficient fetching (hundreds of pages per second)
Support for real-time requests
Collaborative Crawling
New component
Processes the stream of crawled documents against a filter index
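A toy model of such a component, with filters reduced to keyword
predicates indexed by keyword (a sketch of the idea, not the design
from the paper cited below):

from collections import defaultdict

class FilterIndex:
    def __init__(self):
        self.by_keyword = defaultdict(set)  # keyword -> ids of filters

    def add_filter(self, filter_id, keywords):
        for kw in keywords:
            self.by_keyword[kw.lower()].add(filter_id)

    def match(self, document_text):
        # Return the ids of all filters triggered by this document.
        hits = set()
        for token in set(document_text.lower().split()):
            hits |= self.by_keyword.get(token, set())
        return hits

# index = FilterIndex()
# index.add_filter("party-1", ["crawling"])
# index.match("intelligent web crawling tutorial")  # -> {"party-1"}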
Collaborative Crawling
Filter processing architecture
Collaborative Crawling
Based on ’The architecture and implementation of an
extensible web crawler’ by Hsieh, Gribble, Levy, 2010
(illustrations on slides 61-62 from Hsieh’s slides)
E.g., 80legs provides similar crawling services
In a way, this reconsiders the pull/push model of content
delivery on the Web
Deep Web Crawling
Visualization of http://amazon.com by the aharef.info applet
Deep Web Crawling
In a nutshell
The problem is in the yellow nodes (designating web form
elements in the visualization)
Content hidden behind HTML forms
● Deep Web: the part of the Web not accessible through search
engines
● My preferred definition: content behind web search forms on
publicly available pages
● Pages with the forms themselves are typically accessible/searchable
(= crawled)
Why is it important?
Large source of structured data
● Forms present a search interface over backend databases
Significant gap in search engine coverage
● Potentially more content than is currently searchable
● More than 10 million distinct HTML forms
● Likely to increase as more data comes online
Size of the deep Web is unclear
● The “500x larger than the surface Web” figures are highly disputable
● The number of resources is a bit simpler to estimate: ~450k
databases on the Web in 2004
● Some deep web content is already crawled/covered by search engines
● Content can often be both searched and browsed via links
categorizing it
● Business-driven sites (e.g., shopping) typically provide both ways of
access
Why crawlers do not crawl the deep Web
Crawlers can’t pass through the forms (they need to specify some values)
● That is, content is “hidden” behind search forms
● Hence another name for the deep Web: hidden Web
To crawl/access the content behind a form, the following is
required:
● Identify a search form on a page
● Fill the form with proper values
● Submit the form
● Get the result pages
● Extract links/data from them
Approaches to deep Web crawling
Google’s Deep Web Crawl (2008)
● Identify search forms
● Pre-compute all interesting form submissions to each
HTML form
● Each form submission corresponds to a distinct URL
(see the sketch below)
● Add the URLs for each form submission into the search
engine index
● This allows reusing the existing search engine infrastructure
● No aim for full coverage of a deep web resource
● Not all forms covered (GET forms only)
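Since a GET form submission is just a URL with query parameters, each
pre-computed submission maps naturally onto one index entry; a sketch
using Python’s standard library (the form action and field values are
made up):

from urllib.parse import urlencode

action = "http://example.com/used-car-search"  # form's action URL (made up)
for values in [{"make": "toyota"}, {"make": "ford"}]:
    url = action + "?" + urlencode(values)     # one distinct URL per submission
    print(url)  # e.g. http://example.com/used-car-search?make=toyota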
Deep Web site identification
• Task: identify a search form leading to content-rich
web pages
• Surprisingly, this is quite a challenging task
• One of the problems: detect whether a form is searchable
Searchable forms
Non-searchable: login forms and forms that require user info
Depends: highly interactive forms, e.g., airline reservations
What are deep Web resources? For example: store locations,
used cars, radio stations, patents, recipes
Deep Web site identification
• Detect if a form is informational
● Challenging for humans too: e.g., assume the form is in an
unknown language
• Detection by building/training binary classifiers
• Forms identified as searchable can then be classified into
domains (e.g., car search, apartment search, etc.)
● Based on form structure (e.g., number of fields)
● Based on form field labels
• A slow process
● Done by a specific component in offline mode
Crawling JavaScript-rich sites
• Web pages have become more responsive, interactive,
user-friendly, etc.
● Thanks to the emergence of new web technologies
such as AJAX
• Besides, these technologies led to the wide spread of web
applications (RIAs)
• A challenge for crawlers, as they do not
● Manipulate the client-side state of a site
● Take into account asynchronous communication
with the server
Crawling JavaScript-rich sites
• Very similar to the deep Web crawling challenge
● Content is hard to crawl
● Direct problem: AJAX/JS-enabled forms are hard to
deal with (e.g., to detect and then to generate meaningful
queries)
• Web pages are designed for human beings, not for
automated programs
• JS code must be processed to get the actual content
● Content changes dynamically
● Lots of additional resources are required (the crawler must
be supplemented with a JS interpreter)
Crawling JavaScript-rich sites
• Several techniques for AJAX crawling have been proposed since
2007/08
● The focus is either on indexing and searching or on testing
RIAs
• Approach:
● An AJAX-enabled web page/application is modeled using
states, events and transitions
● The crawler uses a breadth-first strategy (sketched below):
● It triggers the events on a page
● If the DOM of the page changes, a new
state/transition is added to the transition graph
● It goes back to the initial state to invoke the next event
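A minimal sketch of this exploration loop, with get_events, fire_event
and dom_of as hypothetical hooks into a browser/JS engine, and states
identified by their DOM serialization:

from collections import deque

def crawl_ria(initial_state, get_events, fire_event, dom_of):
    seen = {dom_of(initial_state)}
    transitions = []
    queue = deque([initial_state])  # breadth-first over application states
    while queue:
        state = queue.popleft()
        for event in get_events(state):
            # fire_event is assumed to reset back to `state` afterwards
            new_state = fire_event(state, event)
            dom = dom_of(new_state)
            if dom not in seen:     # DOM changed: a new state was found
                seen.add(dom)
                transitions.append((state, event, new_state))
                queue.append(new_state)
    return transitions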
Crawling Multimedia Content
The Web is now a multimedia platform
Images, video and audio are an integral part of web pages (not
just a supplement to them)
Almost all crawlers, however, treat the Web as a textual
repository
One reason: indexing techniques for multimedia have not yet
reached the maturity required by interesting use
cases/applications
Hence, no real need to harvest multimedia
But state-of-the-art multimedia retrieval/computer vision
techniques already provide adequate search quality
E.g., search for images with a cat and a man based on the
actual image content (not the text around/close to the image)
In the case of video: a set of frames plus audio (which can be
converted to textual form)
Crawling Multimedia Content
Challenges in crawling multimedia
Bigger load on web sites, since files are bigger
More apparent copyright issues
More resources (e.g., bandwidth, storage space) required
from a crawler
More complicated duplicate detection
Re-visiting policy
Crawling Multimedia Content
Scalable Multimedia Web Observatory of ARCOMEM
project (http://www.arcomem.eu)
Focus on web archiving issues
Uses several crawlers
- ’Standard’ crawler for regular web pages
- API crawler to mine social media sources (e.g., Twitter,
Facebook, YouTube)
- Deep Web crawler able to extract information from
pre-defined web sites
Data can be exported in WARC (Web ARChive) files and in
RDF
Future Directions
Collaborative crawling, mixed pull-push model
Scalable adaptive strategies
Understanding site structure
Deep Web crawling
Semantic Web crawling
Media content crawling
Social network crawling
References: Crawl Datasets
Use for building your crawls, web graph analysis, web data
mining tasks, etc.
ClueWeb09 Dataset:
- http://lemurproject.org/clueweb09.php/
- One billion web pages, in ten languages
- 5TBs compressed
- Hosted at several cloud services (free license required) or
a copy can be ordered on hard disks (pay for disks)
ClueWeb12:
- Almost 900 million English web pages
References: Crawl Datasets
Common Crawl Corpus:
- See http://commoncrawl.org/data/accessing-the-data/
and http://aws.amazon.com/datasets/41740
- Around six billion web pages
- Over 100TB uncompressed
- Available as Amazon Web Services’ public dataset (pay for
processing)
References: Crawl Datasets
Internet Archive:
- See http://blog.archive.org/2012/10/26/
80-terabytes-of-archived-web-crawl-data-available-for-resea
- Crawl of 2011
- 80TB WARC files
- 2.7 billion pages
- Includes multimedia data
- Available by request
References: Crawl Datasets
LAW Datasets:
- http://law.dsi.unimi.it/datasets.php
- Variety of web graph datasets (nodes, arcs, etc.), including
basic properties of recent Facebook graphs (!)
- Thoroughly studied in a number of publications
ICWSM 2011 Spinn3r Dataset:
- http://www.icwsm.org/data/
- 130 million blog posts and 230 million social media publications
- 2TB compressed
Academic Web Link Database Project:
- http://cybermetrics.wlv.ac.uk/database/
- Crawls of national universities web sites
References: Literature
For beginners: Udacity/CS101 course;
http://www.udacity.com/overview/Course/cs101
Intermediate: Chapter 20 of Introduction to Information
Retrieval book by Manning, Raghavan, Schütze;
http://nlp.stanford.edu/IR-book/pdf/20crawl.pdf
Intermediate: Current Challenges in Web Crawling tutorial
at ICWE 2013 by Shestakov;
http://www.slideshare.net/denshe/icwe13-tutorial-webcrawling
Advanced: Web Crawling by Olston and Najork;
http://www.nowpublishers.com/product.aspx?product=INR&doi=1500000017
References: Literature
See relevant publications at Mendeley:
http://www.mendeley.com/groups/531771/web-crawling/
Feel free to join the group!
Check ’Deep Web’ group too
http://www.mendeley.com/groups/601801/deep-web/