Intelligent web crawling
Denis Shestakov, Aalto University
Slides for tutorial given at WI-IAT'13 in Atlanta, USA on November 20th, 2013
Outline:
- overview of web crawling;
- intelligent web crawling;
- open challenges
Abstract:
Web crawling, a process of collecting web pages in an automated manner, is the primary and ubiquitous operation used by a large number of web systems and agents, from a simple website-backup program to a major web search engine. Due to the astronomical amount of data already published on the Web and the ongoing exponential growth of web content, any party that wants to take advantage of massive-scale web data faces a high barrier to entry. We start with background on web crawling and the structure of the Web, then discuss different crawling strategies and adaptive web crawling techniques that lead to better overall crawl performance, and finally overview open challenges such as collaborative web crawling, crawling the deep Web and crawling multimedia content.
Slides: http://www.slideshare.net/denshe/intelligent-crawling-shestakovwiiat13
To cite this tutorial:
D. Shestakov, "Intelligent Web Crawling," IEEE Intelligent Informatics Bulletin, 14(1), pp. 5-7, 2013.
A related tutorial was given at ICWE'13, Aalborg, Denmark on 08.07.2013; for that version, please refer to http://dx.doi.org/10.1007/978-3-642-39200-9_49
Transcript:
1. INTELLIGENT WEB CRAWLING
WI-IAT 2013 Tutorial, Atlanta, USA, 20.11.2013
ver 1.8: 10.04.2015
Denis Shestakov
denshe at gmail
Department of Media Technology, Aalto University, Finland
2. References to this tutorial
To cite, please use:
D. Shestakov, "Intelligent Web Crawling," IEEE Intelligent Informatics Bulletin, 14(1), pp. 5-7, 2013.
3. Speaker's Bio
- Postdoc (2009-2013) in Web Services Group, Aalto University, Finland
- PhD thesis (2008) on limited coverage of web crawlers
- Over ten years of experience in the area
- Tutorials on web crawling given at SAC'12 and ICWE'13
[Photo: Web Services Group in 2011]
4. Speaker's Info
As of 2013:
- http://www.linkedin.com/in/dshestakov
- http://www.mendeley.com/profiles/denis-shestakov/
- http://www.researchgate.net/profile/Denis_Shestakov
Current:
- https://mediatech.aalto.fi/~denis/
5. TUTORIAL OUTLINE
I. OVERVIEW
Web crawling in a nutshell
Web crawling applications
Web size and web link structure
II. INTELLIGENT WEB CRAWLING
Architecture of web crawler
Crawling strategies
Adaptive crawling approaches
III. OPEN CHALLENGES
Crawlers in Web ecosystem
Collaborative web crawling
Deep Web crawling
Crawling multimedia content
6. Links to Tutorial
Slides:
http://goo.gl/woVtQk
http://www.slideshare.net/denshe/presentations
Similar tutorials:
Tutorials on web crawling at ICWE’13 and SAC’12
Compared with this tutorial, they give a broader overview of the topic (Parts I and III) but do not cover crawling strategies (Part II)
Supporting materials:
http://www.mendeley.com/groups/531771/web-crawling/
7. PART I: OVERVIEW
Visualization of http://media.tkk.fi/webservices by aharef.info applet
8. Outline of Part I
Overview of Web Crawling
Web crawling in a nutshell
Web crawling applications
Web size and web link structure
9. Web Crawling in a Nutshell
Automatic harvesting of web content
Done by web crawlers (also known as robots, bots or spiders)
Follow a link from a set of links (the URL queue), download the page, extract all links, eliminate the already visited ones, add the rest to the queue
Then repeat
A set of policies is involved (like 'ignore links to images', etc.)
10. Web Crawling in a Nutshell
Example:
1. Follow http://media.tkk.fi/webservices (visualization of its HTML DOM tree below)
2. Extract URLs inside blue bubbles (designating <a> tags)
3. Remove already visited URLs
4. For each non-visited URL, start at Step 1
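(Not part of the original slides.) A minimal sketch of this naive crawl cycle in Python, assuming the third-party requests and beautifulsoup4 packages; politeness, robots.txt and most error handling, which later slides discuss, are deliberately ignored:

```python
import urllib.parse
from collections import deque

import requests
from bs4 import BeautifulSoup

def naive_crawl(seed_urls, max_pages=100):
    """Naive crawl cycle: fetch a URL, extract links, skip visited, repeat."""
    frontier = deque(seed_urls)              # URL queue
    visited = set(seed_urls)                 # already-visited elimination
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue                         # skip unreachable pages
        pages[url] = response.text
        soup = BeautifulSoup(response.text, "html.parser")
        for a in soup.find_all("a", href=True):          # links in <a> tags
            link = urllib.parse.urljoin(url, a["href"])  # resolve relative URLs
            if link.startswith("http") and link not in visited:
                visited.add(link)
                frontier.append(link)        # add the rest to the queue
    return pages
```

Because the frontier is a FIFO queue, this sketch is a breadth-first crawl, a strategy Part II returns to.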
11. Web Crawling in a Nutshell
In essence: a simple and naive process
However, a number of 'restrictions' imposed make it much more complicated
Most complexities are due to the operating environment (the Web)
For example, do not overload web servers (challenging, as the distribution of web pages across web servers is non-uniform)
Or avoiding web spam (not only useless but consumes resources and often spoils the collected content)
12. Web Crawling in a Nutshell
Crawler Agents
First in 1993: the Wanderer (written in Perl)
Over 1100 different crawler signatures (the User-Agent string in the HTTP request header) mentioned at http://www.crawltrack.net/crawlerlist.php
Educated guess on the overall number of different crawlers: at least several thousand
Write your own in a few dozen lines of code (using libraries for URL fetching and HTML parsing)
Or use an existing agent: e.g., the wget tool (developed since 1996; http://www.gnu.org/software/wget/)
13. Web Crawling in a Nutshell
Crawler Agents
For advanced things, you may modify the code of existing projects in your preferred programming language
Crawlers play a big role on the Web
Bring more traffic to certain web sites than human visitors
Generate sizeable portion of traffic to any (public) web site
Crawler traffic important for emerging web sites
14. Web Crawling in a Nutshell
Classification
- General/universal crawlers: not so many of them, lots of resources required; big web search engines
- Topical/focused crawlers: pages/sites on a certain topic (crawling all of one specific, i.e. national, web segment is rather general, though)
- Batch crawling: one or several (static) snapshots
- Incremental/continuous crawling: re-visiting; resources divided between fetching newly discovered pages and re-downloading previously crawled pages; used by search engines
15. Applications of Web Crawling
Web Search Engines
Google, Microsoft Bing, (Yahoo), Baidu, Naver, Yandex, Ask, ...
One of three underlying technology stacks
16. Applications of Web Crawling
Web Search Engines
One of three underlying technology stacks
BTW, what are the other two and which is the most 'crucial'?
17. Applications of Web Crawling
Web Search Engines
What are the other two and which is the most ’crucial’?
Query processor (particularly, ranking)
18. Applications of Web Crawling
Web Archiving
Digital preservation
“Librarian” look on the Web
The biggest: Internet Archive
Quite huge collections
Batch crawls
Primarily, collections of national web sites: web sites at country-specific TLDs or physically hosted in a country
There are quite many, and some are huge! See the list of Web Archiving Initiatives at Wikipedia
19. Applications of Web Crawling
Vertical Search Engines
Data aggregating from many sources on certain topic
E.g., apartment search, car search
20. Applications of Web Crawling
Web Data Mining
“To get data to be actually mined”
Usually using focused crawlers
For example, opinion mining
Or digests of current happenings on the Web (e.g., what music people listen to now)
21. Applications of Web Crawling
Web Monitoring
Monitoring sites/pages for changes and updates
22. Applications of Web Crawling
Detection of malicious web sites
Typically part of an anti-virus, firewall, search engine, etc. service
Building a list of such web sites and informing the user about the potential threat of visiting them
23. Applications of Web Crawling
Web site/application testing
Crawl a web site to check navigation through it, the validity of links, etc.
Regression/security/... testing of a rich internet application (RIA) via crawling
Checking different application states by simulating possible user interaction events (e.g., mouse click, time-out)
24. Applications of Web Crawling
Copyright violation detection
Crawl to find (media) items under copyright or links to them
Regularly re-visiting 'suspicious' web sites, forums, etc.
Tasks like finding terrorist chat rooms also go here
25. Applications of Web Crawling
Web Scraping
Extracting particular pieces of information from a group of typically similar pages
When an API to the data is not available
Interestingly, scraping might be preferable even with an API available, as scraped data is often cleaner and more up-to-date than data-via-API
26. Applications of Web Crawling
Web Mirroring
Copying of web sites
Hosting copies on different servers to ensure 24x7 accessibility
27. Industry vs. Academia Divide
In the web crawling domain
Huge lag between industrial and academic web crawlers
Research-wise and development-wise
Algorithms, techniques and strategies used in industrial crawlers (namely, those operated by search engines) are poorly known
Industrial crawlers operate on a web scale, that is, dozens of billions of pages
Only a few academic crawlers have dealt with more than one billion pages
Academic scale is rather hundreds of millions
28. Industry vs. Academia
Re-crawling
Batch crawls in academia
Regular re-crawls by industrial crawlers
Evaluation of crawled data
Crucial for corrections/improvements to crawlers
Direct evaluation by users of search engines
To some extent, artificial evaluation of academic crawls
29. Web Size and Structure
Some numbers
The number of pages per host is not uniform: most hosts contain only a few pages, others contain millions
Roughly 100 links on a page
According to Google statistics (over 4 billion pages, 2010): fetching a page takes 320KB (textual content plus all embeddings)
A page has 10-100KB of textual (HTML) content on average
One trillion URLs known by Google/Yahoo in 2008
30. Web Size and Structure
Some numbers
20 million web pages in 1995 (indexed by AltaVista)
One trillion (10^12) URLs known by Google/Yahoo in 2008
- The 'independent' search engine Majestic12 (P2P crawling) confirms one trillion items
Doesn't mean one trillion indexed pages
Supposedly, the index has dozens of times fewer pages
Cool crawler fact: the IRLbot crawler (running on one server) downloaded 6.4 billion pages over 2 months
Throughput: 1000-1500 pages per second
Over 30 billion discovered URLs
31. Web Size and Structure
Bow-tie model of the Web
Illustration taken from http://dx.doi.org/doi:10.1038/35012155
33. Outline of Part II
Intelligent Web Crawling
Architecture of web crawler
Crawling strategies
Adaptive crawling approaches
34. Architecture of Web Crawler
Crawler crawls the Web
[Diagram: starting from seed URLs, the crawler maintains a URL frontier and grows the set of crawled URLs into the uncrawled Web]
35. Architecture of Web Crawler
Typically in a distributed fashion
[Diagram: the same crawl carried out by multiple crawling threads sharing the seed URLs, the URL frontier and the set of crawled URLs]
36. Architecture of Web Crawler
URL Frontier
Include multiple pages from the same host
Must avoid trying to fetch them all at the same time
Must try to keep all crawling threads busy
Prioritization also helps
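(Illustrative sketch, not from the slides.) One way to meet these frontier requirements is a per-host FIFO queue plus a time-based politeness check: a host cools down between fetches while URLs from other hosts keep the crawling threads busy:

```python
import time
import urllib.parse
from collections import defaultdict, deque

class PoliteFrontier:
    """Toy URL frontier: one FIFO queue per host plus a politeness delay."""
    def __init__(self, delay_seconds=2.0):
        self.delay = delay_seconds
        self.queues = defaultdict(deque)   # host -> queue of URLs
        self.next_allowed = {}             # host -> earliest next fetch time

    def add(self, url):
        host = urllib.parse.urlsplit(url).netloc
        self.queues[host].append(url)

    def next_url(self):
        """Return a URL whose host is currently allowed to be fetched."""
        now = time.monotonic()
        for host, queue in self.queues.items():
            if queue and now >= self.next_allowed.get(host, 0.0):
                self.next_allowed[host] = now + self.delay
                return queue.popleft()
        return None                        # every host is cooling down (or empty)
```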
37. Architecture of Web Crawler
Crawler Architecture
Illustration taken from Introduction to Information Retrieval (Cambridge University Press, 2008) by Manning et al.
38. Architecture of Web Crawler
Content seen?
If page fetched is already in the base/index, don’t process it
Document fingerprints (shingles)
Filtering
Filter out URLs – due to ’politeness’, restrictions on crawl
Fetched robots.txt files are cached to avoid fetching them repeatedly
Duplicate URL Elimination
Check if an extracted+filtered URL has already been passed to the frontier (batch crawling)
More complicated in continuous crawling (different URL frontier implementation)
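(Illustrative sketch, not from the slides.) A toy content-seen test based on shingle fingerprints; real crawlers use more elaborate near-duplicate schemes, e.g. comparing min-hash overlap rather than testing exact equality as done here:

```python
import hashlib

def shingle_fingerprint(text, k=4, keep=8):
    """Fingerprint a document by the `keep` smallest hashes of its k-word shingles."""
    words = text.split()
    shingles = {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}
    hashes = sorted(int(hashlib.md5(s.encode()).hexdigest(), 16) for s in shingles)
    return frozenset(hashes[:keep])

seen_fingerprints = set()

def content_seen(text):
    """True if a document with the same min-hash fingerprint was already processed."""
    fp = shingle_fingerprint(text)
    if fp in seen_fingerprints:
        return True               # already in the base/index: don't process it
    seen_fingerprints.add(fp)
    return False
```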
39. Architecture of Web Crawler
Distributed Crawling
Run multiple crawl threads, under different processes (often at different nodes)
Nodes can be geographically distributed
Partition hosts being crawled across the nodes
40. Architecture of Web Crawler
Host Splitter
Illustration taken from Introduction to Information Retrieval (Cambridge University Press, 2008) by Manning et al.
41. Architecture of Web Crawler
Implementation (in Perl)
Other popular languages: Java, Python, C/C++
42. Architecture of Web Crawler
Crawling objectives
High web coverage
High page freshness
High content quality
High download rate
Internal and External factors
Amount of hardware (I)
Network bandwidth (I)
Rate of web growth (E)
Rate of web change (E)
Amount of malicious content (i.e., spam, duplicates) (E)
43. Crawling Strategies
Download prioritization
Given a period, only a subset of web pages can be downloaded
"Important" pages first
Hence, the need for prioritization
Ordering a queue of URLs to be visited
Strategies (ordering metrics)
Breadth-First, Depth-First
Backlink count
Best-First
PageRank
Shark-Search
44. Crawling Strategies
Breadth-First, Depth-First
Breadth-First search
Implemented with a QUEUE (FIFO)
Pages with the shortest paths first
Depth-First search
Implemented with a STACK (LIFO)
45. Crawling Strategies
Pseudocode for Breadth-First
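The pseudocode figure is not reproduced in the transcript; below is a minimal Python rendering of the breadth-first strategy, with fetch_page and extract_links as assumed helper functions (e.g. as in the earlier sketch):

```python
from collections import deque

def breadth_first_crawl(seeds, max_pages, fetch_page, extract_links):
    """Breadth-first crawl: the URL frontier is a FIFO queue."""
    frontier = deque(seeds)
    seen = set(seeds)            # duplicate URL elimination
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()             # FIFO: earliest-discovered first
        page = fetch_page(url)
        if page is None:
            continue
        fetched += 1
        for link in extract_links(page):
            if link not in seen:
                seen.add(link)
                frontier.append(link)        # enqueue at the back
    return seen
```

Swapping the deque for a stack (append/pop at the same end) turns this into the depth-first variant.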
46. Crawling Strategies
Backlink count
Use the link graph information
Count # of crawled pages that point to a page
Links with highest counts first
47. Crawling Strategies
Best-First
Best link selected based on some criterion
E.g., lexical similarity between the topic's keywords and the link's source page
Similarity score sim(topic, p) assigned to outgoing links of page p
Cosine similarity is often used:
sim(q, p) = Σ_k f_kq · f_kp / ( sqrt(Σ_k f_kq^2) · sqrt(Σ_k f_kp^2) )
where q is a topic, p is a crawled page, and f_kq, f_kp are the frequencies of term k in q and p
48. Crawling Strategies
Pseudocode for Best-First
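The pseudocode figure is again not reproduced; a sketch of best-first crawling with a priority queue, assuming a cosine_similarity(topic, page) helper that implements the formula above:

```python
import heapq

def best_first_crawl(seeds, topic, max_pages, fetch_page, extract_links,
                     cosine_similarity):
    """Best-first crawl: always expand the highest-scoring frontier URL."""
    frontier = [(0.0, url) for url in seeds]   # (negated score, URL) min-heap
    heapq.heapify(frontier)
    seen = set(seeds)
    fetched = 0
    while frontier and fetched < max_pages:
        _, url = heapq.heappop(frontier)       # smallest negated = best score
        page = fetch_page(url)
        if page is None:
            continue
        fetched += 1
        score = cosine_similarity(topic, page)  # sim(topic, p)
        for link in extract_links(page):
            if link not in seen:
                seen.add(link)
                # Outgoing links inherit the score of their source page p.
                heapq.heappush(frontier, (-score, link))
    return seen
```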
49. Crawling Strategies
PageRank
The PageRank of a page is the probability for a random surfer (who follows links randomly) to be on this page at any given time
A page's score (rank) is defined by the scores of the pages with links to this page:
PR(p) = (1 - γ) + γ · Σ_{d ∈ in(p)} PR(d) / |out(d)|
where p is a page, in(p) is the set of pages with links to p, out(d) is the set of links out of d, and γ is the damping factor
The PageRank of pages is periodically recalculated using a data structure with the crawled pages
50. Crawling Strategies
Pseudocode for PageRank
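The pseudocode figure is not reproduced; a sketch of the periodic PageRank recalculation over the crawled link graph, as plain power iteration of the formula above:

```python
def pagerank(link_graph, gamma=0.85, iterations=20):
    """Power iteration over a crawled link graph: {page: [pages it links to]}."""
    ranks = {page: 1.0 for page in link_graph}
    for _ in range(iterations):
        new_ranks = {page: 1.0 - gamma for page in link_graph}
        for d, out_links in link_graph.items():
            if not out_links:
                continue                       # dangling page: no outgoing share
            share = gamma * ranks[d] / len(out_links)
            for p in out_links:
                if p in new_ranks:             # ignore links leaving the crawl
                    new_ranks[p] += share
        ranks = new_ranks
    return ranks

# A PageRank-ordered crawler periodically recomputes these ranks over the
# pages crawled so far and re-prioritizes its URL frontier accordingly.
```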
51. Crawling Strategies
Shark-Search
More emphasis on web segments where relevant pages were found
Penalizing segments yielding few relevant pages
A link's score is defined by the link's anchor text, the text surrounding the link (link context) and the inherited score from ancestor pages (pages pointing to the page with this link)
Parameters:
d - depth bound
r - relative importance of inherited score versus link neighbourhood score
52. Crawling Strategies
Pseudocode for Shark-Search
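The pseudocode figure is not reproduced; below is a simplified sketch of the Shark-Search link-scoring step under the description above (sim is an assumed topic-similarity helper; β mixing anchor and context scores and the decay factor are assumed tunables alongside r):

```python
def shark_link_score(topic, anchor_text, context_text, inherited,
                     parent_relevance, sim, r=0.5, beta=0.8, decay=0.5):
    """Score one outgoing link, Shark-Search style (simplified sketch)."""
    # Inherited component: relevant parents pass on their own relevance;
    # irrelevant parents pass on a decayed version of what they inherited,
    # penalizing segments that yield few relevant pages.
    if parent_relevance > 0:
        new_inherited = decay * parent_relevance
    else:
        new_inherited = decay * inherited
    # Neighbourhood component: anchor text, backed up by the link context.
    anchor_score = sim(topic, anchor_text)
    context_score = anchor_score if anchor_score > 0 else sim(topic, context_text)
    neighbourhood = beta * anchor_score + (1 - beta) * context_score
    # r balances the inherited score against the link-neighbourhood score.
    return r * new_inherited + (1 - r) * neighbourhood
```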
53. Adaptive Crawling
Static vs. adaptive strategies
Strategies presented up to this point are static
They do not adjust in the course of the crawl
Adaptive (intelligent) crawling
InfoSpiders
Ant-based crawling
54. Adaptive Crawling
InfoSpiders
Independent agents crawling in parallel
[Diagram: InfoSpiders agent architecture – an HTML document passes through an HTML parser, noise-word remover and stemmer into a compact document representation; document relevance assessment feeds learning, link assessment and selection, and reproduction-or-death decisions; the agent representation comprises a keyword vector, term weights and neural net weights]
55. Adaptive Crawling
InfoSpiders
Independent agents crawling in parallel
Each agent uses a list of keywords (initialized with the topic keywords)
A neural network evaluates new links
Keywords in the vicinity of a link are used as input
More importance (weight) is given to keywords close to the link
Maximum to words in the anchor text
Output is a numerical quality estimate for the link
The link score is combined with a cosine similarity score (between the agent's keywords and the page containing the link)
56. Adaptive Crawling
InfoSpiders
Each agent has an energy level
The agent moves from the current page to a new page if a stochastic Boltzmann test succeeds, i.e. with a probability that grows with δ, where δ is the difference between the similarity of the new page and of the current page to the agent's keywords
If its energy level passes some threshold, the agent reproduces
The offspring gets half of the parent's frontier
The offspring's keywords are mutated (expanded) with the most frequent terms in the parent's current document
57. Adaptive Crawling
Pseudocode for InfoSpiders
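The slide’s pseudocode is an image; a condensed Python sketch of a
single agent’s step is below. The agent’s helper methods (pick_link,
clone, split_frontier, mutate_keywords), the fetch helper and the
energy update are assumptions, not the slides’ exact pseudocode:

import math, random

def infospider_step(agent, temperature=0.1):
    link = agent.pick_link()   # neural net + cosine scores (assumed helper)
    new_page = fetch(link)     # hypothetical fetch helper
    delta = (cosine_similarity(" ".join(agent.keywords), new_page)
             - cosine_similarity(" ".join(agent.keywords), agent.current_page))
    # Boltzmann acceptance: likelier to move toward more similar pages.
    if random.random() < 1.0 / (1.0 + math.exp(-delta / temperature)):
        agent.current_page = new_page
    agent.energy += delta      # simplified energy update
    if agent.energy > agent.reproduce_threshold:
        child = agent.clone()
        child.frontier = agent.split_frontier()  # half of parent's frontier
        child.mutate_keywords(new_page)          # add frequent parent terms
        return child           # offspring joins the agent population
    return None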
Adaptive Crawling
Ant-based crawling
Motivation: allow crawling agents to communicate with
each other
Follows a model of social insect collective behaviour
Ants leave pheromone along the paths they follow
Other ants then follow such pheromone trails
A crawler agent follows some path by visiting many URLs
At some moment, a certain amount of pheromone (weight)
can be assigned to the sequence of URLs on the followed path
The amount can depend on the similarity of the visited pages to a
given topic
Adaptive Crawling
Ant-based crawling
Ants (crawlers) operate in cycles
During each cycle, agents make a predefined number of
moves (page visits):
#moves = constant ∗ #cycle
At the end of each cycle, pheromone intensity values are
updated for the followed path
The agent-ants then return to their starting positions
Adaptive Crawling
Ant-based crawling
The next link is selected probabilistically, with probabilities defined
by the corresponding pheromone intensities
If there is no pheromone information, an agent-ant moves
randomly
Adaptive Crawling
Ant-based crawling
Probability of selecting a link:
P_ij(t) = τ_ij(t) / Σ_(i,l) τ_il(t)
where t is the cycle number, τ_ij(t) is the pheromone value between p_i and
p_j, and (i, l) designates the presence of a link from p_i to p_l
During a cycle, each ant stores the list of visited URLs
If p_j was already visited, P_ij(t) = 0
At the end of the cycle, the list of visited URLs is emptied
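A small Python sketch of this selection rule; the pheromone table keyed
by (i, j) pairs is an assumed representation:

import random

def select_next_link(page_i, links, pheromone, visited):
    # Already-visited pages get probability 0.
    candidates = [j for j in links if j not in visited]
    if not candidates:
        return None
    weights = [pheromone.get((page_i, j), 0.0) for j in candidates]
    total = sum(weights)
    if total == 0.0:
        return random.choice(candidates)   # no trail yet: move randomly
    r = random.uniform(0.0, total)
    acc = 0.0
    for j, w in zip(candidates, weights):  # roulette-wheel selection
        acc += w
        if r <= acc:
            return j
    return candidates[-1]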
Adaptive Crawling
Implications
Strategies evaluating links based on their context (the text
close by) are not directly applicable to large-scale crawling
E.g., consider crawling 10^9 pages within one month
Crawl rate: 10^9 pages / ~2.6·10^6 seconds, i.e., around 400
documents per second
With on the order of 100 links per page, around 40,000 links per second
Every second, 10,000-30,000 “new” links have to be evaluated
(scored) and added to the frontier
Too many even for evaluating a link’s anchor text only
Outline of Part III
Open Challenges
Crawlers in Web ecosystem
Collaborative web crawling
Deep Web crawling
Crawling multimedia content
Crawlers in Web ecosystem
Push vs. Pull model
Web pages are accessed via a pull model
- HTTP is a pull protocol
That is, a client requests a page from a server
With push, a server would send pages/info to a client
Why Pull?
Pull is just easier for both parties
No ‘agreement’ needed between provider and aggregator
No specific protocols needed for content providers – serving
content is enough
Perhaps the pull model is the reason why the Web
succeeded while earlier hypertext systems failed
Crawlers in Web ecosystem
Why not Push?
Still, the pull model has several disadvantages
What are these?
Crawlers in Web ecosystem
Why not Push?
Still, the pull model has several disadvantages
Publishing/updating content is easier with push: no need for
redundant requests from crawlers
Providers get better control over their content: no need for
crawler politeness
Crawlers in Web ecosystem
Crawler politeness
Content providers possess some control over crawlers
Via special protocols to define access to parts of a site
Via direct banning of agents hitting a site too often
Crawlers in Web ecosystem
Crawler politeness
Robots.txt says what can(not) be crawled
Sitemaps is a newer protocol that lists a site’s crawlable URLs
along with metadata (e.g., last modification time)
No agent should visit any URL whose path starts with
“/notcrawldir”, except an agent called
“goodsearcher”
Example
User-agent: *
Disallow: /notcrawldir
User-agent: goodsearcher
Disallow:
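A polite crawler can check such rules programmatically; Python’s
standard library ships a robots.txt parser (the site URL below is a
placeholder):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://yoursite/robots.txt")   # placeholder URL
rp.read()                                  # fetch and parse robots.txt
rp.can_fetch("*", "http://yoursite/notcrawldir/page.html")             # False
rp.can_fetch("goodsearcher", "http://yoursite/notcrawldir/page.html")  # True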
Collaborative Crawling
Main considerations
Lots of redundant crawling
To get data (often on a specific topic), one needs to crawl broadly
- Often a lack of expertise when a large crawl is required
- Often a lot is crawled, but only a small subset is used
Too many redundant requests for content providers
Idea: have one crawler do a very broad and intensive
crawl, with many parties accessing the crawled data via an API
- Parties specify filters to select the required pages
Crawler as a common service
Collaborative Crawling
Some requirements
Filter language for specifying conditions
Efficient filter processing (millions of filters to process)
Efficient fetching (hundreds of pages per second)
Support for real-time requests
Collaborative Crawling
New component
Processes the stream of crawled documents against a filter index
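A toy model of such a component, with filters reduced to keyword
predicates indexed by keyword (a sketch of the idea, not the design
from the paper cited below):

from collections import defaultdict

class FilterIndex:
    def __init__(self):
        self.by_keyword = defaultdict(set)  # keyword -> ids of filters

    def add_filter(self, filter_id, keywords):
        for kw in keywords:
            self.by_keyword[kw.lower()].add(filter_id)

    def match(self, document_text):
        # Return the ids of all filters triggered by this document.
        hits = set()
        for token in set(document_text.lower().split()):
            hits |= self.by_keyword.get(token, set())
        return hits

# index = FilterIndex()
# index.add_filter("party-1", ["crawling"])
# index.match("intelligent web crawling tutorial")  # -> {"party-1"}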
Collaborative Crawling
Filter processing architecture
Collaborative Crawling
Based on ’The architecture and implementation of an
extensible web crawler’ by Hsieh, Gribble, Levy, 2010
(illustrations on slides 61-62 from Hsieh’s slides)
E.g., 80legs provides similar crawling services
In a way, this reconsiders the pull/push model of content
delivery on the Web
Deep Web Crawling
Visualization of http://amazon.com by the aharef.info applet
Deep Web Crawling
In a nutshell
The problem is in the yellow nodes (designating web form
elements in the visualization)
Content hidden behind HTML forms
● Deep Web: the part of the Web not accessible through search
engines
● My preferred definition: content behind web search forms on
publicly available pages
● Pages with the forms themselves are typically accessible/searchable
(= crawled)
Why is it important?
Large source of structured data
● Forms present a search interface over backend databases
Significant gap in search engine coverage
● Potentially more content than is currently searchable
● More than 10 million distinct HTML forms
● Likely to increase as more data comes online
Size of the deep Web is unclear
● The “500x larger than the surface Web” figures are highly disputable
● The number of resources is a bit simpler to estimate: ~450k
databases on the Web in 2004
● Some deep web content is already crawled/covered by search engines
● Content can often be both searched and browsed via links
categorizing it
● Business-driven sites (e.g., shopping) typically provide both ways of
access
Why crawlers do not crawl the deep Web
Crawlers can’t pass through the forms (they need to specify some values)
● That is, content is “hidden” behind search forms
● Hence another name for the deep Web: hidden Web
To crawl/access the content behind a form, the following is
required:
● Identify a search form on a page
● Fill the form with proper values
● Submit the form
● Get the result pages
● Extract links/data from them
Approaches to deep Web crawling
Google’s Deep Web Crawl (2008)
● Identify search forms
● Pre-compute all interesting form submissions to each
HTML form
● Each form submission corresponds to a distinct URL
(see the sketch below)
● Add the URLs for each form submission into the search
engine index
● This allows reusing the existing search engine infrastructure
● No aim for full coverage of a deep web resource
● Not all forms covered (GET forms only)
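Since a GET form submission is just a URL with query parameters, each
pre-computed submission maps naturally onto one index entry; a sketch
using Python’s standard library (the form action and field values are
made up):

from urllib.parse import urlencode

action = "http://example.com/used-car-search"  # form's action URL (made up)
for values in [{"make": "toyota"}, {"make": "ford"}]:
    url = action + "?" + urlencode(values)     # one distinct URL per submission
    print(url)  # e.g. http://example.com/used-car-search?make=toyota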
Deep Web site identification
• Task: identify a search form leading to content-rich
web pages
• Surprisingly, this is quite a challenging task
• One of the problems: detect whether a form is searchable
Searchable forms
Non-searchable: login forms and forms that require user info
Depends: highly interactive forms, e.g., airline reservations
What are deep Web resources? For example: store locations,
used cars, radio stations, patents, recipes
Deep Web site identification
• Detect if a form is informational
● Challenging for humans too: e.g., assume the form is in an
unknown language
• Detection by building/training binary classifiers
• Forms identified as searchable can then be classified into
domains (e.g., car search, apartment search, etc.)
● Based on form structure (e.g., number of fields)
● Based on form field labels
• A slow process
● Done by a specific component in offline mode
Crawling JavaScript-rich sites
• Web pages have become more responsive, interactive,
user-friendly, etc.
● Thanks to the emergence of new web technologies
such as AJAX
• Besides, these technologies led to the wide spread of web
applications (RIAs)
• A challenge for crawlers, as they do not
● Manipulate the client-side state of a site
● Take into account asynchronous communication
with the server
Crawling JavaScript-rich sites
• Very similar to the deep Web crawling challenge
● Content is hard to crawl
● Direct problem: AJAX/JS-enabled forms are hard to
deal with (e.g., to detect and then to generate meaningful
queries)
• Web pages are designed for human beings, not for
automated programs
• JS code must be processed to get the actual content
● Content changes dynamically
● Lots of additional resources are required (the crawler must
be supplemented with a JS interpreter)
Crawling JavaScript-rich sites
• Several techniques for AJAX crawling have been proposed since
2007/08
● The focus is either on indexing and searching or on testing
RIAs
• Approach:
● An AJAX-enabled web page/application is modeled using
states, events and transitions
● The crawler uses a breadth-first strategy (sketched below):
● It triggers the events on a page
● If the DOM of the page changes, a new
state/transition is added to the transition graph
● It goes back to the initial state to invoke the next event
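A minimal sketch of this exploration loop, with get_events, fire_event
and dom_of as hypothetical hooks into a browser/JS engine, and states
identified by their DOM serialization:

from collections import deque

def crawl_ria(initial_state, get_events, fire_event, dom_of):
    seen = {dom_of(initial_state)}
    transitions = []
    queue = deque([initial_state])  # breadth-first over application states
    while queue:
        state = queue.popleft()
        for event in get_events(state):
            # fire_event is assumed to reset back to `state` afterwards
            new_state = fire_event(state, event)
            dom = dom_of(new_state)
            if dom not in seen:     # DOM changed: a new state was found
                seen.add(dom)
                transitions.append((state, event, new_state))
                queue.append(new_state)
    return transitions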
Crawling Multimedia Content
The Web is now a multimedia platform
Images, video and audio are an integral part of web pages (not
just a supplement to them)
Almost all crawlers, however, treat the Web as a textual
repository
One reason: indexing techniques for multimedia have not yet
reached the maturity required by interesting use
cases/applications
Hence, no real need to harvest multimedia
But state-of-the-art multimedia retrieval/computer vision
techniques already provide adequate search quality
E.g., search for images with a cat and a man based on the
actual image content (not the text around/close to the image)
In the case of video: a set of frames plus audio (which can be
converted to textual form)
Crawling Multimedia Content
Challenges in crawling multimedia
Bigger load on web sites, since files are bigger
More apparent copyright issues
More resources (e.g., bandwidth, storage space) required
from a crawler
More complicated duplicate detection
Re-visiting policy
Crawling Multimedia Content
Scalable Multimedia Web Observatory of ARCOMEM
project (http://www.arcomem.eu)
Focus on web archiving issues
Uses several crawlers
- ’Standard’ crawler for regular web pages
- API crawler to mine social media sources (e.g., Twitter,
Facebook, YouTube)
- Deep Web crawler able to extract information from
pre-defined web sites
Data can be exported in WARC (Web ARChive) files and in
RDF
Future Directions
Collaborative crawling, mixed pull-push model
Scalable adaptive strategies
Understanding site structure
Deep Web crawling
Semantic Web crawling
Media content crawling
Social network crawling
References: Crawl Datasets
Use for building your crawls, web graph analysis, web data
mining tasks, etc.
ClueWeb09 Dataset:
- http://lemurproject.org/clueweb09.php/
- One billion web pages, in ten languages
- 5TBs compressed
- Hosted at several cloud services (free license required) or
a copy can be ordered on hard disks (pay for disks)
ClueWeb12:
- Almost 900 million English web pages
References: Crawl Datasets
Common Crawl Corpus:
- See http://commoncrawl.org/data/accessing-the-data/
and http://aws.amazon.com/datasets/41740
- Around six billion web pages
- Over 100TB uncompressed
- Available as Amazon Web Services’ public dataset (pay for
processing)
References: Crawl Datasets
Internet Archive:
- See http://blog.archive.org/2012/10/26/
80-terabytes-of-archived-web-crawl-data-available-for-resea
- Crawl of 2011
- 80TB WARC files
- 2.7 billion pages
- Includes multimedia data
- Available by request
References: Crawl Datasets
LAW Datasets:
- http://law.dsi.unimi.it/datasets.php
- Variety of web graph datasets (nodes, arcs, etc.), including
basic properties of recent Facebook graphs (!)
- Thoroughly studied in a number of publications
ICWSM 2011 Spinn3r Dataset:
- http://www.icwsm.org/data/
- 130 million blog posts and 230 million social media publications
- 2TB compressed
Academic Web Link Database Project:
- http://cybermetrics.wlv.ac.uk/database/
- Crawls of national universities web sites
References: Literature
For beginners: Udacity/CS101 course;
http://www.udacity.com/overview/Course/cs101
Intermediate: Chapter 20 of Introduction to Information
Retrieval book by Manning, Raghavan, Schütze;
http://nlp.stanford.edu/IR-book/pdf/20crawl.pdf
Intermediate: Current Challenges in Web Crawling tutorial
at ICWE 2013 by Shestakov;
http://www.slideshare.net/denshe/icwe13-tutorial-webcrawling
Advanced: Web Crawling by Olston and Najork;
http://www.nowpublishers.com/product.aspx?product=INR&doi=1500000017
References: Literature
See relevant publications at Mendeley:
http://www.mendeley.com/groups/531771/web-crawling/
Feel free to join the group!
Check ’Deep Web’ group too
http://www.mendeley.com/groups/601801/deep-web/