Web Crawler
Chapter - 3
Book: Search Engines – Information Retrieval in Practice
Lecturer: Zeeshan Bhatti
All slides ©Addison Wesley, 2008
INFO-2402
INFORMATION RETRIEVAL TECHNOLOGIES
Lecture 3: 11th February, 2015
Assignment
(Due: Wednesday, 18th Feb)
• Task 1: Find the names of at least 10 unconventional search engines
(not including Google, Yahoo, MSN, Bing).
• Think of at least 3 random search queries
– Political
– Sports
– Technology
Search the same query string on at least 5 search engines and analyse the
results.
Submission:
1. Query results snapshot
2. Your analysis of which search engine generated better results.
3. Describe how different the results from each search engine are (number of identical
websites, same blogs, forums, etc.) for each query category.
4. Which search engine, in your perception, generated results more relevant to your
requirements (i.e. what you needed, results based on your region/country)?
5. Try the same query on a friend's PC and analyze the results. Do you get the
same results? Is there any difference in the results?
Maximum group size: 2
Web Crawler
• Finds and downloads web pages automatically
– provides the collection for searching
• Web is huge and constantly growing
• Web is not under the control of search engine
providers
• Web pages are constantly changing
• Crawlers also used for other types of data
Retrieving Web Pages
• Every page has a unique uniform resource
locator (URL)
• Web pages are stored on web servers that use
HTTP to exchange information with client
software
• e.g.,
Retrieving Web Pages
• Web crawler client program connects to a
domain name system (DNS) server
• DNS server translates the hostname into an
internet protocol (IP) address
• Crawler then attempts to connect to server
host using specific port
• After connection, crawler sends an HTTP
request to the web server to request a page
– usually a GET request
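The retrieval steps above can be sketched in Python. This is a minimal illustration, not how any particular crawler is implemented: the function name `build_get_request` and the URL are assumptions made for the example, and the DNS lookup and socket connection are only described in comments so the sketch runs offline.

```python
from urllib.parse import urlsplit


def build_get_request(url: str) -> tuple[str, int, str]:
    """Return (hostname, port, raw HTTP request text) for a URL.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    host = parts.hostname
    # Default ports: 80 for http, 443 for https, unless the URL names one.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    # A real crawler would now resolve the hostname via DNS, e.g. with
    # socket.gethostbyname(host), open a TCP connection to (ip, port),
    # and send the request text below.
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n"
        "\r\n"
    )
    return host, port, request


host, port, request = build_get_request("http://www.example.com/people.html")
print(host, port)
print(request)
```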
Crawling the Web
Web Crawler
• Starts with a set of seeds, which are a set of
URLs given to it as parameters
• Seeds are added to a URL request queue
• Crawler starts fetching pages from the request
queue
• Downloaded pages are parsed to find link tags
that might contain other useful URLs to fetch
• New URLs added to the crawler’s request
queue, or frontier
• Continue until no more new URLs or disk full
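The loop described above (seeds, request queue, link extraction, frontier) can be sketched as follows. This is a simplified single-threaded sketch under stated assumptions: `crawl`, `LinkParser`, and the in-memory `FAKE_WEB` dictionary are hypothetical names introduced here, the `fetch` callback stands in for a real HTTP download, and `max_pages` stands in for the "disk full" stopping condition.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl; fetch(url) returns HTML text or None."""
    frontier = deque(seeds)   # the URL request queue
    seen = set(seeds)         # URLs already queued, to avoid re-adding them
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkParser(url)
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages


# Hypothetical three-page web used instead of real network access.
FAKE_WEB = {
    "http://a.example/": '<a href="/b">b</a> <a href="http://c.example/">c</a>',
    "http://a.example/b": '<a href="/">home</a>',
    "http://c.example/": "no links here",
}
pages = crawl(["http://a.example/"], FAKE_WEB.get)
print(sorted(pages))
```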
Web Crawling
• Web crawlers spend a lot of time waiting for
responses to requests
• To reduce this inefficiency, web crawlers use
threads and fetch hundreds of pages at once
• Crawlers could potentially flood sites with
requests for pages
• To avoid this problem, web crawlers use
politeness policies
– e.g., delay between requests to same web server
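One way to picture a politeness policy is a per-host timer that every crawler thread consults before sending a request. The sketch below is an assumption-laden illustration: the class name `PolitenessPolicy`, the 5-second default delay, and the injectable `clock` are choices made for this example, not values taken from the slides.

```python
import time
from urllib.parse import urlsplit


class PolitenessPolicy:
    """Tracks, per web server, the earliest time the next request may be sent."""

    def __init__(self, delay_seconds=5.0, clock=time.monotonic):
        self.delay = delay_seconds
        self.clock = clock          # injectable so the policy can be tested
        self.next_allowed = {}      # hostname -> earliest allowed request time

    def wait_time(self, url):
        """Seconds a thread should still wait before requesting this URL."""
        host = urlsplit(url).hostname
        return max(0.0, self.next_allowed.get(host, 0.0) - self.clock())

    def record_request(self, url):
        """Call right after sending a request to start the delay for that host."""
        host = urlsplit(url).hostname
        self.next_allowed[host] = self.clock() + self.delay
```

The delay applies only to URLs on the same host, so threads can keep fetching from other servers while one host is cooling down.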
Controlling Crawling
• Even crawling a site slowly will anger some
web server administrators, who object to any
copying of their data
• Robots.txt file can be used to control crawlers
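As an illustration of how a crawler might honour robots.txt, the sketch below uses Python's standard `urllib.robotparser`. The file contents, the crawler names `MyCrawler` and `FavoredCrawler`, and the URLs are hypothetical; a real crawler would download the file from the site's `/robots.txt` instead of parsing a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: everyone is kept out of /private/,
# except FavoredCrawler, which may fetch anything.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: FavoredCrawler
Disallow:
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

for agent, url in [
    ("MyCrawler", "http://www.example.com/index.html"),
    ("MyCrawler", "http://www.example.com/private/data.html"),
    ("FavoredCrawler", "http://www.example.com/private/data.html"),
]:
    print(agent, url, rp.can_fetch(agent, url))
```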
Simple Crawler Thread
Freshness
• Web pages are constantly being added,
deleted, and modified
• Web crawler must continually revisit pages it
has already crawled to see if they have
changed in order to maintain the freshness of
the document collection
– stale copies no longer reflect the real contents of
the web pages
Freshness
• HTTP protocol has a special request type
called HEAD that makes it easy to check for
page changes
– returns information about page, not page itself
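A possible use of HEAD is to read the `Last-Modified` response header and compare it with the time of the last crawl. The sketch below assumes the server sends that header; `fetch_last_modified` and `changed_since` are hypothetical helper names, and only the comparison is exercised here so the example runs without network access.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from urllib.request import Request, urlopen


def fetch_last_modified(url, timeout=10):
    """Send a HEAD request and return the Last-Modified header, or None."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return response.headers.get("Last-Modified")


def changed_since(last_modified_header, last_crawl):
    """True if the page should be fetched again."""
    if last_modified_header is None:
        return True  # no information, so assume the page may have changed
    return parsedate_to_datetime(last_modified_header) > last_crawl


last_crawl = datetime(2015, 2, 11, tzinfo=timezone.utc)
print(changed_since("Wed, 18 Feb 2015 10:00:00 GMT", last_crawl))
print(changed_since("Sun, 01 Feb 2015 10:00:00 GMT", last_crawl))
```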
Freshness
• Not possible to constantly check all pages
– must check important pages and pages that
change frequently
• Freshness is the proportion of pages that are
fresh
• Optimizing for this metric can lead to bad
decisions, such as not crawling popular sites
• Age is a better metric
Freshness vs. Age
Age
• Expected age of a page t days after it was last
crawled:
• Web page updates follow the Poisson
distribution on average
– time until the next update is governed by an
exponential distribution
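The formula on this slide did not survive in the transcript. Under the stated model, with λ as the mean change rate, the expected age is usually written as the integral below; the closed form on the last line follows from integrating by parts. Treat this as a reconstruction under the Poisson assumption, not a copy of the original slide.

```latex
\begin{align}
\mathrm{Age}(\lambda, t)
  &= \int_0^t P(\text{page changed at time } x)\,(t - x)\,dx \\
  &= \int_0^t \lambda e^{-\lambda x}\,(t - x)\,dx \\
  &= t - \frac{1 - e^{-\lambda t}}{\lambda}
\end{align}
```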
Age
• Older a page gets, the more it costs not to
crawl it
– e.g., expected age with mean change frequency
λ = 1/7 (one change per week)
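To get a feel for the numbers, the sketch below evaluates the expected age for λ = 1/7. It assumes the Poisson update model, under which integrating λe^(−λx)(t − x) from 0 to t gives the closed form t − (1 − e^(−λt))/λ; the function name `expected_age` is introduced here for illustration.

```python
import math


def expected_age(lam, t):
    """Expected age (in days) of a page t days after its last crawl,
    assuming changes arrive as a Poisson process with rate lam per day."""
    return t - (1.0 - math.exp(-lam * t)) / lam


for days in (1, 7, 14, 28):
    print(days, round(expected_age(1 / 7, days), 3))
```

The printed values grow faster than linearly in t, which is the sense in which an older page costs more not to crawl.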
Focused Crawling
• Attempts to download only those pages that
are about a particular topic
– used by vertical search applications
• Rely on the fact that pages about a topic tend
to have links to other pages on the same topic
– popular pages for a topic are typically used as
seeds
• Crawler uses text classifier to decide whether
a page is on topic
Deep Web
• Sites that are difficult for a crawler to find are
collectively referred to as the deep (or hidden)
Web
– much larger than conventional Web
• Three broad categories:
– private sites
• no incoming links, or may require log in with a valid account
– form results
• sites that can be reached only after entering some data into
a form
– scripted pages
• pages that use JavaScript, Flash, or another client-side
language to generate links
Sitemaps
• Sitemaps contain lists of URLs and data about
those URLs, such as modification time and
modification frequency
• Generated by web server administrators
• Tells crawler about pages it might not
otherwise find
• Gives crawler a hint about when to check a
page for changes
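A sketch of how a crawler could read a sitemap with Python's standard XML parser. The sitemap below is a hypothetical example in the sitemaps.org format, and `parse_sitemap` is a name chosen for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap for illustration.
SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2015-01-15</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.7</priority>
  </url>
  <url>
    <loc>http://www.example.com/news/</loc>
    <changefreq>daily</changefreq>
  </url>
</urlset>
"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return one dict per <url> entry; missing fields come back as None."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        entries.append({
            "loc": url.findtext("sm:loc", namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            "changefreq": url.findtext("sm:changefreq", namespaces=NS),
        })
    return entries


entries = parse_sitemap(SITEMAP_XML)
for entry in entries:
    print(entry)
```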
Sitemap Example
Distributed Crawling
• Three reasons to use multiple computers for
crawling
– Helps to put the crawler closer to the sites it crawls
– Reduces the number of sites the crawler has to
remember
– Reduces computing resources required
• Distributed crawler uses a hash function to assign
URLs to crawling computers
– hash function should be computed on the host part of
each URL
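Hashing on the host part can be sketched as follows. The choice of SHA-256 and the function name `assign_machine` are assumptions for this example; any stable hash function would serve. Because only the hostname is hashed, all URLs of one site go to the same crawling computer, which can then enforce politeness for that site locally.

```python
import hashlib
from urllib.parse import urlsplit


def assign_machine(url, num_machines):
    """Map a URL to a crawler machine index using a hash of its hostname."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_machines


for url in [
    "http://www.example.com/index.html",
    "http://www.example.com/people.html",
    "http://news.example.org/today",
]:
    print(url, "->", assign_machine(url, 4))
```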
Desktop Crawls
• Used for desktop search and enterprise search
• Differences to web crawling:
– Much easier to find the data
– Responding quickly to updates is more important
– Must be conservative in terms of disk and CPU
usage
– Many different document formats
– Data privacy very important
Document Feeds
• Many documents are published
– created at a fixed time and rarely updated again
– e.g., news articles, blog posts, press releases,
email
• Published documents from a single source can
be ordered in a sequence called a document
feed
– new documents found by examining the end of
the feed
Document Feeds
• Two types:
– A push feed alerts the subscriber to new
documents
– A pull feed requires the subscriber to check
periodically for new documents
• Most common format for pull feeds is called
RSS
– Really Simple Syndication, RDF Site Summary, Rich
Site Summary, or ...
RSS (Rich Site Summary)
• RSS (Rich Site Summary); originally RDF Site Summary; often
called Really Simple Syndication, uses a family of standard
web feed formats to publish frequently updated information:
blog entries, news headlines, audio, video.
• An RSS document (called "feed", "web feed", or "channel")
includes full or summarized text, and metadata, like
publishing date and author's name.
• RSS feeds enable publishers to syndicate data automatically.
A standard XML file format ensures compatibility with many
different machines/programs. RSS feeds also benefit users
who want to receive timely updates from favourite websites
or to aggregate data from many sites.
RSS Example
RSS Example
RSS
• ttl tag (time to live)
– amount of time (in minutes) contents should be
cached
• RSS feeds are accessed like web pages
– using HTTP GET requests to web servers that host
them
• Easy for crawlers to parse
• Easy to find new information
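Because RSS is plain XML, a crawler can parse a feed with a standard XML library. The feed below is a hypothetical RSS 2.0 document and `parse_feed` is a name introduced for this sketch; a real crawler would first fetch the feed with an HTTP GET.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed for illustration.
RSS_XML = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>http://www.example.com/</link>
    <description>Hypothetical feed</description>
    <ttl>60</ttl>
    <item>
      <title>First story</title>
      <link>http://www.example.com/story1.html</link>
      <pubDate>Wed, 11 Feb 2015 09:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second story</title>
      <link>http://www.example.com/story2.html</link>
      <pubDate>Wed, 11 Feb 2015 12:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>
"""


def parse_feed(xml_text):
    """Return (ttl_minutes, list of (title, link)) from an RSS 2.0 feed."""
    channel = ET.fromstring(xml_text).find("channel")
    # ttl is optional; 0 here means "no caching hint given".
    ttl_minutes = int(channel.findtext("ttl", default="0"))
    items = [
        (item.findtext("title"), item.findtext("link"))
        for item in channel.findall("item")
    ]
    return ttl_minutes, items


ttl_minutes, items = parse_feed(RSS_XML)
print("re-check after", ttl_minutes, "minutes")
for title, link in items:
    print(title, link)
```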
Conversion
• Text is stored in hundreds of incompatible file
formats
– e.g., raw text, RTF, HTML, XML, Microsoft Word, ODF,
PDF
• Other types of files also important
– e.g., PowerPoint, Excel
• Typically use a conversion tool
– converts the document content into a tagged text
format such as HTML or XML
– retains some of the important formatting information
Character Encoding
• A character encoding is a mapping between
bits and glyphs
– i.e., getting from bits in a file to characters on a
screen
– Can be a major source of incompatibility
• ASCII is basic character encoding scheme for
English
– encodes 128 letters, numbers, special characters,
and control characters in 7 bits, extended with an
extra bit for storage in bytes
Character Encoding
• Other languages can have many more glyphs
– e.g., Chinese has more than 40,000 characters, with
over 3,000 in common use
• Many languages have multiple encoding schemes
– e.g., CJK (Chinese-Japanese-Korean) family of East
Asian languages, Hindi, Arabic
– must specify encoding
– can’t have multiple languages in one file
• Unicode developed to address encoding
problems
Unicode
• Single mapping from numbers to glyphs that
attempts to include all glyphs in common use
in all known languages
• Unicode is a mapping between numbers and
glyphs
– does not uniquely specify bits to glyph mapping!
– e.g., UTF-8, UTF-16, UTF-32
Unicode
• Proliferation of encodings comes from a need
for compatibility and to save space
– UTF-8 uses one byte for English (ASCII), as many as
4 bytes for some traditional Chinese characters
– variable length encoding, more difficult to do
string operations
– UTF-32 uses 4 bytes for every character
• Many applications use UTF-32 for internal text
encoding (fast random lookup) and UTF-8 for
disk storage (less space)
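The size trade-off can be checked directly in Python by encoding a few characters. The sample characters are choices made for this illustration; the little-endian codec names are used only so that no byte-order mark is added to the counts.

```python
# Sample characters: ASCII, Arabic, a common CJK ideograph, and a rare
# CJK ideograph outside the Basic Multilingual Plane (U+20000).
samples = {
    "ASCII letter": "A",
    "Arabic letter": "\u0628",
    "common CJK": "\u4e2d",
    "rare CJK": "\U00020000",
}

sizes = {
    label: {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}
    for label, ch in samples.items()
}

for label, by_encoding in sizes.items():
    print(label, by_encoding)
```

UTF-32 always uses 4 bytes, which makes indexing by character position simple, while UTF-8 ranges from 1 to 4 bytes per character.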
Unicode
• Arabic Unicode Scheme
Storing the Documents
• Many reasons to store converted document
text
– saves crawling time when page is not updated
– provides efficient access to text for snippet
generation, information extraction, etc.
• Database systems can provide document
storage for some applications
– web search engines use customized document
storage systems
Storing the Documents
• Requirements for document storage system:
– Random access
• request the content of a document based on its URL
• hash function based on URL is typical
– Compression and large files
• reducing storage requirements and efficient access
– Update
• handling large volumes of new and modified
documents
• adding new anchor text
Large Files
• Store many documents in large files, rather
than one file per document
– avoids overhead in opening and closing files
– reduces seek time relative to read time
• Compound document formats
– used to store multiple documents in a file
– e.g., TREC Web
Compression
• Text is highly redundant (or predictable)
• Compression techniques exploit this redundancy
to make files smaller without losing any of the
content
• Compression of indexes covered later
• Popular algorithms can compress HTML and XML
text by 80%
– e.g., DEFLATE (zip, gzip) and LZW (UNIX compress,
PDF)
– may compress large files in blocks to make access
faster
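A small demonstration of DEFLATE using Python's standard `zlib` module. The HTML here is an artificial, highly repetitive sample, so the ratio it prints is much better than real pages would give; it is meant only to show lossless round-tripping, not to confirm the 80% figure.

```python
import zlib

# Artificial, repetitive HTML used only for illustration.
html = (
    "<html><body>"
    + "<p>Web crawlers find and download web pages automatically.</p>" * 200
    + "</body></html>"
).encode("utf-8")

compressed = zlib.compress(html)
restored = zlib.decompress(compressed)

print("original bytes:  ", len(html))
print("compressed bytes:", len(compressed))
print("lossless:", restored == html)
```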
BigTable
• Google’s document storage system
– Customized for storing, finding, and updating web
pages
– Handles large collection sizes using inexpensive
computers
BigTable
• No query language, no complex queries to
optimize
• Only row-level transactions
• Tablets are stored in a replicated file system that
is accessible by all BigTable servers
• Any changes to a BigTable tablet are recorded to
a transaction log, which is also stored in a shared
file system
• If any tablet server crashes, another server can
immediately read the tablet data and transaction
log from the file system and take over
• Most relational databases store their data in
files that are constantly modified. In contrast,
BigTable stores its data in immutable
(unchangeable) files.
• Once data is written to a BigTable file, it is
never changed.
• This helps in failure recovery.
BigTable
• Logically organized into rows
• A row stores data for a single web page
• The URL, www.example.com, is the row key,
which can be used to find this row.
• Combination of a row key, a column key, and a
timestamp point to a single cell in the row
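The addressing scheme (row key, column key, timestamp) can be mimicked with nested dictionaries. This is a toy in-memory model for intuition only: `ToyTable` and its methods are hypothetical and are not BigTable's API, and the column names are made up for the example.

```python
from collections import defaultdict


class ToyTable:
    """Toy model: row key -> column key -> list of (timestamp, value)."""

    def __init__(self):
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, timestamp, value):
        # Existing versions are never overwritten; a new version is appended.
        self.rows[row][column].append((timestamp, value))

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before timestamp (newest overall if None)."""
        versions = self.rows.get(row, {}).get(column, [])
        if timestamp is not None:
            versions = [v for v in versions if v[0] <= timestamp]
        return max(versions, key=lambda v: v[0])[1] if versions else None


table = ToyTable()
table.put("www.example.com", "text", 1, "<html>old</html>")
table.put("www.example.com", "text", 2, "<html>new</html>")
table.put("www.example.com", "title", 1, "Example")
print(table.get("www.example.com", "text"))
print(table.get("www.example.com", "text", timestamp=1))
```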
BigTable
• BigTable can have a huge number of columns
per row
– all rows have the same column groups
– not all rows have the same columns
– important for reducing disk reads to access
document data
• Rows are partitioned into tablets based on
their row keys
– simplifies determining which server is appropriate
Detecting Duplicates
• Duplicate and near-duplicate documents
occur in many situations
– Copies, versions, plagiarism, spam, mirror sites
– 30% of the web pages in a large crawl are exact or
near duplicates of pages in the other 70%
• Duplicates consume significant resources
during crawling, indexing, and search
– Little value to most users
Duplicate Detection
• Exact duplicate detection is relatively easy
• Checksum techniques
– A checksum is a value that is computed based on the
content of the document
• e.g., sum of the bytes in the document file
– Possible for files with different text to have same
checksum
• Functions such as a cyclic redundancy check
(CRC), have been developed that consider the
positions of the bytes
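The weakness of a plain byte sum, and the way a CRC avoids it, can be shown with two short strings that contain the same bytes in a different order. The helper name `byte_sum_checksum` and the sample strings are choices made for this sketch; `zlib.crc32` is Python's standard CRC-32 function.

```python
import zlib


def byte_sum_checksum(data: bytes) -> int:
    """Naive checksum: the sum of all byte values, ignoring their positions."""
    return sum(data) % 2**32


a = b"stop"
b = b"pots"   # same bytes as a, different order

print("byte sums:", byte_sum_checksum(a), byte_sum_checksum(b))
print("CRC-32:   ", zlib.crc32(a), zlib.crc32(b))
```

The byte sums are identical because addition ignores order, while the CRC values differ because the CRC depends on where each byte appears.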
Near-Duplicate Detection
• More challenging task
– Are web pages with same text content but
different advertising or format near-duplicates?
• A near-duplicate document is defined using a
threshold value for some similarity measure
between pairs of documents
– e.g., document D1 is a near-duplicate of
document D2 if more than 90% of the words in
the documents are the same
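The threshold idea can be sketched with a simple word-overlap measure. The measure used here (shared distinct words divided by all distinct words) is only one possible reading of "words that are the same", and the names `word_overlap` and `is_near_duplicate` are introduced for this example.

```python
def word_overlap(d1: str, d2: str) -> float:
    """Fraction of distinct words shared by two documents (Jaccard overlap)."""
    w1, w2 = set(d1.lower().split()), set(d2.lower().split())
    if not w1 and not w2:
        return 1.0  # two empty documents are treated as identical
    return len(w1 & w2) / len(w1 | w2)


def is_near_duplicate(d1: str, d2: str, threshold: float = 0.9) -> bool:
    return word_overlap(d1, d2) > threshold


# Hypothetical documents: 20 distinct words, then one word replaced.
doc1 = " ".join(f"w{i}" for i in range(20))
doc2 = " ".join([f"w{i}" for i in range(19)] + ["changed"])
doc3 = "completely different text about tropical fish"

print(word_overlap(doc1, doc2), is_near_duplicate(doc1, doc2))
print(word_overlap(doc1, doc3), is_near_duplicate(doc1, doc3))
```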
Near-Duplicate Detection
• Search:
– find near-duplicates of a document D
– O(N) comparisons required
• Discovery:
– find all pairs of near-duplicate documents in the
collection
– O(N²) comparisons
• IR techniques are effective for search scenario
• For discovery, other techniques used to
generate compact representations
Assignment-2
Building a simple web crawler
• Assignment-2 Web crawler description

  • 2. Assignment (Due: Wednesday 18th Feb) • Task 1: Find the names of at least 10 unconventional search engines (not including Google, Yahoo, MSN, Bing) • Think of at least 3 random search queries – Political – Sports – Technology Search the same query string on at least 5 search engines and analyze the results. Submission: 1. Query results snapshot 2. Your analysis of which search engine generated better results. 3. Describe how different the results from each search engine are (number of shared websites, blogs, forums, etc.) for each query category. 4. Which search engine, in your perception, generated results more relevant to your requirements (i.e. what you needed, results based on your region/country)? 5. Try the same query on a friend's PC and analyze the results. Do you get the same results? Is there any difference in the results? Maximum in Group (2)
  • 3. Web Crawler • Finds and downloads web pages automatically – provides the collection for searching • Web is huge and constantly growing • Web is not under the control of search engine providers • Web pages are constantly changing • Crawlers also used for other types of data
  • 4. Retrieving Web Pages • Every page has a unique uniform resource locator (URL) • Web pages are stored on web servers that use HTTP to exchange information with client software • e.g.,
  • 5. Retrieving Web Pages • Web crawler client program connects to a domain name system (DNS) server • DNS server translates the hostname into an internet protocol (IP) address • Crawler then attempts to connect to server host using specific port • After connection, crawler sends an HTTP request to the web server to request a page – usually a GET request
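The retrieval steps above can be sketched in Python. This is a minimal illustration, not a full HTTP client: the function name `build_get_request` and the example URL are assumptions, and the DNS lookup and socket connection are only described in comments so that the sketch runs without network access.

```python
from urllib.parse import urlsplit

def build_get_request(url: str) -> tuple[str, int, bytes]:
    """Return (hostname, port, raw request bytes) for a simple HTTP/1.1 GET."""
    parts = urlsplit(url)
    host = parts.hostname
    # Default ports: 80 for http, 443 for https, unless the URL names one.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n"
        "\r\n"
    )
    return host, port, request.encode("ascii")

host, port, raw = build_get_request("http://www.example.com/index.html")
# A real crawler would next call socket.getaddrinfo(host, port) for the DNS
# lookup, open a TCP connection to the returned IP address, and send `raw`.
```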
  • 7. Web Crawler • Starts with a set of seeds, which are a set of URLs given to it as parameters • Seeds are added to a URL request queue • Crawler starts fetching pages from the request queue • Downloaded pages are parsed to find link tags that might contain other useful URLs to fetch • New URLs added to the crawler’s request queue, or frontier • Continue until no more new URLs or disk full
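The seed, queue, parse, and enqueue loop described above might look like the following sketch. `FAKE_WEB` is a hypothetical in-memory stand-in for real HTTP fetches, and the `max_pages` limit plays the role of "disk full"; both are assumptions for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web" standing in for real HTTP fetches.
FAKE_WEB = {
    "http://a.example/": '<a href="/one">1</a> <a href="http://b.example/">b</a>',
    "http://a.example/one": '<a href="/">home</a>',
    "http://b.example/": "no links here",
}

class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, fetch, max_pages=100):
    frontier = deque(seeds)   # the URL request queue
    seen = set(seeds)         # URLs already queued, to avoid refetching
    store = {}                # downloaded pages, keyed by URL
    while frontier and len(store) < max_pages:
        url = frontier.popleft()
        page = fetch(url)
        if page is None:      # fetch failed or URL unknown
            continue
        store[url] = page
        parser = LinkParser()
        parser.feed(page)
        for href in parser.links:
            link = urljoin(url, href)   # resolve relative links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return store

pages = crawl(["http://a.example/"], FAKE_WEB.get)
```

Using a FIFO queue gives a breadth-first crawl; a production frontier would typically also prioritize URLs and apply the politeness rules discussed on the following slides.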
  • 8. Web Crawling • Web crawlers spend a lot of time waiting for responses to requests • To reduce this inefficiency, web crawlers use threads and fetch hundreds of pages at once • Crawlers could potentially flood sites with requests for pages • To avoid this problem, web crawlers use politeness policies – e.g., delay between requests to same web server
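One way to enforce a per-server delay is to track, for each host, the earliest time the next request is allowed. The class name `PolitenessGate` and the injectable `clock` parameter are assumptions made for this sketch; the threads that do the actual fetching are left out.

```python
import time
from urllib.parse import urlsplit

class PolitenessGate:
    def __init__(self, delay_seconds, clock=time.monotonic):
        self.delay = delay_seconds
        self.clock = clock
        self.next_allowed = {}  # host -> earliest time of the next request

    def wait_time(self, url):
        """Seconds to wait before `url` may be fetched; reserves that slot."""
        host = urlsplit(url).hostname
        now = self.clock()
        ready_at = max(now, self.next_allowed.get(host, now))
        self.next_allowed[host] = ready_at + self.delay
        return ready_at - now

gate = PolitenessGate(delay_seconds=5.0)
pause = gate.wait_time("http://www.example.com/a.html")
# A fetching thread would call time.sleep(pause) before issuing its request.
```

Requests to different hosts are not delayed by each other, so many threads can stay busy while each individual server still sees widely spaced requests.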
  • 9. Controlling Crawling • Even crawling a site slowly will anger some web server administrators, who object to any copying of their data • Robots.txt file can be used to control crawlers
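Python's standard library includes `urllib.robotparser` for reading these files. The robots.txt content and the crawler names below are made up for illustration; a real crawler would download the file from the site's `/robots.txt` path instead of parsing a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: everyone is kept out of /private/,
# except a crawler named FavoredCrawler, which may fetch anything.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: FavoredCrawler
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

public_ok = parser.can_fetch("ExampleBot", "http://www.example.com/index.html")
private_ok = parser.can_fetch("ExampleBot", "http://www.example.com/private/data.html")
favored_ok = parser.can_fetch("FavoredCrawler", "http://www.example.com/private/data.html")
```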
  • 11. Freshness • Web pages are constantly being added, deleted, and modified • Web crawler must continually revisit pages it has already crawled to see if they have changed in order to maintain the freshness of the document collection – stale copies no longer reflect the real contents of the web pages
  • 12. Freshness • HTTP protocol has a special request type called HEAD that makes it easy to check for page changes – returns information about page, not page itself
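A crawler could use the headers returned by a HEAD request roughly as follows. This is a sketch under assumptions: it relies only on the `Last-Modified` header (which servers are not required to send), looks the header up by its exact capitalization, and the helper names are hypothetical. `fetch_headers` needs network access and is not called here.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from urllib.request import Request, urlopen

def fetch_headers(url, timeout=10):
    """Send a HEAD request and return the response headers (needs network)."""
    with urlopen(Request(url, method="HEAD"), timeout=timeout) as response:
        return dict(response.headers.items())

def changed_since(headers, last_crawl):
    """True if Last-Modified is newer than last_crawl, or the header is missing."""
    value = headers.get("Last-Modified")
    if value is None:
        return True  # no information, so assume the page may have changed
    return parsedate_to_datetime(value) > last_crawl

last_crawl = datetime(2015, 2, 11, tzinfo=timezone.utc)
newer = changed_since({"Last-Modified": "Wed, 18 Feb 2015 10:00:00 GMT"}, last_crawl)
older = changed_since({"Last-Modified": "Sun, 01 Feb 2015 10:00:00 GMT"}, last_crawl)
```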
  • 13. Freshness • Not possible to constantly check all pages – must check important pages and pages that change frequently • Freshness is the proportion of pages that are fresh • Optimizing for this metric can lead to bad decisions, such as not crawling popular sites • Age is a better metric
  • 15. Age • Expected age of a page t days after it was last crawled: • Web page updates follow the Poisson distribution on average – time until the next update is governed by an exponential distribution
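The formula on this slide did not survive extraction. Assuming, as the slide states, that the time until the next update is exponentially distributed with rate λ (so the probability density of a change at time x is λe^{−λx}), one consistent way to write the expected age is shown below. The notation Age(λ, t) is an assumption here.

```latex
\[
\mathrm{Age}(\lambda, t)
  = \int_0^t \lambda e^{-\lambda x}\,(t - x)\,dx
  = t - \frac{1 - e^{-\lambda t}}{\lambda}
\]
```

The integrand multiplies the density of a change at time x by the age (t − x) the stored copy would then have at time t.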
  • 16. Age • Older a page gets, the more it costs not to crawl it – e.g., expected age with mean change frequency λ = 1/7 (one change per week)
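To make the example concrete, the sketch below evaluates the expected age for λ = 1/7, assuming an exponential time-to-update so that the expected age is the integral of λe^{−λx}(t − x) over x from 0 to t. The function names are hypothetical; the numeric version exists only to cross-check the closed form.

```python
import math

def expected_age(lam, t):
    """Closed form of the integral of lam*exp(-lam*x)*(t - x) over x in [0, t]."""
    return t - (1.0 - math.exp(-lam * t)) / lam

def expected_age_numeric(lam, t, steps=100_000):
    """Midpoint-rule approximation of the same integral."""
    dx = t / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * dx
        total += lam * math.exp(-lam * x) * (t - x) * dx
    return total

lam = 1 / 7  # one change per week, as on the slide
for days in (1, 7, 14, 28):
    print(days, round(expected_age(lam, days), 2))
```

At t = 7 days the closed form reduces to 7/e, about 2.6 days, and the value keeps growing faster as t increases, which is why the cost of not crawling a page rises with time.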
  • 17. Focused Crawling • Attempts to download only those pages that are about a particular topic – used by vertical search applications • Rely on the fact that pages about a topic tend to have links to other pages on the same topic – popular pages for a topic are typically used as seeds • Crawler uses text classifier to decide whether a page is on topic
  • 18. Deep Web • Sites that are difficult for a crawler to find are collectively referred to as the deep (or hidden) Web – much larger than conventional Web • Three broad categories: – private sites • no incoming links, or may require log in with a valid account – form results • sites that can be reached only after entering some data into a form – scripted pages • pages that use JavaScript, Flash, or another client-side language to generate links
  • 19. Sitemaps • Sitemaps contain lists of URLs and data about those URLs, such as modification time and modification frequency • Generated by web server administrators • Tells crawler about pages it might not otherwise find • Gives crawler a hint about when to check a page for changes
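A sitemap is an XML file, so a crawler can read it with a standard XML parser. The sitemap content below is invented for illustration; the element names and namespace follow the sitemaps.org format.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap with two URLs.
SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2015-02-11</lastmod>
    <changefreq>daily</changefreq>
  </url>
  <url>
    <loc>http://www.example.com/about.html</loc>
    <changefreq>monthly</changefreq>
  </url>
</urlset>
"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(SITEMAP_XML)

# (URL, last modification date or None, change frequency or None)
entries = [
    (
        url.findtext("sm:loc", namespaces=NS),
        url.findtext("sm:lastmod", namespaces=NS),
        url.findtext("sm:changefreq", namespaces=NS),
    )
    for url in root.findall("sm:url", NS)
]
```

The `lastmod` and `changefreq` values are hints from the site administrator; a crawler can use them to schedule revisits but should not treat them as guarantees.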
  • 21. Distributed Crawling • Three reasons to use multiple computers for crawling – Helps to put the crawler closer to the sites it crawls – Reduces the number of sites the crawler has to remember – Reduces computing resources required • Distributed crawler uses a hash function to assign URLs to crawling computers – hash function should be computed on the host part of each URL
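Hashing only the host part keeps every URL from one site on the same crawling computer, so that machine alone is responsible for politeness toward that site. The sketch below uses SHA-256 purely as a stable hash; the function name and the choice of hash are assumptions.

```python
import hashlib
from urllib.parse import urlsplit

def assign_machine(url, num_machines):
    """Map a URL to a machine index using a hash of its host name only."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_machines

m1 = assign_machine("http://www.example.com/a.html", 4)
m2 = assign_machine("http://www.example.com/deep/b.html", 4)
```

A built-in `hash()` call would not work well here, because Python randomizes string hashes per process, and all crawling machines need to agree on the assignment.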
  • 22. Desktop Crawls • Used for desktop search and enterprise search • Differences to web crawling: – Much easier to find the data – Responding quickly to updates is more important – Must be conservative in terms of disk and CPU usage – Many different document formats – Data privacy very important
  • 23. Document Feeds • Many documents are published – created at a fixed time and rarely updated again – e.g., news articles, blog posts, press releases, email • Published documents from a single source can be ordered in a sequence called a document feed – new documents found by examining the end of the feed
  • 24. Document Feeds • Two types: – A push feed alerts the subscriber to new documents – A pull feed requires the subscriber to check periodically for new documents • Most common format for pull feeds is called RSS – Really Simple Syndication, RDF Site Summary, Rich Site Summary, or ...
  • 25. RSS (Rich Site Summary) • RSS (Rich Site Summary); originally RDF Site Summary; often called Really Simple Syndication, uses a family of standard web feed formats to publish frequently updated information: blog entries, news headlines, audio, video. • An RSS document (called "feed", "web feed", or "channel") includes full or summarized text, and metadata, like publishing date and author's name. • RSS feeds enable publishers to syndicate data automatically. A standard XML file format ensures compatibility with many different machines/programs. RSS feeds also benefit users who want to receive timely updates from favourite websites or to aggregate data from many sites.
  • 28. RSS • ttl tag (time to live) – amount of time (in minutes) contents should be cached • RSS feeds are accessed like web pages – using HTTP GET requests to web servers that host them • Easy for crawlers to parse • Easy to find new information
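Because an RSS 2.0 feed is plain XML, reading the `ttl` value and the items takes only a few lines. The feed below is a made-up example; a crawler would obtain the real document with an HTTP GET request.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed for illustration.
RSS_XML = """\
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>http://www.example.com/</link>
    <description>Hypothetical feed for illustration</description>
    <ttl>60</ttl>
    <item>
      <title>First story</title>
      <link>http://www.example.com/story1.html</link>
      <pubDate>Wed, 11 Feb 2015 09:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second story</title>
      <link>http://www.example.com/story2.html</link>
      <pubDate>Wed, 11 Feb 2015 12:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>
"""

channel = ET.fromstring(RSS_XML).find("channel")

# ttl is given in minutes; treat a missing tag as "no caching hint".
ttl_minutes = int(channel.findtext("ttl", default="0"))

items = [
    (item.findtext("title"), item.findtext("link"))
    for item in channel.findall("item")
]
```

A crawler could wait at least `ttl_minutes` before fetching the feed again and then compare the item links against those it has already seen.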
  • 29. Conversion • Text is stored in hundreds of incompatible file formats – e.g., raw text, RTF, HTML, XML, Microsoft Word, ODF, PDF • Other types of files also important – e.g., PowerPoint, Excel • Typically use a conversion tool – converts the document content into a tagged text format such as HTML or XML – retains some of the important formatting information
  • 30. Character Encoding • A character encoding is a mapping between bits and glyphs – i.e., getting from bits in a file to characters on a screen – Can be a major source of incompatibility • ASCII is basic character encoding scheme for English – encodes 128 letters, numbers, special characters, and control characters in 7 bits, extended with an extra bit for storage in bytes
  • 31. Character Encoding • Other languages can have many more glyphs – e.g., Chinese has more than 40,000 characters, with over 3,000 in common use • Many languages have multiple encoding schemes – e.g., CJK (Chinese-Japanese-Korean) family of East Asian languages, Hindi, Arabic – must specify encoding – can’t have multiple languages in one file • Unicode developed to address encoding problems
  • 32. Unicode • Single mapping from numbers to glyphs that attempts to include all glyphs in common use in all known languages • Unicode is a mapping between numbers and glyphs – does not uniquely specify bits to glyph mapping! – e.g., UTF-8, UTF-16, UTF-32
  • 33. Unicode • Proliferation of encodings comes from a need for compatibility and to save space – UTF-8 uses one byte for English (ASCII), as many as 4 bytes for some traditional Chinese characters – variable length encoding, more difficult to do string operations – UTF-32 uses 4 bytes for every character • Many applications use UTF-32 for internal text encoding (fast random lookup) and UTF-8 for disk storage (less space)
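The size differences between the encodings can be checked directly in Python. The sample characters are chosen for illustration; the `-le` codec variants are used so that no byte-order mark is added to the counts.

```python
def encoded_sizes(ch):
    """Bytes needed for one character in UTF-8, UTF-16, and UTF-32."""
    return tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le"))

samples = {
    "Latin A": "A",                              # ASCII range
    "e with acute": "\u00e9",                    # Latin-1 range
    "CJK U+4E2D": "\u4e2d",                      # common Chinese character
    "CJK Extension B U+20000": "\U00020000",     # rare character outside the BMP
}

for name, ch in samples.items():
    print(name, encoded_sizes(ch))
```

UTF-32 always uses 4 bytes, which makes indexing by character position simple, while UTF-8 ranges from 1 to 4 bytes depending on the character.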
  • 35. Storing the Documents • Many reasons to store converted document text – saves crawling time when page is not updated – provides efficient access to text for snippet generation, information extraction, etc. • Database systems can provide document storage for some applications – web search engines use customized document storage systems
  • 36. Storing the Documents • Requirements for document storage system: – Random access • request the content of a document based on its URL • hash function based on URL is typical – Compression and large files • reducing storage requirements and efficient access – Update • handling large volumes of new and modified documents • adding new anchor text
  • 37. Large Files • Store many documents in large files, rather than each document in a file – avoids overhead in opening and closing files – reduces seek time relative to read time • Compound documents formats – used to store multiple documents in a file – e.g., TREC Web
  • 38. Compression • Text is highly redundant (or predictable) • Compression techniques exploit this redundancy to make files smaller without losing any of the content • Compression of indexes covered later • Popular algorithms can compress HTML and XML text by 80% – e.g., DEFLATE (zip, gzip) and LZW (UNIX compress, PDF) – may compress large files in blocks to make access faster
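Python's `zlib` module implements DEFLATE, so the effect is easy to try. The HTML below is artificially repetitive, so the saving it shows is much larger than what real pages would give; it only illustrates that compression is lossless and exploits redundancy.

```python
import zlib

# Artificial, highly redundant HTML; real pages compress less dramatically.
html = (
    "<html><body>"
    + "<p>Web crawlers find and download web pages.</p>" * 200
    + "</body></html>"
).encode("utf-8")

compressed = zlib.compress(html)
restored = zlib.decompress(compressed)

saving = 1 - len(compressed) / len(html)
print(f"{len(html)} bytes -> {len(compressed)} bytes ({saving:.0%} smaller)")
```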
  • 39. BigTable • Google’s document storage system – Customized for storing, finding, and updating web pages – Handles large collection sizes using inexpensive computers
  • 40. BigTable • No query language, no complex queries to optimize • Only row-level transactions • Tablets are stored in a replicated file system that is accessible by all BigTable servers • Any changes to a BigTable tablet are recorded to a transaction log, which is also stored in a shared file system • If any tablet server crashes, another server can immediately read the tablet data and transaction log from the file system and take over
  • 41. • Most relational databases store their data in files that are constantly modified. In contrast, BigTable stores its data in immutable (unchangeable) files. • Once data is written to a BigTable file, it is never changed. • This helps in failure recovery.
  • 42. BigTable • Logically organized into rows • A row stores data for a single web page • The URL, www.example.com, is the row key, which can be used to find this row. • Combination of a row key, a column key, and a timestamp point to a single cell in the row
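The addressing scheme (row key, column key, timestamp) can be mimicked with nested dictionaries. This is a toy in-memory model, not Google's API: the class `TinyTable`, its methods, and the column names are all assumptions for illustration.

```python
class TinyTable:
    """Toy model of BigTable-style cell addressing."""

    def __init__(self):
        # row key -> {column key -> {timestamp -> value}}
        self.rows = {}

    def put(self, row, column, timestamp, value):
        self.rows.setdefault(row, {}).setdefault(column, {})[timestamp] = value

    def get(self, row, column, timestamp=None):
        versions = self.rows[row][column]
        if timestamp is None:
            timestamp = max(versions)  # default to the newest version
        return versions[timestamp]

table = TinyTable()
table.put("www.example.com", "text", 1, "<html>old page</html>")
table.put("www.example.com", "text", 2, "<html>new page</html>")
table.put("www.example.com", "title", 2, "Example")

latest = table.get("www.example.com", "text")
earlier = table.get("www.example.com", "text", timestamp=1)
```

Keeping several timestamped versions per cell lets a crawler store a new copy of a page without overwriting the previous one.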
  • 43. BigTable • BigTable can have a huge number of columns per row – all rows have the same column groups – not all rows have the same columns – important for reducing disk reads to access document data • Rows are partitioned into tablets based on their row keys – simplifies determining which server is appropriate
  • 44. Detecting Duplicates • Duplicate and near-duplicate documents occur in many situations – Copies, versions, plagiarism, spam, mirror sites – 30% of the web pages in a large crawl are exact or near duplicates of pages in the other 70% • Duplicates consume significant resources during crawling, indexing, and search – Little value to most users
  • 45. Duplicate Detection • Exact duplicate detection is relatively easy • Checksum techniques – A checksum is a value that is computed based on the content of the document • e.g., sum of the bytes in the document file – Possible for files with different text to have same checksum • Functions such as a cyclic redundancy check (CRC), have been developed that consider the positions of the bytes
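The weakness of a simple byte sum, and how a CRC avoids it, can be seen with two texts that contain the same bytes in a different order. The helper `byte_sum` and the sample strings are made up for this sketch; `zlib.crc32` is the standard library's CRC-32.

```python
import zlib

def byte_sum(data: bytes) -> int:
    """Naive checksum: sum of all bytes, ignoring their positions."""
    return sum(data) % 2**32

doc_a = b"crawl the web"
doc_b = b"crawl the wbe"   # same bytes, last two letters swapped

same_sum = byte_sum(doc_a) == byte_sum(doc_b)
same_crc = zlib.crc32(doc_a) == zlib.crc32(doc_b)
```

Here the byte sums match even though the texts differ, while the CRC values differ because the CRC depends on where each byte occurs.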
  • 46. Near-Duplicate Detection • More challenging task – Are web pages with the same text content but different advertising or format near-duplicates? • A near-duplicate document is defined using a threshold value for some similarity measure between pairs of documents – e.g., document D1 is a near-duplicate of document D2 if more than 90% of the words in the documents are the same
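One possible reading of the 90% rule is the share of distinct words two documents have in common (Jaccard similarity over word sets). That choice of measure, the function names, and the sample documents are assumptions for this sketch; other similarity measures would fit the definition equally well.

```python
def word_overlap(doc1: str, doc2: str) -> float:
    """Jaccard similarity of the two documents' sets of lowercase words."""
    words1 = set(doc1.lower().split())
    words2 = set(doc2.lower().split())
    if not words1 and not words2:
        return 1.0  # two empty documents are treated as identical
    return len(words1 & words2) / len(words1 | words2)

def is_near_duplicate(doc1: str, doc2: str, threshold: float = 0.9) -> bool:
    return word_overlap(doc1, doc2) > threshold

# Two 20-word documents that differ in a single word.
d1 = " ".join(f"w{i}" for i in range(20))
d2 = " ".join(f"w{i}" for i in range(19)) + " changed"

similarity = word_overlap(d1, d2)   # 19 shared words out of 21 distinct words
```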
  • 47. Near-Duplicate Detection • Search: – find near-duplicates of a document D – O(N) comparisons required • Discovery: – find all pairs of near-duplicate documents in the collection – O(N²) comparisons • IR techniques are effective for search scenario • For discovery, other techniques used to generate compact representations
  • 48. Assignment-2 Building a simple web crawler • Assignment-2 Web crawler description