Web Search Overview & Crawling
By
SATHISHKUMAR G
(sathishsak111@gmail.com)
Web Search and Mining
[Figure: search results page with the algorithmic results and the paid search ads labeled]
Search and Information Retrieval
 Search on the Web is a daily activity for many people
throughout the world
 Search and communication are the most popular uses of
the computer
 Applications involving search are everywhere
 The field of computer science that is most involved
with R&D for search is information retrieval (IR)
Search & IR
Information Retrieval
 “Information retrieval is a field concerned with the
structure, analysis, organization, storage, searching,
and retrieval of information.” (Salton, 1968)
 General definition that can be applied to many types
of information and search applications
 Primary focus of IR since the 50s has been on text
and documents
IR
What is a Document?
 Examples:
 web pages, email, books, news stories, scholarly papers,
text messages, Word™, PowerPoint™, PDF, forum postings,
patents, etc.
 Common properties
 Significant text content
 Some structure (e.g., title, author, date for papers;
subject, sender, destination for email)
Documents vs. Database Records
 Database records (or tuples in relational databases)
are typically made up of well-defined fields (or
attributes)
 e.g., bank records with account numbers, balances,
names, addresses, social security numbers, dates of birth,
etc.
 Easy to compare fields with well-defined semantics
to queries in order to find matches
 Text is more difficult
Documents vs. Records
 Example bank database query
 Find records with balance > $50,000 in branches located in
Amherst, MA.
 Matches easily found by comparison with field values of
records
 Example search engine query
 bank scandals in western mass
 This text must be compared to the text of entire news
stories
Comparing Text
 Comparing the query text to the document text and
determining what is a good match is the core issue of
information retrieval
 Exact matching of words is not enough
 Many different ways to write the same thing in a “natural
language” like English
 e.g., does a news story containing the text “bank director
in Amherst steals funds” match the query?
 Some stories will be better matches than others
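As a rough illustration of why exact word matching is not enough, the sketch below compares the example query with the example story text from these slides. The function name `exact_overlap` is hypothetical, and whitespace splitting is a simplifying assumption, not how a real engine tokenizes.

```python
def exact_overlap(query: str, document: str) -> set[str]:
    """Return the words that appear verbatim in both query and document."""
    query_words = set(query.lower().split())
    document_words = set(document.lower().split())
    return query_words & document_words


query = "bank scandals in western mass"
story = "bank director in Amherst steals funds"

shared = exact_overlap(query, story)
print(sorted(shared))
```

Only "bank" and "in" match exactly, even though a reader would probably judge the story relevant: "scandals" never lines up with "steals funds", and "western mass" never lines up with "Amherst".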
Dimensions of IR
 IR is more than just text, and more than just web
search
 although these are central
 People doing IR work with different media, different
types of search applications, and different tasks
Other Media
 New applications increasingly involve new media
 e.g., video, photos, music, speech
 Like text, content is difficult to describe and compare
 text may be used to represent them (e.g. tags)
 IR approaches to search and evaluation are
appropriate
Dimensions of IR
Content        Applications        Tasks
Text           Web search          Ad hoc search
Images         Vertical search     Filtering
Video          Enterprise search   Classification
Scanned docs   Desktop search      Question answering
Audio          Forum search
Music          P2P search
               Literature search
IR Tasks
 Ad-hoc search
 Find relevant documents for an arbitrary text query
 Filtering
 Identify relevant user profiles for a new document
 Classification
 Identify relevant labels for documents
 Question answering
 Give a specific answer to a question
Big Issues in IR
 Relevance
 What is it?
 Simple (and simplistic) definition:
A relevant document contains the information that a
person was looking for when they submitted a query to
the search engine
 Many factors influence a person’s decision about what is
relevant: e.g., task, context, novelty, style
 Topical relevance (same topic) vs. user relevance
(everything else)
Big Issues in IR
 Relevance
 Retrieval models define a view of relevance
 Ranking algorithms used in search engines are based on
retrieval models
 Most models describe statistical properties of text rather
than linguistic
 i.e., counting simple text features such as words instead of parsing
and analyzing the sentences
 Statistical approach to text processing started with Luhn in the 50s
 Linguistic features can be part of a statistical model
Big Issues in IR
 Evaluation
 Experimental procedures and measures for comparing
system output with user expectations
 Originated in Cranfield experiments in the 60s
 Typically use test collection of documents, queries, and
relevance judgments
 Most commonly used are TREC collections
 Recall and precision are two examples of effectiveness
measures
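Precision and recall can be computed directly from a ranked result list and a set of relevance judgments. The sketch below is a minimal version with made-up document IDs; `precision_recall` is a hypothetical helper name.

```python
def precision_recall(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    """Precision = relevant retrieved / retrieved; recall = relevant retrieved / relevant."""
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Illustrative data: the system returned four documents, three are judged relevant.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d5"}

precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall 2/3).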
Big Issues in IR
 Users and Information Needs
 Search evaluation is user-centered
 Keyword queries are often poor descriptions of actual
information needs
 Interaction and context are important for understanding
user intent
 Query refinement techniques such as query expansion,
query suggestion, relevance feedback improve ranking
IR and Search Engines
 A search engine is the practical application of
information retrieval techniques to large scale text
collections
 Web search engines are the best-known examples, but there are
many others
 Open source search engines are important for research
and development
 e.g., Lucene, Lemur/Indri, Galago
 Big issues include main IR issues but also some
others
IR and Search Engines
Information Retrieval
Relevance
- Effective ranking
Evaluation
- Testing and measuring
Information needs
- User interaction
Search Engines
Performance
- Efficient search and indexing
Incorporating new data
- Coverage and freshness
Scalability
- Growing with data and users
Adaptability
- Tuning for applications
Specific problems
- e.g., spam
Search Engine Issues
 Performance
 Measuring and improving the efficiency of search
 e.g., reducing response time, increasing query throughput,
increasing indexing speed
 Indexes are data structures designed to improve search
efficiency
 designing and implementing them are major issues for search
engines
Search Engine
Search Engine Issues
 Dynamic data (Incorporating new data)
 The “collection” for most real applications is constantly
changing in terms of updates, additions, deletions
 e.g., web pages
 Acquiring or “crawling” the documents is a major task
 Typical measures are coverage (how much has been indexed) and
freshness (how recently was it indexed)
 Updating the indexes while processing queries is also a
design issue
Search Engine Issues
 Scalability
 Making everything work with millions of users every day,
and many terabytes of documents
 Distributed processing is essential
 Adaptability
 Changing and tuning search engine components such as
ranking algorithm, indexing strategy, interface for
different applications
Search Engine Issues
 Spam
 For Web search, spam in all its forms is one of the major
issues
 Affects the efficiency of search engines and, more
seriously, the effectiveness of the results
 Many types of spam
 e.g. spamdexing or term spam, link spam, “optimization”
 New subfield called adversarial IR, since spammers are
“adversaries” with different goals
Architecture of SE
How do search engines like Google work?
Architecture
[Figure: search engine architecture with the components The Web, Web spider, Indexer, Indexes, Ad indexes, Search, and User, drawn around a sample results page for the query "miele" that shows web results and sponsored links]
Indexing Process
 Text acquisition
 identifies and stores documents for indexing
 Text transformation
 transforms documents into index terms or features
 Index creation
 takes index terms and creates data structures (indexes) to
support fast searching
Query Process
 User interaction
 supports creation and refinement of query, display of
results
 Ranking
 uses query and indexes to generate ranked list of
documents
 Evaluation
 monitors and measures effectiveness and efficiency
(primarily offline)
Details: Text Acquisition
 Crawler
 Identifies and acquires documents for search engine
 Many types – web, enterprise, desktop
 Web crawlers follow links to find documents
 Must efficiently find huge numbers of web pages (coverage) and
keep them up-to-date (freshness)
 Single site crawlers for site search
 Topical or focused crawlers for vertical search
 Document crawlers for enterprise and desktop search
 Follow links and scan directories
Indexing Process
Web Crawler
 Starts with a set of seeds, which are a set of URLs given to it
as parameters
 Seeds are added to a URL request queue
 Crawler starts fetching pages from the request queue
 Downloaded pages are parsed to find link tags that might
contain other useful URLs to fetch
 New URLs added to the crawler’s request queue, or frontier
 Continues until there are no more new URLs or the disk is full
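The crawl loop described above can be sketched as follows. To keep the example self-contained, it crawls a small in-memory "web" (`FAKE_WEB`, a hypothetical stand-in for HTTP fetches), and a `max_pages` limit stands in for the "disk full" stopping condition. All URLs and names here are assumptions made for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web": URL -> HTML. A real crawler would fetch over HTTP.
FAKE_WEB = {
    "http://example.org/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.org/a": '<a href="/b">B</a> <a href="/c">C</a>',
    "http://example.org/b": '<a href="/">home</a>',
    "http://example.org/c": "no links here",
}


class LinkParser(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds: list[str], max_pages: int = 100) -> list[str]:
    frontier = deque(seeds)  # URL request queue (the frontier)
    seen = set(seeds)        # URLs already queued, so none is added twice
    crawled: list[str] = []
    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()
        html = FAKE_WEB.get(url)  # stand-in for downloading the page
        if html is None:
            continue
        crawled.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            new_url = urljoin(url, href)  # resolve relative links
            if new_url not in seen:
                seen.add(new_url)
                frontier.append(new_url)
    return crawled


order = crawl(["http://example.org/"])
print(order)
```

A production crawler would also need politeness delays, robots.txt handling, and duplicate detection, none of which is shown here.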
Crawling picture
[Figure: the Web divided into seed pages, URLs crawled and parsed, the URL frontier, and the unseen Web]
Crawling the Web
Text Acquisition
 Feeds
 Real-time streams of documents
 e.g., web feeds for news, blogs, video, radio, tv
 RSS is common standard
 RSS “reader” can provide new XML documents to search engine
 Conversion
 Convert variety of documents into a consistent text plus
metadata format
 e.g. HTML, XML, Word, PDF, etc. → XML
 Convert text encoding for different languages
 Using a Unicode encoding such as UTF-8
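The encoding-conversion step can be illustrated with a minimal sketch that re-encodes Latin-1 bytes as UTF-8. The helper name `to_utf8` is hypothetical, and the source encoding is assumed to be known in advance; in practice it often has to be detected.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a known source encoding and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


# "Ö" takes one byte in Latin-1 and two bytes in UTF-8.
latin1_bytes = "Miele Österreich".encode("latin-1")
utf8_bytes = to_utf8(latin1_bytes, "latin-1")
print(len(latin1_bytes), len(utf8_bytes))
```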
Text Acquisition
 Document data store
 Stores text, metadata, and other related content for
documents
 Metadata is information about document such as type and
creation date
 Other content includes links, anchor text
 Provides fast access to document contents for search
engine components
 e.g. result list generation
 Could use relational database system
 More typically, a simpler, more efficient storage system is used
due to huge numbers of documents
Text Transformation
 Parser
 Processing the sequence of text tokens in the document to
recognize structural elements
 e.g., titles, links, headings, etc.
 Tokenizer recognizes “words” in the text
 must consider issues like capitalization, hyphens, apostrophes,
non-alpha characters, separators
 Markup languages such as HTML, XML often used to specify
structure
 Tags used to specify document elements
 E.g., <h2> Overview </h2>
 Document parser uses syntax of markup language (or other
formatting) to identify structure
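A small sketch of the parser and tokenizer working together: the parser uses HTML tags to separate heading text from body text, and the tokenizer then turns each piece into lowercase "words". The class name `HeadingParser` and the tokenization rule (keep letters, digits, and apostrophes; split on everything else, including hyphens) are assumptions chosen for illustration.

```python
import re
from html.parser import HTMLParser


class HeadingParser(HTMLParser):
    """Separates heading text from body text in an HTML fragment."""

    HEADING_TAGS = {"h1", "h2", "h3"}

    def __init__(self) -> None:
        super().__init__()
        self.in_heading = False
        self.headings: list[str] = []
        self.body: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS:
            self.in_heading = True

    def handle_endtag(self, tag):
        if tag in self.HEADING_TAGS:
            self.in_heading = False

    def handle_data(self, data):
        text = data.strip()
        if text:
            (self.headings if self.in_heading else self.body).append(text)


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


parser = HeadingParser()
parser.feed("<h2> Overview </h2><p>Search engines don't index state-of-the-art pages instantly.</p>")
heading_tokens = tokenize(" ".join(parser.headings))
body_tokens = tokenize(" ".join(parser.body))
print(heading_tokens, body_tokens)
```

Note the design choices hidden in the tokenizer: "don't" stays one token, while "state-of-the-art" becomes four. Different choices for capitalization, hyphens, and apostrophes change what queries can match.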
Text Transformation
 Stopping
 Remove common words
 e.g., “and”, “or”, “the”, “in”
 Some impact on efficiency and effectiveness
 Can be a problem for some queries
 Stemming
 Group words derived from a common stem
 e.g., “computer”, “computers”, “computing”, “compute”
 Usually effective, but not for all queries
 Benefits vary for different languages
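Stopping and stemming can be sketched in a few lines. The stopword list below contains only the four examples from this slide, and `naive_stem` is a deliberately simplistic suffix stripper written for this example; it is not the Porter stemmer or any other published algorithm.

```python
STOPWORDS = {"and", "or", "the", "in"}  # the slide's examples, not a complete list
SUFFIXES = ("ers", "ing", "er", "es", "s", "e")  # checked in this order


def remove_stopwords(tokens: list[str]) -> list[str]:
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


tokens = ["the", "computer", "and", "computers", "in", "computing", "compute"]
stems = [naive_stem(t) for t in remove_stopwords(tokens)]
print(stems)
```

All four "comput-" words collapse to the same stem, so a query for one matches documents containing any of them. The same crude rules will also merge unrelated words, which is one reason stemming is not effective for all queries.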
Text Transformation
 Link Analysis
 Makes use of links and anchor text in web pages
 Link analysis identifies popularity and community
information
 e.g., PageRank
 Anchor text can significantly enhance the representation
of pages pointed to by links
 Significant impact on web search
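As one illustration of link analysis, the sketch below runs a basic power-iteration form of PageRank on a tiny hand-made link graph. The damping factor of 0.85, the fixed iteration count, and the graph itself are assumptions for the example; production systems use many refinements not shown here.

```python
def pagerank(links: dict[str, list[str]], damping: float = 0.85,
             iterations: int = 50) -> dict[str, float]:
    """Basic PageRank by power iteration. Every link target must be a key of `links`."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical graph: A, B, and D all link to C; C links back to A.
graph = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))
```

Page C, which receives the most incoming links, ends up with the highest score, and A benefits from being linked to by the popular page C.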
Text Transformation
 Information Extraction
 Identify classes of index terms that are important for some
applications
 e.g., named entity recognizers identify classes such as
people, locations, companies, dates, etc.
 Classifier
 Identifies class-related metadata for documents
 i.e., assigns labels to documents
 e.g., topics, reading levels, sentiment, genre
 Use depends on application
Index Creation
 Document Statistics
 Gathers counts and positions of words and other features
 Used in ranking algorithm
 Weighting
 Computes weights for index terms
 Used in ranking algorithm
 e.g., tf.idf weight
 Combination of term frequency in document and inverse
document frequency in the collection
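There are many tf.idf variants. The sketch below assumes one simple form, raw term frequency multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term. The toy documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Weight each term in each document by tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        term_freq = Counter(doc)
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return weights


docs = [
    ["bank", "scandal", "bank"],
    ["bank", "director"],
    ["river", "bank"],
]
weights = tf_idf(docs)
print(weights[0])
```

"bank" occurs in every document, so its idf is log(1) = 0 and it gets zero weight even though it appears twice in the first document. "scandal" occurs in only one document and gets weight log(3).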
Index Creation
 Inversion
 Core of indexing process
 Converts document-term information to term-document
for indexing
 Difficult for very large numbers of documents
 Format of inverted file is designed for fast query
processing
 Must also handle updates
 Compression used for efficiency
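Inversion can be shown on a toy collection: the input maps each document to its terms, and the output maps each term to the documents (and word positions) where it occurs. This in-memory sketch ignores the scale, update, and compression issues listed above; `invert` is a hypothetical function name.

```python
from collections import defaultdict


def invert(docs: dict[int, list[str]]) -> dict[str, dict[int, list[int]]]:
    """Turn document -> terms into term -> {document: [positions]}."""
    index: dict[str, dict[int, list[int]]] = defaultdict(dict)
    for doc_id, terms in docs.items():
        for position, term in enumerate(terms):
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)


docs = {
    1: ["web", "search", "engine"],
    2: ["search", "the", "web", "search"],
}
index = invert(docs)
print(index["search"])
```

Looking up a query term is now a single dictionary access instead of a scan over every document.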
Index Creation
 Index Distribution
 Distributes indexes across multiple computers and/or
multiple sites
 Essential for fast query processing with large numbers of
documents
 Many variations
 Document distribution, term distribution, replication
 P2P and distributed IR involve search across multiple sites
User Interaction
 Query input
 Provides interface and parser for query language
 Most web queries are very simple, other applications may
use forms
 Query language used to describe more complex queries
and results of query transformation
 e.g., Boolean queries
 similar to SQL language used in database applications
 IR query languages also allow content and structure specifications,
but focus on content
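Boolean queries map naturally onto set operations over the lists of documents that contain each term. The postings below are made-up data, and `docs_for` is a hypothetical helper; a real query language would also need a parser.

```python
# Hypothetical postings: term -> set of IDs of documents containing it.
POSTINGS = {
    "bank": {1, 2, 4},
    "scandal": {2, 3},
    "amherst": {2, 4},
}


def docs_for(term: str) -> set[int]:
    """Return the documents containing the term (empty set if unknown)."""
    return POSTINGS.get(term, set())


both = docs_for("bank") & docs_for("scandal")       # bank AND scandal
either = docs_for("scandal") | docs_for("amherst")  # scandal OR amherst
excluded = docs_for("bank") - docs_for("amherst")   # bank AND NOT amherst
print(both, either, excluded)
```

Note that a Boolean query returns an unordered set: a document either matches or it does not, and ranking has to be added separately.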
Query Process
User Interaction
 Query transformation
 Improves initial query, both before and after initial search
 Includes text transformation techniques used for
documents
 Spell checking and query suggestion provide alternatives
to original query
 Query expansion and relevance feedback modify the
original query with additional terms
User Interaction
 Results output
 Constructs the display of ranked documents for a query
 Generates snippets to show how queries match
documents
 Highlights important words and passages
 Retrieves appropriate advertising in many applications
 May provide clustering and other visualization tools
Ranking
 Scoring
 Calculates scores for documents using a ranking algorithm
 Core component of search engine
 Basic form of score is $\sum_i q_i d_i$
 $q_i$ and $d_i$ are the query and document term weights for term $i$
 Many variations of ranking algorithms and retrieval
models
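The basic score, a sum over terms of query weight times document weight, can be sketched as a dot product over sparse term-weight dictionaries. The weights and document IDs below are invented for illustration; how the weights are derived depends on the retrieval model.

```python
def score(query_weights: dict[str, float], doc_weights: dict[str, float]) -> float:
    """Sum of q_i * d_i over the query terms; terms absent from the document contribute 0."""
    return sum(q * doc_weights.get(term, 0.0) for term, q in query_weights.items())


def rank(query_weights: dict[str, float],
         docs: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Score every document and sort by descending score."""
    scored = [(doc_id, score(query_weights, w)) for doc_id, w in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


query = {"bank": 1.0, "scandal": 2.0}
docs = {
    "d1": {"bank": 0.5, "river": 1.5},
    "d2": {"bank": 0.5, "scandal": 1.0},
    "d3": {"director": 1.0},
}
ranking = rank(query, docs)
print(ranking)
```

d2 matches both query terms and scores 1.0 × 0.5 + 2.0 × 1.0 = 2.5, d1 matches only "bank" and scores 0.5, and d3 matches nothing.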
Ranking
 Performance optimization
 Designing ranking algorithms for efficient processing
 Term-at-a-time vs. document-at-a-time processing
 Safe vs. unsafe optimizations
 Distribution
 Processing queries in a distributed environment
 Query broker distributes queries and assembles results
 Caching is a form of distributed searching
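The difference between term-at-a-time and document-at-a-time processing is the order in which the inverted lists are traversed, not the final scores. The sketch below assumes a toy index of the form term -> {document: weight} and simple additive scoring; both function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical inverted index: term -> {doc_id: weight}.
INDEX = {
    "bank": {1: 0.5, 2: 0.5, 4: 1.0},
    "scandal": {2: 1.0, 3: 2.0},
}


def term_at_a_time(query: list[str]) -> dict[int, float]:
    """Process one full inverted list per term, adding into per-document accumulators."""
    accumulators: dict[int, float] = defaultdict(float)
    for term in query:
        for doc_id, weight in INDEX.get(term, {}).items():
            accumulators[doc_id] += weight
    return dict(accumulators)


def document_at_a_time(query: list[str]) -> dict[int, float]:
    """Finish scoring one document across all query terms before moving to the next."""
    postings = [INDEX.get(term, {}) for term in query]
    candidates = sorted(set().union(*postings))
    return {
        doc_id: sum(p.get(doc_id, 0.0) for p in postings)
        for doc_id in candidates
    }


taat = term_at_a_time(["bank", "scandal"])
daat = document_at_a_time(["bank", "scandal"])
print(taat == daat)
```

Term-at-a-time needs an accumulator for every candidate document, while document-at-a-time completes each document's score before moving on, which makes it easier to keep only the current top results.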
Evaluation
 Logging
 Logging user queries and interaction is crucial for
improving search effectiveness and efficiency
 Query logs and clickthrough data used for query
suggestion, spell checking, query caching, ranking,
advertising search, and other components
 Ranking analysis
 Measuring and tuning ranking effectiveness
 Performance analysis
 Measuring and tuning system efficiency
How Does It Really Work?
 This course explains these components of a search
engine in more detail
 Often many possible approaches and techniques for a
given component
 Focus is on the most important alternatives
 i.e., explain a small number of approaches in detail rather than
many approaches
 “Importance” based on research results and use in actual search
engines
 Alternatives described in references
Thank you

More Related Content

What's hot

How search engines work
How search engines workHow search engines work
How search engines workChinna Botla
 
Apache Mahout Architecture Overview
Apache Mahout Architecture OverviewApache Mahout Architecture Overview
Apache Mahout Architecture OverviewStefano Dalla Palma
 
Web Mining Presentation Final
Web Mining Presentation FinalWeb Mining Presentation Final
Web Mining Presentation FinalEr. Jagrat Gupta
 
Web Mining & Text Mining
Web Mining & Text MiningWeb Mining & Text Mining
Web Mining & Text MiningHemant Sharma
 
Introduction to Web Scraping using Python and Beautiful Soup
Introduction to Web Scraping using Python and Beautiful SoupIntroduction to Web Scraping using Python and Beautiful Soup
Introduction to Web Scraping using Python and Beautiful SoupTushar Mittal
 
WEB BASED INFORMATION RETRIEVAL SYSTEM
WEB BASED INFORMATION RETRIEVAL SYSTEMWEB BASED INFORMATION RETRIEVAL SYSTEM
WEB BASED INFORMATION RETRIEVAL SYSTEMSai Kumar Ale
 
Google Analytics
Google AnalyticsGoogle Analytics
Google AnalyticsDaniel Ku
 
Ranking algorithms
Ranking algorithmsRanking algorithms
Ranking algorithmsAnkit Raj
 
Web mining slides
Web mining slidesWeb mining slides
Web mining slidesmahavir_a
 
Search engine and web crawler
Search engine and web crawlerSearch engine and web crawler
Search engine and web crawlerishmecse13
 
Text clustering
Text clusteringText clustering
Text clusteringKU Leuven
 

What's hot (20)

How search engines work
How search engines workHow search engines work
How search engines work
 
Gaurav web mining
Gaurav web miningGaurav web mining
Gaurav web mining
 
Apache Mahout Architecture Overview
Apache Mahout Architecture OverviewApache Mahout Architecture Overview
Apache Mahout Architecture Overview
 
Web mining
Web miningWeb mining
Web mining
 
Web Mining Presentation Final
Web Mining Presentation FinalWeb Mining Presentation Final
Web Mining Presentation Final
 
Web Mining & Text Mining
Web Mining & Text MiningWeb Mining & Text Mining
Web Mining & Text Mining
 
Web mining
Web miningWeb mining
Web mining
 
Introduction to Web Scraping using Python and Beautiful Soup
Introduction to Web Scraping using Python and Beautiful SoupIntroduction to Web Scraping using Python and Beautiful Soup
Introduction to Web Scraping using Python and Beautiful Soup
 
Web Crawling & Crawler
Web Crawling & CrawlerWeb Crawling & Crawler
Web Crawling & Crawler
 
Web crawler
Web crawlerWeb crawler
Web crawler
 
WEB BASED INFORMATION RETRIEVAL SYSTEM
WEB BASED INFORMATION RETRIEVAL SYSTEMWEB BASED INFORMATION RETRIEVAL SYSTEM
WEB BASED INFORMATION RETRIEVAL SYSTEM
 
Google Analytics
Google AnalyticsGoogle Analytics
Google Analytics
 
Web Scraping
Web ScrapingWeb Scraping
Web Scraping
 
Ranking algorithms
Ranking algorithmsRanking algorithms
Ranking algorithms
 
Web mining slides
Web mining slidesWeb mining slides
Web mining slides
 
Search engine and web crawler
Search engine and web crawlerSearch engine and web crawler
Search engine and web crawler
 
Text clustering
Text clusteringText clustering
Text clustering
 
Web crawler
Web crawlerWeb crawler
Web crawler
 
Web usage mining
Web usage miningWeb usage mining
Web usage mining
 
Search Engine
Search EngineSearch Engine
Search Engine
 

Similar to Web Search and Mining

Using Search Analytics to Diagnose What’s Ailing your Information Architecture
Using Search Analytics to Diagnose What’s Ailing your Information ArchitectureUsing Search Analytics to Diagnose What’s Ailing your Information Architecture
Using Search Analytics to Diagnose What’s Ailing your Information ArchitectureLouis Rosenfeld
 
A survey on various architectures, models and methodologies for information r...
A survey on various architectures, models and methodologies for information r...A survey on various architectures, models and methodologies for information r...
A survey on various architectures, models and methodologies for information r...IAEME Publication
 
`A Survey on approaches of Web Mining in Varied Areas
`A Survey on approaches of Web Mining in Varied Areas`A Survey on approaches of Web Mining in Varied Areas
`A Survey on approaches of Web Mining in Varied Areasinventionjournals
 
Search Analytics for Fun and Profit
Search Analytics for Fun and ProfitSearch Analytics for Fun and Profit
Search Analytics for Fun and ProfitLouis Rosenfeld
 
Business Intelligence Solution Using Search Engine
Business Intelligence Solution Using Search EngineBusiness Intelligence Solution Using Search Engine
Business Intelligence Solution Using Search Engineankur881120
 
Search Engines Other than Google
Search Engines Other than GoogleSearch Engines Other than Google
Search Engines Other than GoogleDr Trivedi
 
SEO Basics - SEO Company in India
SEO Basics - SEO Company in IndiaSEO Basics - SEO Company in India
SEO Basics - SEO Company in Indiaannakoch32
 
Sweeny ux-seo om-cap 2014_v3
Sweeny ux-seo om-cap 2014_v3Sweeny ux-seo om-cap 2014_v3
Sweeny ux-seo om-cap 2014_v3Marianne Sweeny
 
Search Analytics: Conversations with Your Customers
Search Analytics: Conversations with Your CustomersSearch Analytics: Conversations with Your Customers
Search Analytics: Conversations with Your Customersrichwig
 
SEO Tutorial - SEO Company in India
SEO Tutorial - SEO Company in IndiaSEO Tutorial - SEO Company in India
SEO Tutorial - SEO Company in Indiaannakoch32
 
Smartlogic, Semaphore and Semantically Enhanced Search – For “Discovery”
Smartlogic, Semaphore and Semantically Enhanced Search –  For “Discovery”Smartlogic, Semaphore and Semantically Enhanced Search –  For “Discovery”
Smartlogic, Semaphore and Semantically Enhanced Search – For “Discovery”voginip
 
Smartlogic, Semaphore and Semantically Enhanced Search – For “Discovery”
Smartlogic, Semaphore and Semantically Enhanced Search –  For “Discovery”Smartlogic, Semaphore and Semantically Enhanced Search –  For “Discovery”
Smartlogic, Semaphore and Semantically Enhanced Search – For “Discovery”VOGIN-academie
 
Searchland: Search quality for Beginners
Searchland: Search quality for BeginnersSearchland: Search quality for Beginners
Searchland: Search quality for BeginnersValeria de Paiva
 
Search Analytics: Diagnosing what ails your site
Search Analytics:  Diagnosing what ails your siteSearch Analytics:  Diagnosing what ails your site
Search Analytics: Diagnosing what ails your siteLouis Rosenfeld
 
Role of Text Mining in Search Engine
Role of Text Mining in Search EngineRole of Text Mining in Search Engine
Role of Text Mining in Search EngineJay R Modi
 
INTRODUCTION TO INFORMATION RETRIVAL
INTRODUCTION TO INFORMATION RETRIVALINTRODUCTION TO INFORMATION RETRIVAL
INTRODUCTION TO INFORMATION RETRIVALsathish sak
 

Similar to Web Search and Mining (20)

Using Search Analytics to Diagnose What’s Ailing your Information Architecture
Using Search Analytics to Diagnose What’s Ailing your Information ArchitectureUsing Search Analytics to Diagnose What’s Ailing your Information Architecture
Using Search Analytics to Diagnose What’s Ailing your Information Architecture
 
A survey on various architectures, models and methodologies for information r...
A survey on various architectures, models and methodologies for information r...A survey on various architectures, models and methodologies for information r...
A survey on various architectures, models and methodologies for information r...
 
`A Survey on approaches of Web Mining in Varied Areas
`A Survey on approaches of Web Mining in Varied Areas`A Survey on approaches of Web Mining in Varied Areas
`A Survey on approaches of Web Mining in Varied Areas
 
Search Analytics for Fun and Profit
Search Analytics for Fun and ProfitSearch Analytics for Fun and Profit
Search Analytics for Fun and Profit
 
Business Intelligence Solution Using Search Engine
Business Intelligence Solution Using Search EngineBusiness Intelligence Solution Using Search Engine
Business Intelligence Solution Using Search Engine
 
CS8080 IRT UNIT I NOTES.pdf
CS8080 IRT UNIT I  NOTES.pdfCS8080 IRT UNIT I  NOTES.pdf
CS8080 IRT UNIT I NOTES.pdf
 
CS8080_IRT__UNIT_I_NOTES.pdf
CS8080_IRT__UNIT_I_NOTES.pdfCS8080_IRT__UNIT_I_NOTES.pdf
CS8080_IRT__UNIT_I_NOTES.pdf
 
Chap1
Chap1Chap1
Chap1
 
Search Engines Other than Google
Search Engines Other than GoogleSearch Engines Other than Google
Search Engines Other than Google
 
SEO Basics - SEO Company in India
SEO Basics - SEO Company in IndiaSEO Basics - SEO Company in India
SEO Basics - SEO Company in India
 
Sweeny ux-seo om-cap 2014_v3
Sweeny ux-seo om-cap 2014_v3Sweeny ux-seo om-cap 2014_v3
Sweeny ux-seo om-cap 2014_v3
 
Search Analytics: Conversations with Your Customers
Search Analytics: Conversations with Your CustomersSearch Analytics: Conversations with Your Customers
Search Analytics: Conversations with Your Customers
 
SEO Tutorial - SEO Company in India
SEO Tutorial - SEO Company in IndiaSEO Tutorial - SEO Company in India
SEO Tutorial - SEO Company in India
 
Smartlogic, Semaphore and Semantically Enhanced Search – For “Discovery”
Smartlogic, Semaphore and Semantically Enhanced Search –  For “Discovery”Smartlogic, Semaphore and Semantically Enhanced Search –  For “Discovery”
Smartlogic, Semaphore and Semantically Enhanced Search – For “Discovery”
 
Smartlogic, Semaphore and Semantically Enhanced Search – For “Discovery”
Smartlogic, Semaphore and Semantically Enhanced Search –  For “Discovery”Smartlogic, Semaphore and Semantically Enhanced Search –  For “Discovery”
Smartlogic, Semaphore and Semantically Enhanced Search – For “Discovery”
 
Searchland: Search quality for Beginners
Searchland: Search quality for BeginnersSearchland: Search quality for Beginners
Searchland: Search quality for Beginners
 
Search Engine
Search EngineSearch Engine
Search Engine
 
Search Analytics: Diagnosing what ails your site
Search Analytics:  Diagnosing what ails your siteSearch Analytics:  Diagnosing what ails your site
Search Analytics: Diagnosing what ails your site
 
Role of Text Mining in Search Engine
Role of Text Mining in Search EngineRole of Text Mining in Search Engine
Role of Text Mining in Search Engine
 
INTRODUCTION TO INFORMATION RETRIVAL
INTRODUCTION TO INFORMATION RETRIVALINTRODUCTION TO INFORMATION RETRIVAL
INTRODUCTION TO INFORMATION RETRIVAL
 

More from sathish sak

TRANSPARENT CONCRE
TRANSPARENT CONCRETRANSPARENT CONCRE
TRANSPARENT CONCREsathish sak
 
Stationary Waves
Stationary WavesStationary Waves
Stationary Wavessathish sak
 
Electrical Activity of the Heart
Electrical Activity of the HeartElectrical Activity of the Heart
Electrical Activity of the Heartsathish sak
 
Electrical Activity of the Heart
Electrical Activity of the HeartElectrical Activity of the Heart
Electrical Activity of the Heartsathish sak
 
Software process life cycles
Software process life cyclesSoftware process life cycles
Software process life cycles sathish sak
 
Digital Logic Circuits
Digital Logic CircuitsDigital Logic Circuits
Digital Logic Circuitssathish sak
 
Real-Time Scheduling
Real-Time SchedulingReal-Time Scheduling
Real-Time Schedulingsathish sak
 
Real-Time Signal Processing: Implementation and Application
Real-Time Signal Processing:  Implementation and ApplicationReal-Time Signal Processing:  Implementation and Application
Real-Time Signal Processing: Implementation and Applicationsathish sak
 
DIGITAL SIGNAL PROCESSOR OVERVIEW
DIGITAL SIGNAL PROCESSOR OVERVIEWDIGITAL SIGNAL PROCESSOR OVERVIEW
DIGITAL SIGNAL PROCESSOR OVERVIEWsathish sak
 
FRACTAL ROBOTICS
FRACTAL  ROBOTICSFRACTAL  ROBOTICS
FRACTAL ROBOTICSsathish sak
 
POWER GENERATION OF THERMAL POWER PLANT
POWER GENERATION OF THERMAL POWER PLANTPOWER GENERATION OF THERMAL POWER PLANT
POWER GENERATION OF THERMAL POWER PLANTsathish sak
 
mathematics application fiels of engineering
mathematics application fiels of engineeringmathematics application fiels of engineering
mathematics application fiels of engineeringsathish sak
 
ENVIRONMENTAL POLLUTION
ENVIRONMENTALPOLLUTIONENVIRONMENTALPOLLUTION
ENVIRONMENTAL POLLUTIONsathish sak
 
