3. INTRODUCTION
● Information retrieval (IR) is the process of retrieving documents from a collection in
response to a query or search request from a user. An IR system differs from a
database system in that it does not restrict the user to a specific query language (such
as SQL or MongoDB's query language).
● Additionally, an IR system offers flexibility because the user does not need any prior
knowledge of the schema or structure of a particular database.
● Thus, the user can issue a free-form search request, which the IR system takes as
input and processes to provide the user with the desired information.
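The idea of free-form retrieval described above can be sketched minimally: rank documents by how many query terms they contain, with no query language or schema required. The toy document collection below is invented for illustration.

```python
def retrieve(query, docs):
    """Rank documents by the number of query terms they contain."""
    terms = set(query.lower().split())
    scored = []
    for doc_id, text in docs.items():
        overlap = len(terms & set(text.lower().split()))
        if overlap:
            scored.append((doc_id, overlap))
    scored.sort(key=lambda pair: -pair[1])  # most matching terms first
    return [doc_id for doc_id, _ in scored]

# Invented toy collection
docs = {
    "d1": "information retrieval systems answer free form queries",
    "d2": "database systems require a structured query language",
    "d3": "web crawlers download pages for search engines",
}
results = retrieve("free form information retrieval", docs)
```

Real IR systems are far more sophisticated (stemming, weighting, ranking models), but the interface is the same: free text in, ranked documents out.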
5. APPROACHES TO INFORMATION
RETRIEVAL
● Statistical
● In the statistical approach documents
are analyzed and broken down into
chunks of text and each word or phrase
is counted, weighted and measured for
relevance or importance.
● The statistical properties are then
compared with terms from the query,
and a relevance ranking for each
document is generated. The models used
for this relevance assessment are the
Boolean model, the Vector Space model,
and the Probabilistic model.
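Of the three models named above, the Vector Space model can be sketched briefly: documents and the query become term-frequency vectors and are ranked by cosine similarity. This is a minimal sketch (raw term frequencies, no IDF weighting); the corpus is invented.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank(query, docs):
    """Return doc ids sorted by cosine similarity to the query."""
    qv = Counter(query.lower().split())
    scores = {d: cosine(qv, Counter(text.lower().split()))
              for d, text in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Invented toy corpus
docs = {
    "d1": "the vector space model represents documents as term vectors",
    "d2": "the boolean model matches documents with set operations",
}
ranking = rank("vector space model", docs)
```

A production system would weight terms with TF-IDF rather than raw counts, so that frequent but uninformative words contribute less.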
● Semantic
● In the semantic approach to IR, the
syntactic, lexical, sentential and
pragmatic levels of knowledge-
understanding are used to generate a
relevance ranking for documents.
● The development of a sophisticated
semantic system requires complex
knowledge bases of semantic
information as well as retrieval
heuristics.
6. WEB SEARCH AND ANALYSIS
Web Structure Analysis:
● The goal is to generate a structural representation of the
website and its webpages, by focusing on the internal structure of
documents and the linking structure defined by hyperlinks
at the inter-document level.
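The linking structure mentioned above is commonly analyzed with link-based scoring. As a sketch, the following runs a few power-iteration steps of a PageRank-style score over a hand-made link graph (the pages and links are invented):

```python
def pagerank(links, damping=0.85, iters=50):
    """Iteratively distribute rank along outgoing links."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for page, outs in links.items():
            share = rank[page] / len(outs) if outs else 0.0
            for target in outs:
                new[target] += damping * share
        rank = new
    return rank

# Invented link graph: page "c" is linked to by both "a" and "b"
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = pagerank(links)
```

Here "c" ends up with the highest score because it receives links from two pages, illustrating how the hyperlink structure alone carries information about page importance.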
7. Web Content Analysis:
● The process of discovering useful information from Web
content/data/documents, which contain a combination of structured and
unstructured data.
8. Web Usage Analysis:
● Web usage data describes the pattern of usage of Web pages, such as IP
Addresses, page references, and the date and time of accesses for a
user/application.
● It consists of three main phases: preprocessing, pattern discovery, and
pattern analysis.
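The three phases above can be sketched on a few invented log lines: preprocessing parses the raw fields, pattern discovery counts page accesses, and pattern analysis reports the most visited page.

```python
from collections import Counter

# Invented access-log lines: IP address, page reference, access time
log = [
    "192.0.2.1 /index.html 2024-01-05T10:00",
    "192.0.2.2 /docs.html 2024-01-05T10:01",
    "192.0.2.1 /index.html 2024-01-05T10:02",
]

records = [line.split() for line in log]           # phase 1: preprocessing
hits = Counter(page for _, page, _ in records)     # phase 2: pattern discovery
top_page, count = hits.most_common(1)[0]           # phase 3: pattern analysis
```

Real usage analysis adds session identification, bot filtering, and richer pattern mining, but the three-phase pipeline has this same shape.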
9. WEB CRAWLING
● Web crawlers are programs that "crawl" through the
internet, generating copies of the Web pages that they visit
and downloading the content to a database.
● Web crawlers have multiple uses; for example, search
engines use them to provide fast searches to users.
● There are three main types of Web crawlers:
● Focused Web Crawling.
● Distributed Web Crawling.
● Incremental Web Crawling.
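The crawling process described above can be sketched as a breadth-first traversal. To stay self-contained, a dict stands in for the network here (all URLs are invented); a real crawler would fetch each page over HTTP, store its content, and extract its links.

```python
from collections import deque

# Invented "web": each URL maps to the links found on that page
web = {
    "http://example.com/":  ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b"],
    "http://example.com/b": [],
}

def crawl(seed):
    """Breadth-first crawl from a seed URL, visiting each page once."""
    seen, queue, order = {seed}, deque([seed]), []
    while queue:
        url = queue.popleft()
        order.append(url)            # a real crawler downloads the page here
        for link in web.get(url, []):
            if link not in seen:     # avoid revisiting pages
                seen.add(link)
                queue.append(link)
    return order

visited = crawl("http://example.com/")
```

The variants listed above change the policy around this loop: a focused crawler filters which links enter the queue, a distributed crawler partitions the queue across machines, and an incremental crawler revisits pages to keep copies fresh.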
10. CHALLENGES AND FURTHER TRENDS
● Challenges:
● 1. Web Structure,
● 2. Crawling and Indexing,
● 3. Searching.
11. ● Web Structure:
● Issues arise in defining the document collection of the Web, as well as in examining how
the unique structure of these documents affects how information is retrieved.
● Crawling and Indexing:
● Crawling and Indexing present the challenge of finding an architecture that
provides fresh information as well as full coverage of the web. This is
problematic when using a centralized search engine. One question is whether
it would be beneficial to use a distributed architecture instead.
● Searching:
● The challenge in searching is to find techniques that use all available evidence to return
high-quality results to users. Doing this efficiently involves using known
information about the user and their search history.
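Using a user's history as evidence, as described above, can be sketched as a simple re-ranking step: results whose titles overlap with past queries receive a small score boost. The results, scores, and history below are all invented, and the boost weight is an arbitrary illustrative choice.

```python
def rerank(results, history):
    """Boost (title, base_score) results that share terms with past queries."""
    hist_terms = set(" ".join(history).lower().split())

    def boosted_score(item):
        title, base = item
        overlap = len(hist_terms & set(title.lower().split()))
        return base + 0.1 * overlap  # small boost per shared term (arbitrary)

    return [title for title, _ in
            sorted(results, key=boosted_score, reverse=True)]

# Invented results with equal base relevance, plus an invented history
results = [("python tutorial", 0.5), ("java tutorial", 0.5)]
history = ["python decorators", "python asyncio"]
ranked = rerank(results, history)
```

With equal base scores, the history breaks the tie in favor of the result matching the user's past interests; real personalization models are far richer, but follow this combine-evidence pattern.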
12. CONCLUSION
As the amount of unstructured information being generated continues to increase, it is
important for information retrieval techniques to continue to improve in order to take
full advantage of this information. The Web is a massive repository of unstructured
data and one of the foremost applications of Information Retrieval. As the Web is
still a relatively recent phenomenon, methods for improving the ways in which we
search the Web and analyze the quality of the results will continue to evolve.