This document discusses different information retrieval models including the Boolean model, vector space model, and probabilistic model. It focuses on describing the Boolean model and its drawbacks. Term frequency-inverse document frequency (TF-IDF) weighting is explained as a way to assign weights to terms based on frequency and document distribution. Cosine similarity is presented as a common way to measure similarity between a document vector and query vector in the vector space model.
1. UNIT-II (Modelling and Retrieval Evaluation)
IV Year / VIII Semester
By
P.THENMOZHI AP/CSE
KNCET.
KONGUNADU COLLEGE OF ENGINEERING AND
TECHNOLOGY
(Autonomous)
NAMAKKAL- TRICHY MAIN ROAD, THOTTIAM
DEPARTMENT OF COMPUTER SCIENCE AND
ENGINEERING
CS8080 – Information Retrieval Techniques
3. MODELING AND RETRIEVAL
EVALUATION
• Basic Retrieval Models
• An IR model governs how a document and a
query are represented and how the relevance
of a document to a user query is defined.
• There are three main IR models:
– Boolean model
– Vector space model
– Probabilistic model
4. • Each term is associated with a weight. Given a
collection of documents D, let
• $V = \{t_1, t_2, \ldots, t_{|V|}\}$ be the set of distinct
terms in the collection, where $t_i$ is a term.
• The set V is usually called the vocabulary of
the collection, and |V| is its size,
• i.e., the number of terms in V.
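The vocabulary definition above can be sketched in a few lines of Python. This is only an illustration: the three toy documents, the lowercase whitespace tokenizer, and the name `build_vocabulary` are assumptions, not part of the slides.

```python
# Hypothetical toy collection D; the texts are made up for illustration.
docs = {
    "d1": "information retrieval models",
    "d2": "boolean retrieval model",
    "d3": "vector space model",
}

def build_vocabulary(collection):
    """Return the sorted list of distinct terms V found in the collection."""
    vocab = set()
    for text in collection.values():
        # Assumption: terms are lowercase, whitespace-separated tokens.
        vocab.update(text.lower().split())
    return sorted(vocab)

V = build_vocabulary(docs)
print(V)       # the vocabulary V
print(len(V))  # |V|, the number of distinct terms
```

Here "model" and "models" count as two distinct terms because no stemming is applied; a real system would usually normalize terms further.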
5. • An IR model is a quadruple [D, Q, F, R(qi, dj)]
where
• 1. D is a set of logical views for the documents
in the collection
• 2. Q is a set of logical views for the user
queries
• 3. F is a framework for modeling documents
and queries
• 4. R(qi, dj) is a ranking function
6.
7. Boolean Model
• The Boolean model is one of the earliest and
simplest information retrieval models.
• It uses the notion of exact matching to match
documents to the user query.
• Both the query and the retrieval are based on
Boolean algebra.
8. • In the Boolean model, documents and queries
are represented as sets of terms.
• That is, each term is only considered present
or absent in a document.
9. • Boolean Queries:
• Query terms are combined logically using the Boolean
operators AND, OR, and NOT, which have their usual
semantics in logic.
• Thus, a Boolean query has precise semantics.
• For instance, the query, ((x AND y) AND (NOT z)) says
that a retrieved document must contain both the terms
x and y but not z.
• As another example, the query expression (x OR y)
means that at least one of these terms must be in each
retrieved document.
• Here, we assume that x, y and z are terms. In general,
they can be Boolean expressions themselves.
10. • Document Retrieval:
• Given a Boolean query, the system retrieves
every document that makes the query
logically true.
• Retrieval is thus based on a binary
decision criterion: a document is either
relevant or irrelevant. This is called
exact matching.
• Most search engines support some limited
forms of Boolean retrieval using explicit
inclusion and exclusion operators.
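A minimal sketch of Boolean retrieval with Python sets, using the two example queries from the slides. The four toy documents and the helper name `postings` are assumptions made for illustration; AND maps to set intersection, OR to union, and NOT to the complement with respect to the whole collection.

```python
# Hypothetical toy collection: each document is reduced to its set of terms.
docs = {
    "d1": {"x", "y"},
    "d2": {"x", "y", "z"},
    "d3": {"y"},
    "d4": {"w"},
}

def postings(term):
    """Set of document ids whose term set contains the given term."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}

all_docs = set(docs)

# ((x AND y) AND (NOT z)): intersection for AND, complement for NOT.
result_and_not = (postings("x") & postings("y")) & (all_docs - postings("z"))

# (x OR y): union for OR.
result_or = postings("x") | postings("y")

print(sorted(result_and_not))  # ['d1']
print(sorted(result_or))       # ['d1', 'd2', 'd3']
```

Every document is either in the result set or not, with no ordering among the retrieved documents, which is the exact-match behaviour described above.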
11. • Drawbacks of the Boolean Model
• No ranking of the documents is provided
(absence of a grading scale)
• Information need has to be translated into a
Boolean expression, which most users find
awkward
• The Boolean queries that users formulate
are often too simplistic.
12. TF-IDF (Term Frequency/Inverse
Document Frequency) Weighting
• We assign to each term in a document a
weight for that term that depends on the
number of occurrences of the term in the
document.
• We would like to compute a score between a
query term t and a document d, based on the
weight of t in d. The simplest approach is to
assign the weight to be equal to the number
of occurrences of term t in document d.
13. • This weighting scheme is referred to as term
frequency and is denoted $\mathrm{tf}_{t,d}$, with the
subscripts denoting the term and the
document in order.
• For a document d, the set of weights
determined by the tf weights above (or indeed
any weighting function that maps the number
of occurrences of t in d to a positive real
value) may be viewed as a quantitative digest
of that document.
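Raw term frequency can be computed with a counter, as in the sketch below. The sample sentence and the whitespace tokenizer are assumptions; the slides do not prescribe a tokenization.

```python
from collections import Counter

def term_frequencies(text):
    """Map each term t to its raw count in the document text (tf of t in d)."""
    # Assumption: terms are lowercase, whitespace-separated tokens.
    return Counter(text.lower().split())

tf = term_frequencies("to be or not to be")
print(tf["to"], tf["be"], tf["or"])  # 2 2 1
print(tf["question"])                # 0, the term does not occur
```

The resulting mapping from terms to counts is the "quantitative digest" of the document: word order is discarded and only the counts remain.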
14. • How is the document frequency df of a term
used to scale its weight? Denoting as usual the
total number of documents in a collection by
N, we define the inverse document frequency
(idf) of a term t as follows:
• $\mathrm{idf}_t = \log \dfrac{N}{\mathrm{df}_t}$
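The idf definition can be sketched as follows. The slides do not state the base of the logarithm, so base 10 is an assumption here, as are the toy documents and the function name.

```python
import math

# Hypothetical toy collection: each document is given as its set of terms.
documents = [
    {"boolean", "model"},
    {"vector", "space", "model"},
    {"probabilistic", "model"},
    {"cosine", "similarity", "model"},
]

def inverse_document_frequency(term, docs):
    """Return log10(N / df_t), where df_t counts the documents containing term."""
    n_docs = len(docs)
    df = sum(1 for terms in docs if term in terms)
    if df == 0:
        # The definition divides by df_t, so an unseen term has no idf.
        raise ValueError(f"term {term!r} does not occur in the collection")
    return math.log10(n_docs / df)  # base 10 is an assumption

print(inverse_document_frequency("model", documents))              # 0.0
print(round(inverse_document_frequency("boolean", documents), 3))  # 0.602
```

A term that occurs in every document ("model") gets idf 0, while a rare term ("boolean") gets a higher value, so rare terms contribute more to a document's weight.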
15. • Tf-idf weighting
• We now combine the definitions of term
frequency and inverse document frequency, to
produce a composite weight for each term in
each document.
• The tf-idf weighting scheme assigns to term t
a weight in document d given by
• $\text{tf-idf}_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t$
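As a worked illustration with made-up numbers (assumptions: a base-10 logarithm, a collection of $N = 1000$ documents, a term $t$ that appears in $\mathrm{df}_t = 10$ of them and occurs $\mathrm{tf}_{t,d} = 3$ times in document $d$):

$$
\mathrm{idf}_t = \log_{10} \frac{1000}{10} = 2,
\qquad
\text{tf-idf}_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t = 3 \times 2 = 6 .
$$

If the same term appeared in all 1000 documents, $\mathrm{idf}_t$ would be $\log_{10} 1 = 0$ and the weight would be 0 regardless of how often the term occurs in $d$.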
16. • A simple score for document d is the sum, over all
query terms, of the number of times each of the query
terms occurs in d.
• We can refine this idea so that we add up not
the number of occurrences of each query
term t in d, but instead the tf-idf weight of
each term in d.
• $\mathrm{Score}(q, d) = \sum_{t \in q} \text{tf-idf}_{t,d}$
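The scoring rule can be sketched as a small ranking routine. The toy collection, the whitespace tokenizer, the base-10 logarithm, and the choice to give idf 0 to terms absent from the collection are all assumptions made for this illustration.

```python
import math
from collections import Counter

# Hypothetical toy collection; the texts are made up for illustration.
docs = {
    "d1": "boolean model boolean query",
    "d2": "vector space model",
    "d3": "probabilistic model",
}
tokens = {doc_id: text.split() for doc_id, text in docs.items()}
N = len(tokens)

def idf(term):
    """log10(N / df_t); returns 0.0 for terms absent from the collection."""
    df = sum(1 for words in tokens.values() if term in words)
    return math.log10(N / df) if df else 0.0

def score(query, doc_id):
    """Sum of the tf-idf weights in the document over the query terms."""
    tf = Counter(tokens[doc_id])
    return sum(tf[t] * idf(t) for t in query.split())

ranking = sorted(docs, key=lambda d: score("boolean model", d), reverse=True)
print(ranking[0])  # d1
```

In this collection "model" occurs in every document, so its idf is 0 and only "boolean" separates the documents: d1 ranks first because it is the only one containing that term.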
17. Cosine similarity
• Documents could be ranked by computing the distance between
the points representing the documents and the query.
• More commonly, a similarity measure is used (rather than a
distance or dissimilarity measure), so that the documents with the
highest scores are the most similar to the query.
• A number of similarity measures have been proposed and tested
for this purpose.
• The most successful of these is the cosine correlation similarity
measure.
• The cosine correlation measures the cosine of the angle between
the query and the document vectors.
• When the vectors are normalized so that all documents and queries
are represented by vectors of equal length, the cosine of the angle
between two identical vectors will be 1 (the angle is zero), and for
two vectors that do not share any non-zero terms, the cosine will
be 0.
18. • The cosine measure is defined as:
• $\mathrm{Cosine}(D_i, Q) = \dfrac{\sum_{j=1}^{t} d_{ij} \cdot q_j}{\sqrt{\sum_{j=1}^{t} d_{ij}^{2} \cdot \sum_{j=1}^{t} q_j^{2}}}$
• The numerator of this measure is the sum of the products
of the term weights for the matching query and document
terms (known as the dot product or inner product).
• The denominator normalizes this score by dividing by the
product of the lengths of the two vectors. There is no
theoretical reason why the cosine correlation should be
preferred to other similarity measures, but it does perform
somewhat better in evaluations of search quality.
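A minimal sketch of the cosine measure for two term-weight vectors of equal length. The example vectors are made up, and returning 0.0 for an all-zero vector is a convention chosen here, not something the slides specify.

```python
import math

def cosine(doc_vec, query_vec):
    """Cosine of the angle between two equal-length term-weight vectors."""
    # Numerator: dot product of the document and query weights.
    dot = sum(d * q for d, q in zip(doc_vec, query_vec))
    # Denominator: product of the Euclidean lengths of the two vectors.
    doc_len = math.sqrt(sum(d * d for d in doc_vec))
    query_len = math.sqrt(sum(q * q for q in query_vec))
    if doc_len == 0 or query_len == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (doc_len * query_len)

# Vectors pointing in the same direction: cosine close to 1.
print(cosine([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))
# Vectors with no shared non-zero terms: cosine is 0.
print(cosine([1.0, 0.0], [0.0, 3.0]))
```

Because the score is divided by both vector lengths, a long document does not score higher merely because it contains more term occurrences.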