2. Outline
Concept
1. Conceptual Model
2. Retrieval Unit
3. Document Representation
4. Information Needs
5. Indexing
6. Retrieval Functions
7. Evaluation
Practice
1. Search Engine
Challenge
1. Cross Lingual IR
2. Big Data
3. Personalization
4. Domain Specific IR
5. ...
4. A Conceptual Model for IR
[Diagram] Documents pass through indexing to produce the document representation; information needs pass through formulation to produce the query; the retrieval function matches query against representation to produce the retrieved documents, with relevance feedback looping from the results back to the query.
5. Definitions of IR
"Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information." (Salton, 1968)
"Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers)." (Manning, 2008)
6. Document/Retrieval Unit
Documents: web pages, email, books, news stories, scholarly papers, text messages, Word™ and PowerPoint™ files, PDFs, forum postings, patents, etc.
The retrieval unit can be:
- Part of a document, e.g. a paragraph, a slide, or a page.
- In different formats: HTML, XML, plain text, etc.
- Of different sizes/lengths.
7. Document Representation
Full Text Representation
- Keeps everything; complete.
- Requires huge resources, and keeping everything is not always better.
Reduced (Partial) Content Representation
- Remove unimportant content, e.g. stopwords.
- Standardize to reduce overlapping content, e.g. stemming.
- Retain only the important content, e.g. noun phrases, headers.
8. Document Representation
Think of a representation as a way of storing the document.
Bag of Words Model
Store a document as the bag (multiset) of its words, disregarding grammar and even word order.
Document 1: "The cat sat on the hat"
Document 2: "The dog ate the cat and the hat"
From these two documents, a word list is constructed:
{ the, cat, sat, on, hat, dog, ate, and }
The list has 8 distinct words.
Document 1: { 2, 1, 1, 1, 1, 0, 0, 0 }
Document 2: { 3, 1, 0, 0, 1, 1, 1, 1 }
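The count vectors above can be reproduced with a few lines of Python (a minimal sketch; a real system would tokenize more carefully than `split()`):

```python
def bag_of_words(text, vocab):
    """Count how often each vocabulary term occurs in the text."""
    tokens = text.lower().split()
    return [tokens.count(term) for term in vocab]

vocab = ["the", "cat", "sat", "on", "hat", "dog", "ate", "and"]
d1 = bag_of_words("The cat sat on the hat", vocab)           # [2, 1, 1, 1, 1, 0, 0, 0]
d2 = bag_of_words("The dog ate the cat and the hat", vocab)  # [3, 1, 0, 0, 1, 1, 1, 1]
```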
9. Information Needs
The things you want Google to answer are information needs.
Examples from my search history:
- Query: weka text classification
  Information Need: I want to find a tutorial on using Weka for text classification.
- Query: Dell i7 laptop
  Information Need: I want to find information on any Dell laptop that runs on an Intel i7 processor. Actually, I want to buy one, so an online store would be relevant.
10. Information Needs
Normally, you are required to formulate your information need into some keywords, known as a query.
Simple Query
- A few keywords or more.
Boolean Query
- "neural network AND speech recognition"
Special Query
- 400 myr in usd
11. Retrieved Documents
From the original collection, a subset of documents is obtained.
What factor determines which documents to return?
Simple Term Matching Approach
1. Compare the terms in each document and the query.
2. Compute the "similarity" between each document in the collection and the query based on the terms they have in common.
3. Sort the documents in order of decreasing similarity with the query.
4. The output is a ranked list displayed to the user - the top documents are judged more relevant by the system.
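The four steps above can be sketched as follows (a toy illustration that uses simple term overlap as the similarity score; the documents are invented for the example):

```python
def rank(query, docs):
    """Score each document by the number of terms it shares with the
    query, then sort in decreasing order of similarity."""
    q_terms = set(query.lower().split())
    scores = {name: len(q_terms & set(text.lower().split()))
              for name, text in docs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

docs = {"D1": "the cat sat on the hat",
        "D2": "the dog ate the cat and the hat"}
ranked = rank("cat hat dog", docs)  # D2 shares 3 terms, D1 shares 2
```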
12. Indexing
Convert documents into a representation or data structure that improves the efficiency of retrieval, generating a set of useful terms called indexes.
Why?
- A wide variety of words is used in texts, but not all are important.
- Among the important words, some are more contextually relevant.
Some basic processes involved:
- Tokenization
- Stop Words Removal
- Stemming
- Phrases
- Inverted File
13. Indexing (Tokenization)
Convert a sequence of characters into a sequence of tokens with some basic meaning.
"The cat chases the mouse."
→ the, cat, chases, the, mouse
"Bigcorp's 2007 bi-annual report showed profits rose 10%."
→ bigcorp, 2007, bi, annual, report, showed, profits, rose, 10%
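A rough tokenizer along these lines might look as follows (the possessive handling is a simplification assumed for this sketch, not a standard rule):

```python
import re

def tokenize(text):
    """Lowercase, drop possessive 's, then pull out runs of
    letters/digits, keeping a trailing % sign."""
    text = text.lower().replace("'s", "")
    return re.findall(r"[a-z0-9]+%?", text)

tokenize("The cat chases the mouse.")
# ['the', 'cat', 'chases', 'the', 'mouse']
tokenize("Bigcorp's 2007 bi-annual report showed profits rose 10%.")
# ['bigcorp', '2007', 'bi', 'annual', 'report', 'showed', 'profits', 'rose', '10%']
```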
14. Indexing (Tokenization)
A token can be a single term or multiple terms.
"Samsung Galaxy S7 Edge, redefines what a phone can do."
→ samsung galaxy s7 edge, redefines, what, a, phone, can, do
or
→ samsung, galaxy, s7, edge, redefines, what, a, ...
15. Indexing (Tokenization)
Common Issues
1. Capitalized words can have a different meaning from lowercase words
- Bush fires the officer. Query: Bush fire
- The bush fire lasted for 3 days. Query: bush fire
2. Apostrophes can be part of a word, part of a possessive, or just a mistake
- rosie o'donnell, can't, don't, 80's, 1890's, men's straw hats, master's degree, england's ten largest cities, shriner's
16. Indexing (Tokenization)
3. Numbers can be important, including decimals
- nokia 3250, top 10 courses, united 93, quicktime 6.5 pro, 92.3 the beat, 288358
4. Periods can occur in numbers, abbreviations, URLs, ends of sentences, and other situations
- I.B.M., Ph.D., cs.umass.edu, F.E.A.R.
Note: tokenizing steps for queries must be identical to the steps for documents.
17. Indexing (Stopping)
[Figure: Top 50 words from the AP89 news collection]
Recall, indexes should be useful term links to a document.
Are the high-frequency terms in the figure useful?
18. Indexing (Stopping)
A stopword list can be created from high-frequency words or based on a standard list.
Lists are customized for applications, domains, and even parts of documents
- e.g., "click" is a good stopword for anchor text
The best policy is to index all words in the documents and decide which words to use at query time.
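Stopping itself is a simple filter over the token stream (the stopword list here is a tiny assumed sample, not any standard list):

```python
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "on"}

def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

remove_stopwords(["the", "cat", "sat", "on", "the", "hat"])
# ['cat', 'sat', 'hat']
```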
19. Indexing (Stemming)
Words have many morphological variations
- inflectional (plurals, tenses)
- derivational (making verbs into nouns, etc.)
In most cases, these have the same or very similar meanings.
Stemmers attempt to reduce morphological variations of words to a common stem
- usually involves removing suffixes
Can be done at indexing time or as part of query processing (like stopwords).
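As an illustration only, a crude suffix-stripping stemmer (far simpler than the Porter stemmer described next) might look like this:

```python
def stem(word):
    """Strip one common suffix, longest first, keeping a stem of at
    least three letters. Produces stems, not necessarily words."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stem("talking")  # 'talk'
stem("cats")     # 'cat'
stem("report")   # 'report' (no suffix removed)
```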
20. Indexing (Stemming)
Porter Stemmer
- Algorithmic stemmer used in IR experiments since the 70s
- Consists of a series of rules designed to remove the longest possible suffix at each step
- Produces stems, not words
- Example: Step 1 rules [figure]
22. Indexing (Phrases)
Recall, meaningful tokens make better indexes, e.g. phrases.
Text processing issue: how are phrases recognized?
Three possible approaches:
- Identify syntactic phrases using a part-of-speech (POS) tagger
- Use word n-grams
- Store word positions in indexes and use proximity operators in queries
23. Indexing (Phrases)
POS taggers use statistical models of text to predict the syntactic tags of words.
Example tags: NN (singular noun), NNS (plural noun), VB (verb), VBD (verb, past tense), VBN (verb, past participle), IN (preposition), JJ (adjective), CC (conjunction, e.g., "and", "or"), PRP (pronoun), and MD (modal auxiliary, e.g., "can", "will").
Phrases can then be defined as simple noun groups, for example.
26. Indexing (Inverted Index)
Recall, indexes are designed to support search.
Each index term is associated with an inverted list
- Contains lists of documents, or lists of word occurrences in documents, and other information.
- Each entry is called a posting.
- The part of the posting that refers to a specific document or location is called a pointer.
- Each document in the collection is given a unique number.
- Lists are usually document-ordered (sorted by document number).
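A minimal inverted index with positional postings might be built like this (a sketch; production indexes use compressed, disk-resident structures):

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to a document-ordered list of (doc_id, position)
    postings; each posting points into a specific document."""
    index = defaultdict(list)
    for doc_id in sorted(docs):  # keeps postings sorted by document number
        for pos, term in enumerate(docs[doc_id].lower().split()):
            index[term].append((doc_id, pos))
    return dict(index)

index = build_index({1: "new straits times", 2: "new straits daily"})
# index["new"] == [(1, 0), (2, 0)]; index["daily"] == [(2, 2)]
```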
31. Retrieval Function
Ranking
Documents are retrieved in sorted order according to a score computed using the document representation, the query, and a ranking algorithm.
32. Retrieval Function (Boolean Retrieval)
Advantages
- Results are predictable and relatively easy to explain
- Many different features can be incorporated
- Efficient processing, since many documents can be eliminated from the search
Disadvantages
- Effectiveness depends entirely on the user
- Simple queries usually don't work well
- Complex queries are difficult
33. Retrieval Function (Boolean Retrieval)
A sequence of queries driven by the number of retrieved documents
- e.g. "lincoln" search of news articles:
- president AND lincoln
- president AND lincoln AND NOT (automobile OR car)
- president AND lincoln AND biography AND life AND birthplace AND gettysburg AND NOT (automobile OR car)
- president AND lincoln AND (biography OR life OR birthplace OR gettysburg) AND NOT (automobile OR car)
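With an inverted index mapping terms to sets of document numbers, such Boolean queries become set operations (the three tiny documents are invented for the example):

```python
docs = {
    1: "president lincoln biography",
    2: "lincoln automobile review",
    3: "president lincoln gettysburg address",
}
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

def postings(term):
    """Set of documents containing the term (empty if unseen)."""
    return index.get(term, set())

# president AND lincoln AND NOT (automobile OR car)
result = (postings("president") & postings("lincoln")) \
         - (postings("automobile") | postings("car"))
# result == {1, 3}
```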
34. Retrieval Function (Vector Space Model)
A ranking-based method.
Documents and the query are represented by vectors of term weights.
The collection is represented by a matrix of term weights.
35. Retrieval Function (Vector Space Model)
D1: new straits times
D2: new straits daily
D3: north borneo times
Vector of terms:
     borneo  daily  new  north  straits  times
D1        0      0    1      0        1      1
D2        0      1    1      0        1      0
D3        1      0    0      1        0      1
36. Retrieval Function (Vector Space Model)
tf.idf weight
Term frequency (tf) measures importance in the document: here, the raw count of the term in the document.
Inverse document frequency (idf) measures importance in the collection: idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t (base-10 log here).
idf(borneo)  = log(3/1) = 0.477
idf(daily)   = log(3/1) = 0.477
idf(new)     = log(3/2) = 0.176
idf(north)   = log(3/1) = 0.477
idf(straits) = log(3/2) = 0.176
idf(times)   = log(3/2) = 0.176
Multiplying tf by idf gives the weights:
     borneo  daily    new  north  straits  times
D1        0      0  0.176      0    0.176  0.176
D2        0  0.477  0.176      0    0.176      0
D3    0.477      0      0  0.477        0  0.176
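The weights in the table can be checked with a short script (base-10 logarithms, raw term counts as tf):

```python
import math

docs = {
    "D1": "new straits times".split(),
    "D2": "new straits daily".split(),
    "D3": "north borneo times".split(),
}
N = len(docs)
vocab = ["borneo", "daily", "new", "north", "straits", "times"]
df = {t: sum(t in d for d in docs.values()) for t in vocab}  # document frequency
idf = {t: math.log10(N / df[t]) for t in vocab}

def tfidf(doc):
    """tf.idf weight for each vocabulary term: raw count times idf."""
    return [round(doc.count(t) * idf[t], 3) for t in vocab]

tfidf(docs["D1"])  # [0, 0, 0.176, 0, 0.176, 0.176]
```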
37. Retrieval Function (Vector Space Model)
Documents are ranked by the distance between the points representing the query and the documents
- A similarity measure is more common than a distance or dissimilarity measure
- e.g. cosine correlation
38. Retrieval Function (Vector Space Model)
Consider two documents D1, D2 and a query Q.
Q = "straits times"
Compared against the collection vocabulary (borneo, daily, new, north, straits, times):
Q  = (0, 0, 0, 0, 0.176, 0.176)
D1 = (0, 0, 0.176, 0, 0.176, 0.176)
D2 = (0, 0.477, 0.176, 0, 0.176, 0)
Cosine(D1, Q) = (0·0 + 0·0 + 0.176·0 + 0·0 + 0.176·0.176 + 0.176·0.176)
              / (sqrt(0.176² + 0.176² + 0.176²) · sqrt(0.176² + 0.176²))
              = 0.816
Find Cosine(D2, Q).
Which document is more relevant?
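Cosine(D2, Q) can be computed the same way; a small helper confirms both scores:

```python
import math

def cosine(x, y):
    """Cosine similarity: dot product divided by the product of vector lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm

Q  = (0, 0, 0, 0, 0.176, 0.176)
D1 = (0, 0, 0.176, 0, 0.176, 0.176)
D2 = (0, 0.477, 0.176, 0, 0.176, 0)

round(cosine(D1, Q), 3)  # 0.816
round(cosine(D2, Q), 3)  # 0.231
```

D1 scores higher, so it is ranked as more relevant to the query.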
39. Evaluation
Evaluating the retrieval function, preprocessing steps, etc. is a must.
Standard Collection
- Task specific
- Human experts judge which results are relevant.
Performance Metrics
- Precision
- Recall
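Both metrics are straightforward to compute once relevance judgments exist (a sketch with made-up document IDs):

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved documents that are relevant.
    Recall: fraction of relevant documents that are retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
# (0.5, 0.6666666666666666)
```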
42. Evaluation (Collection)
Obtaining relevance judgments is an expensive, time-consuming process
- Who does it?
- What are the instructions?
- What is the level of agreement?
43. Evaluation (Collection)
Exhaustive judgments for all documents in a collection are not practical.
The pooling technique is used in TREC:
- The top k results (for TREC, k varied between 50 and 200) from the rankings obtained by different search engines (or retrieval algorithms) are merged into a pool
- Duplicates are removed
- Documents are presented in some random order to the relevance judges
This produces a large number of relevance judgments for each query, although still incomplete.
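The pooling steps map directly to code (a sketch; `rankings` stands in for the ranked lists returned by the participating systems):

```python
import random

def build_pool(rankings, k):
    """Merge the top-k results from each system, remove duplicates,
    and shuffle so judges see the documents in random order."""
    pooled = set()
    for ranking in rankings:
        pooled.update(ranking[:k])  # top k from this system; the set drops duplicates
    pooled = list(pooled)
    random.shuffle(pooled)
    return pooled

pool = build_pool([[1, 2, 3, 4], [2, 5, 6, 7]], k=2)
# contains 1, 2, 5 in some random order
```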
47. Search Engine
The most relevant application of Information Retrieval. Do you agree?
Search on the Web is a daily activity for many people throughout the world.
48. What about Database Records?
Database records (or tuples in relational databases) are typically made up of well-defined fields (or attributes)
- e.g., bank records with account numbers, balances, names, addresses, social security numbers, dates of birth, etc.
It is easy to compare fields with well-defined semantics to queries in order to find matches.
Text is more difficult.
49. Search Query vs DB Query
Example bank database query
- Find records with balance > $50,000 in branches located in Amherst, MA.
- Matches are easily found by comparison with the field values of records.
Example search engine query
- bank scandals in western mass
- This text must be compared to the text of entire news stories.
50. Dimensions of IR
IR is more than just text, and more than just web search
- although these are central
People doing IR work with different media, different types of search applications, and different tasks.
51. Other Media
New applications increasingly involve new media
- e.g., video, photos, music, speech
Like text, such content is difficult to describe and compare
- text may be used to represent it (e.g. tags)
IR approaches to search and evaluation are appropriate.
52. Dimensions of IR
Content        Applications       Tasks
Text           Web search         Ad hoc search
Images         Vertical search    Filtering
Video          Enterprise search  Classification
Scanned docs   Desktop search     Question answering
Audio          Forum search
Music          P2P search
               Literature search
53. IR and Search Engines
Information Retrieval
- Relevance: effective ranking
- Evaluation: testing and measuring
- Information needs: user interaction
Search Engines
- Performance: efficient search and indexing
- Incorporating new data: coverage and freshness
- Scalability: growing with data and users
- Adaptability: tuning for applications
- Specific problems: e.g. spam
54. Search Engine Issues
Performance
- Measuring and improving the efficiency of search
- e.g., reducing response time, increasing query throughput, increasing indexing speed
- Indexes are data structures designed to improve search efficiency
- Designing and implementing them are major issues for search engines
55. Search Engine Issues
Dynamic data
- The "collection" for most real applications is constantly changing in terms of updates, additions, and deletions
- e.g., web pages
- Acquiring or "crawling" the documents is a major task
- Typical measures are coverage (how much has been indexed) and freshness (how recently it was indexed)
- Updating the indexes while processing queries is also a design issue
56. Search Engine Issues
Scalability
- Making everything work with millions of users every day and many terabytes of documents
- Distributed processing is essential
Adaptability
- Changing and tuning search engine components, such as the ranking algorithm, indexing strategy, and interface, for different applications
58. Let's Define the Challenge Together
1. Cross Lingual IR
2. Big Data
3. Personalization
4. Domain Specific IR
5. Multi-modal IR
6. ...
59. IR Research Direction
Latest research at Google:
http://research.google.com/pubs/InformationRetrievalandtheWeb.html
60. Acknowledgement
Thank you for watching.
This presentation is prepared with some content adapted from the following sources:
a. Introduction to Information Retrieval, C. D. Manning, P. Raghavan and H. Schütze, 2009.
b. Introduction to Information Retrieval, IR Summer School 2001, Mounia Lalmas.
c. Search Engines: Information Retrieval in Practice, D. Metzler, T. Strohman and W. B. Croft, 2009.