Information Retrieval
Concept, Practice and Challenge
GUEST LECTURE BY GAN KENG HOON
14 APRIL 2016
SCHOOL OF COMPUTER SCIENCES, UNIVERSITI SAINS MALAYSIA.
1
Outlines
Concept
1. Conceptual Model
2. Retrieval Unit
3. Document Representation
4. Information Needs
5. Indexing
6. Retrieval Functions
7. Evaluation
Practice
1. Search Engine
Challenge
1. Cross Lingual IR
2. Big Data
3. Personalization
4. Domain Specific IR
5. ……
2
Concept
3
A Conceptual Model for IR
Documents → (Indexing) → Document Representation
Information Needs → (Formulation) → Query
Document Representation + Query → (Retrieval Function) → Retrieved Documents
Retrieved Documents → (Relevance Feedback) → Query
4
Definitions of IR
“Information retrieval is a field concerned with the
structure, analysis, organization, storage, searching, and
retrieval of information.” (Salton, 1968)
Information retrieval (IR) is finding material (usually
documents) of an unstructured nature (usually text) that
satisfies an information need from within large collections
(usually stored on computers). (Manning, 2008)
5
Document/Retrieval Unit
◦ Web pages, email, books, news stories, scholarly papers, text
messages, Word™, PowerPoint™, PDF, forum postings, patents,
etc.
◦ A retrieval unit can be
◦ Part of a document, e.g. a paragraph, a slide, a page etc.
◦ In different structures, e.g. HTML, XML, plain text etc.
◦ Of different sizes/lengths.
6
Document Representation
Full Text Representation
◦ Keep everything. Complete.
◦ Requires huge resources. Too much may not be good.
Reduced (partial) Content Representation
◦ Remove unimportant content, e.g. stopwords.
◦ Standardize to reduce overlapping content, e.g. stemming.
◦ Retain only important content, e.g. noun phrases, headers etc.
7
Document Representation
Think of representation as a way of storing the document.
Bag of Words Model
Represent a document as the bag (multiset) of its words, disregarding grammar
and even word order.
Document 1: "The cat sat on the hat"
Document 2: "The dog ate the cat and the hat"
From these two documents, a word list is constructed:
{ the, cat, sat, on, hat, dog, ate, and }
The list has 8 distinct words.
Document 1: { 2, 1, 1, 1, 1, 0, 0, 0 }
Document 2 : { 3, 1, 0, 0, 1, 1, 1, 1}
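The two count vectors above can be reproduced with a short sketch. It assumes lowercasing and whitespace splitting as the only preprocessing, and the function name `bag_of_words` is made up for illustration.

```python
from collections import Counter

def bag_of_words(documents):
    """Return the shared word list and one count vector per document."""
    # Assumption: lowercase and split on whitespace; no other preprocessing.
    tokenized = [doc.lower().split() for doc in documents]

    # Build the word list in order of first appearance.
    vocabulary = []
    for tokens in tokenized:
        for token in tokens:
            if token not in vocabulary:
                vocabulary.append(token)

    # Count how often each vocabulary word occurs in each document.
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

docs = ["The cat sat on the hat", "The dog ate the cat and the hat"]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)
print(vectors[0])
print(vectors[1])
```

Word order is lost by design: any reordering of the same words yields the same vector.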
8
Information Needs
The things you want Google to answer for you are
information needs.
Examples from my search history
◦ Query: weka text classification
◦ Information Need: I want to find a tutorial on using Weka for text
classification.
◦ Query: Dell i7 laptop
◦ Information Need: I want to find information on any Dell laptop that runs on an Intel
i7 processor. Actually, I want to buy one, so an online store would be relevant.
9
Information Needs
Normally, you are required to formulate your information needs as
some keywords, known as a query.
Simple Query
◦ A few keywords or more.
Boolean Query
◦ ‘neural network AND speech recognition’
Special Query
◦ 400 myr in usd
10
Retrieved Documents
From the original collection, a subset of documents is obtained.
What determines which documents are returned?
Simple Term Matching Approach
1. Compare the terms in a document and the query.
2. Compute the “similarity” between each document in the collection and
the query based on the terms they have in common.
3. Sort the documents in order of decreasing similarity with the query.
4. The output is a ranked list displayed to the user - the top ones
are more relevant as judged by the system.
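A minimal sketch of these four steps follows. It assumes that "similarity" is simply the number of distinct terms shared by the query and the document, which is only one possible choice. The three short documents are borrowed from the vector space example later in this deck, and the function name is hypothetical.

```python
def rank_by_term_overlap(query, documents):
    """Rank documents by how many distinct query terms they contain."""
    query_terms = set(query.lower().split())
    scored = []
    for doc_id, text in documents.items():
        doc_terms = set(text.lower().split())
        # Assumed similarity: count of terms in common.
        score = len(query_terms & doc_terms)
        scored.append((doc_id, score))
    # Decreasing similarity; ties are broken by document id.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

collection = {
    "D1": "new straits times",
    "D2": "new straits daily",
    "D3": "north borneo times",
}
ranking = rank_by_term_overlap("straits times", collection)
print(ranking)
```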
11
Indexing
Convert documents into a
representation or data structure to
improve the efficiency of retrieval.
Generate a set of useful terms
called indexes.
Why?
◦ A wide variety of words is used in texts,
but not all are important.
◦ Among the important words, some
are more contextually relevant.
Some basic processes involved
◦ Tokenization
◦ Stop Words Removal
◦ Stemming
◦ Phrases
◦ Inverted File
12
Indexing (Tokenization)
Convert a sequence of characters
into a sequence of tokens with
some basic meaning.
“The cat chases the mouse.”
“Bigcorp's 2007 bi-annual report
showed profits rose 10%.”
the
cat
chases
the
mouse
bigcorp
2007
bi
annual
report
showed
profits
rose
10%
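One way to obtain the token list above is sketched below. The rules are assumptions chosen to match this example only: lowercase, drop a possessive 's, split on anything that is not a letter or digit, and keep a trailing percent sign. A real tokenizer needs many more cases.

```python
import re

def tokenize(text):
    """Split text into lowercase tokens using a few illustrative rules."""
    text = text.lower()
    # Assumption: treat a trailing 's as a possessive and drop it.
    text = re.sub(r"'s\b", "", text)
    # Runs of letters/digits form tokens; a directly following % is kept.
    return re.findall(r"[a-z0-9]+%?", text)

print(tokenize("The cat chases the mouse."))
tokens = tokenize("Bigcorp's 2007 bi-annual report showed profits rose 10%.")
print(tokens)
```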
13
Indexing (Tokenization)
A token can be a single term or multiple terms.
“Samsung Galaxy S7 Edge, redefines what a phone can do.”
With a multi-term token:
samsung galaxy s7 edge
redefines
what
a
phone
can
do
or with single-term tokens only:
samsung
galaxy
s7
edge
redefines
what
a ….
14
Indexing (Tokenization)
Common Issues
1. Capitalized words can have different meaning from lower case words
◦ Bush fires the officer. Query: Bush fire
◦ The bush fire lasted for 3 days. Query: bush fire
2. Apostrophes can be a part of a word, a part of a possessive, or just a
mistake
◦ rosie o'donnell, can't, don't, 80's, 1890's, men's straw hats, master's degree,
england's ten largest cities, shriner's
15
Indexing (Tokenization)
3. Numbers can be important, including decimals
◦ nokia 3250, top 10 courses, united 93, quicktime 6.5 pro, 92.3 the beat,
288358
4. Periods can occur in numbers, abbreviations, URLs, ends of sentences, and
other situations
◦ I.B.M., Ph.D., cs.umass.edu, F.E.A.R.
Note: tokenizing steps for queries must be identical to steps for documents
16
Indexing (Stopping)
[Figure: Top 50 words from the AP89 news collection]
Recall: indexes should be useful terms that link
to a document.
Are the terms in the figure
useful?
17
Indexing (Stopping)
A stopword list can be created from high-frequency words or based on
a standard list
Lists are customized for applications, domains, and even parts of
documents
◦ e.g., “click” is a good stopword for anchor text
The best policy is to index all words in documents and decide
which words to use at query time
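As a rough sketch of the first option, a stopword list can be derived from the most frequent words in a collection and then applied to query tokens. The two short documents come from the bag-of-words slide. The cut-off `k` and both function names are assumptions for illustration; real lists are usually much longer and tuned by hand.

```python
from collections import Counter

def top_frequency_stopwords(documents, k):
    """Return the k most frequent words in the collection as a stopword set."""
    counts = Counter()
    for text in documents:
        counts.update(text.lower().split())
    return {word for word, _ in counts.most_common(k)}

def remove_stopwords(tokens, stopwords):
    """Drop tokens that appear in the stopword set (applied at query time here)."""
    return [token for token in tokens if token not in stopwords]

docs = ["The cat sat on the hat", "The dog ate the cat and the hat"]
stopwords = top_frequency_stopwords(docs, k=1)
filtered = remove_stopwords("the cat sat on the hat".split(), stopwords)
print(stopwords, filtered)
```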
18
Indexing (Stemming)
Many morphological variations of words
◦ inflectional (plurals, tenses)
◦ derivational (e.g. making verbs into nouns)
In most cases, these have the same or very similar meanings
Stemmers attempt to reduce morphological variations of words to a
common stem
◦ usually involves removing suffixes
Can be done at indexing time or as part of query processing (like
stopwords)
19
Indexing (Stemming)
Porter Stemmer
◦ Algorithmic stemmer used in
IR experiments since the 70s
◦ Consists of a series of rules
designed to remove the longest
possible suffix at each step
◦ Produces stems, not words
◦ Example: Step 1 (figure)
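To illustrate the "longest possible suffix" idea, here is a sketch of Step 1a as defined in Porter's original algorithm (SSES → SS, IES → I, SS → SS, S → nothing). It may differ in detail from the figure on the slide, which is not available in this transcript, and it covers only this one sub-step of the full stemmer.

```python
def porter_step_1a(word):
    """Apply Porter's Step 1a: handle plural-style suffixes, longest match first."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress (unchanged)
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word

for example in ["caresses", "ponies", "caress", "cats"]:
    print(example, "->", porter_step_1a(example))
```

Note that "poni" is a stem, not a word, which is what the bullet "Produces stems, not words" refers to.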
20
Indexing (Stemming)
Comparison between two stemmers.
21
Indexing (Phrases)
Recall that meaningful tokens make better indexes, e.g.
phrases.
Text processing issue – how are phrases recognized?
Three possible approaches:
◦ Identify syntactic phrases using a part-of-speech (POS) tagger
◦ Use word n-grams
◦ Store word positions in indexes and use proximity operators in
queries
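The second approach, word n-grams, is easy to sketch: slide a window of n tokens over the token list. The function name is hypothetical and the input reuses the tokens from the earlier Samsung example.

```python
def word_ngrams(tokens, n):
    """Return all contiguous runs of n tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "samsung galaxy s7 edge".split()
bigrams = word_ngrams(tokens, 2)
print(bigrams)
```

Every n-gram becomes a candidate phrase, so frequency or other filtering is normally needed afterwards to keep only the useful ones.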
22
Indexing (Phrases)
POS taggers use statistical models of text to predict
syntactic tags of words
◦ Example tags:
◦ NN (singular noun), NNS (plural noun), VB (verb), VBD (verb, past tense), VBN (verb, past
participle), IN (preposition), JJ (adjective), CC (conjunction, e.g., “and”, “or”), PRP (pronoun),
and MD (modal auxiliary, e.g., “can”, “will”).
Phrases can then be defined as simple noun groups, for
example
23
Indexing (Phrases)
Pos Tagging Example
24
Indexing (Phrases)
Example Noun Phrases
* Other methods, such as n-grams, can also be used
25
Indexing (Inverted Index)
Recall, indexes are designed to support search.
Each index term is associated with an inverted list
◦ Contains lists of documents, or lists of word occurrences in documents,
and other information.
◦ Each entry is called a posting.
◦ The part of the posting that refers to a specific document or
location is called a pointer
◦ Each document in the collection is given a unique number
◦ Lists are usually document-ordered (sorted by document number)
26
Indexing (Inverted Index)
Sample collection: 4 sentences from the Wikipedia entry for Tropical Fish
27
Indexing (Inverted Index)
Simple inverted index.
28
Indexing (Inverted Index)
Inverted index with
counts.
Support better ranking
algorithms.
29
Indexing
(Inverted Index)
Inverted index with
positions.
Support proximity
matching.
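The three index variants (documents only, with counts, with positions) can all be read off one positional structure. The sketch below uses the two short documents from the bag-of-words slide, not the Tropical Fish sentences, which are not in this transcript. The function names and the dict-of-dicts layout are assumptions for illustration.

```python
from collections import defaultdict

def build_positional_index(documents):
    """Map each term to {document number: [word positions]}, both 1-based."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(documents, start=1):
        for position, term in enumerate(text.lower().split(), start=1):
            index[term].setdefault(doc_id, []).append(position)
    return index

def phrase_docs(index, first, second):
    """Proximity match: documents where `second` directly follows `first`."""
    matches = []
    for doc_id, first_positions in index.get(first, {}).items():
        second_positions = set(index.get(second, {}).get(doc_id, []))
        if any(p + 1 in second_positions for p in first_positions):
            matches.append(doc_id)
    return sorted(matches)

docs = ["The cat sat on the hat", "The dog ate the cat and the hat"]
index = build_positional_index(docs)

simple = sorted(index["cat"])                               # documents only
counts = {d: len(pos) for d, pos in index["the"].items()}   # with counts
positions = index["the"]                                    # with positions
print(simple, counts, positions, phrase_docs(index, "cat", "sat"))
```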
30
Retrieval Function
Ranking
Documents are retrieved in sorted order according to a score computed using
the document representation, the query, and a ranking algorithm
31
Retrieval Function (Boolean Retrieval)
Advantages
◦ Results are predictable, relatively easy to explain
◦ Many different features can be incorporated
◦ Efficient processing since many documents can be eliminated from search
Disadvantages
◦ Effectiveness depends entirely on user
◦ Simple queries usually don't work well
◦ Complex queries are difficult
32
Retrieval Function (Boolean Retrieval)
Sequence of queries driven by number of retrieved documents
◦ e.g. “lincoln” search of news articles
◦ president AND lincoln
◦ president AND lincoln AND NOT (automobile OR car)
◦ president AND lincoln AND biography AND life AND birthplace AND gettysburg
AND NOT (automobile OR car)
◦ president AND lincoln AND (biography OR life OR birthplace OR gettysburg)
AND NOT (automobile OR car)
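Boolean operators map directly onto set operations over posting lists, as the sketch below shows for a shortened version of the query sequence above. The posting sets are entirely hypothetical document numbers, invented only to show how each refinement changes the result size.

```python
# Hypothetical postings: term -> set of document numbers containing it.
postings = {
    "president": {1, 2, 3, 5, 6},
    "lincoln": {1, 2, 4, 5, 6},
    "automobile": {4},
    "car": {2, 4},
    "biography": {1},
    "life": {5},
}

# president AND lincoln
q1 = postings["president"] & postings["lincoln"]

# ... AND NOT (automobile OR car)
q2 = q1 - (postings["automobile"] | postings["car"])

# ... AND biography AND life  (too strict: nothing survives)
q3 = q2 & postings["biography"] & postings["life"]

# ... AND (biography OR life)  (relaxed with OR)
q4 = q2 & (postings["biography"] | postings["life"])

print(sorted(q1), sorted(q2), sorted(q3), sorted(q4))
```

A document is either in the result set or not, so there is no ranking. This is why the user ends up adjusting the query by hand.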
33
Retrieval Function (Vector Space Model)
Rank-based method.
Documents and query represented by a vector of term weights
Collection represented by a matrix of term weights
34
Retrieval Function (Vector Space Model)
D1: new straits times
D2: new straits daily
D3: north borneo times
Vector of terms
      borneo  daily  new  north  straits  times
D1    0       0      1    0      1        1
D2    0       1      1    0      1        0
D3    1       0      0    1      0        1
35
Retrieval Function (Vector Space Model)
tf.idf weight
Term frequency (tf) weight measures
importance in the document.
Inverse document frequency (idf) measures
importance in the collection (log base 10):
idf (borneo) = log(3/1) = 0.477
idf (daily) = log(3/1) = 0.477
idf (new) = log(3/2) = 0.176
idf (north) = log(3/1) = 0.477
idf (straits) = log(3/2) = 0.176
idf (times) = log(3/2) = 0.176
then multiply by tf
      borneo  daily  new    north  straits  times
D1    0       0      0.176  0      0.176    0.176
D2    0       0.477  0.176  0      0.176    0
D3    0.477   0      0      0.477  0        0.176
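The weights on this slide can be recomputed with a few lines. The sketch assumes tf is the raw term count in the document and idf is log10(N / df), which is what the numbers on the slide imply. The exact tf formula from the original slide is not in this transcript, so treat that part as an assumption.

```python
import math

def tf_idf_matrix(documents):
    """Return (vocabulary, idf per term, tf.idf vector per document)."""
    tokenized = {doc_id: text.split() for doc_id, text in documents.items()}
    vocabulary = sorted({term for tokens in tokenized.values() for term in tokens})
    n_docs = len(tokenized)

    idf = {}
    for term in vocabulary:
        # Document frequency: number of documents containing the term.
        df = sum(1 for tokens in tokenized.values() if term in tokens)
        idf[term] = math.log10(n_docs / df)

    # Assumption: tf is the raw count of the term in the document.
    weights = {
        doc_id: [tokens.count(term) * idf[term] for term in vocabulary]
        for doc_id, tokens in tokenized.items()
    }
    return vocabulary, idf, weights

docs = {
    "D1": "new straits times",
    "D2": "new straits daily",
    "D3": "north borneo times",
}
vocabulary, idf, weights = tf_idf_matrix(docs)
print(vocabulary)
for doc_id, vector in weights.items():
    print(doc_id, [round(w, 3) for w in vector])
```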
36
Retrieval Function (Vector Space Model)
Documents ranked by distance between points representing
query and documents
◦ Similarity measure more common than a distance or dissimilarity
measure
◦ e.g. Cosine correlation
37
Retrieval Function (Vector Space Model)
Consider two documents D1, D2 and a query Q
Q = “straits times”
Compare against collection,
(borneo, daily, new, north, straits, times)
Q = (0, 0, 0, 0, 0.176, 0.176)
D1 = (0, 0, 0.176, 0, 0.176, 0.176)
D2 = (0, 0.477, 0.176, 0, 0.176, 0)
\[
\mathrm{Cosine}(D_1, Q) = \frac{(0 \cdot 0) + (0 \cdot 0) + (0.176 \cdot 0) + (0 \cdot 0) + (0.176 \cdot 0.176) + (0.176 \cdot 0.176)}{\sqrt{(0.176^2 + 0.176^2 + 0.176^2)\,(0.176^2 + 0.176^2)}} = 0.816
\]
Find Cosine (D2,Q).
Which document is more relevant?
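To check the hand calculation and try the exercise, the cosine measure can be sketched as the dot product divided by the product of the vector lengths. The zero-vector guard is an added assumption so that the function never divides by zero.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Assumption: an all-zero vector has similarity 0 with everything.
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Term order: (borneo, daily, new, north, straits, times)
Q = [0, 0, 0, 0, 0.176, 0.176]
D1 = [0, 0, 0.176, 0, 0.176, 0.176]
D2 = [0, 0.477, 0.176, 0, 0.176, 0]

print("Cosine(D1, Q) =", round(cosine(D1, Q), 3))
print("Cosine(D2, Q) =", round(cosine(D2, Q), 3))
```

The document with the larger cosine value is ranked higher for this query.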
38
Evaluation
It is a must to evaluate the retrieval function, preprocessing steps etc.
Standard Collection
◦ Task specific
◦ Human experts are used to judge relevant results.
Performance Metric
◦ Precision
◦ Recall
39
Evaluation (Collection)
Test collections consisting of documents, queries, and relevance
judgments, e.g.,
40
Evaluation (Collection)
41
Evaluation (Collection)
Obtaining relevance judgments is an expensive, time-consuming
process
◦ who does it?
◦ what are the instructions?
◦ what is the level of agreement?
42
Evaluation (Collection)
Exhaustive judgments for all documents in a collection are not
practical
Pooling technique is used in TREC
◦ top k results (for TREC, k varied between 50 and 200) from the rankings
obtained by different search engines (or retrieval algorithms) are merged into
a pool
◦ duplicates are removed
◦ documents are presented in some random order to the relevance judges
Produces a large number of relevance judgments for each query,
although still incomplete
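The pooling steps can be sketched as: take the top k from every ranking, merge them into a set to remove duplicates, then shuffle before judging. The run data, document ids and the `seed` parameter are hypothetical. The seed is only there to make the shuffle repeatable.

```python
import random

def build_pool(rankings, k, seed=None):
    """Merge the top-k results of several rankings into one shuffled pool."""
    pool = set()
    for ranking in rankings:
        pool.update(ranking[:k])        # a set drops duplicates automatically
    pooled = sorted(pool)               # fixed starting order before shuffling
    random.Random(seed).shuffle(pooled) # random order for the relevance judges
    return pooled

# Hypothetical rankings from three retrieval systems for one query.
runs = [
    ["d3", "d1", "d7", "d2"],
    ["d1", "d4", "d3", "d9"],
    ["d5", "d1", "d2", "d3"],
]
pool = build_pool(runs, k=2, seed=0)
print(pool)
```

Documents outside every top k (here d7, d2 and d9) are never judged, which is why the judgments remain incomplete.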
43
Evaluation (Effectiveness Measures)
A is the set of relevant documents,
B is the set of retrieved documents
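With A and B as sets, the two metrics named earlier follow directly. The formulas on the original slide are not in this transcript, so the sketch assumes the standard definitions: precision = |A ∩ B| / |B| and recall = |A ∩ B| / |A|. The document ids are hypothetical.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for one query, given two collections of ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)    # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

A = {"d1", "d2", "d3", "d4"}   # relevant documents (hypothetical)
B = {"d1", "d3", "d5"}         # retrieved documents (hypothetical)
precision, recall = precision_recall(A, B)
print(precision, recall)
```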
44
Evaluation (Ranking Effectiveness)
45
Practice
SEARCH ENGINE
46
Search Engine
The most relevant application of
Information Retrieval.
Do you agree?
Search on the Web is a daily activity for
many people throughout the world.
47
What about Database Records
Database records (or tuples in relational databases) are typically
made up of well-defined fields (or attributes)
◦ e.g., bank records with account numbers, balances, names, addresses, social
security numbers, dates of birth, etc.
Easy to compare fields with well-defined semantics to queries in
order to find matches
Text is more difficult
48
Search Query vs DB Query
Example bank database query
◦ Find records with balance > $50,000 in branches located in Amherst, MA.
◦ Matches easily found by comparison with field values of records
Example search engine query
◦ bank scandals in western mass
◦ This text must be compared to the text of entire news stories
49
Dimensions of IR
IR is more than just text, and more than just web search
◦ although these are central
People doing IR work with different media, different types
of search applications, and different tasks
50
Other Media
New applications increasingly involve new media
◦ e.g., video, photos, music, speech
Like text, content is difficult to describe and compare
◦ text may be used to represent them (e.g. tags)
IR approaches to search and evaluation are appropriate
51
Dimensions of IR
Content        Applications        Tasks
Text           Web search          Ad hoc search
Images         Vertical search     Filtering
Video          Enterprise search   Classification
Scanned docs   Desktop search      Question answering
Audio          Forum search
Music          P2P search
               Literature search
52
IR and Search Engines
Information Retrieval
Relevance
- Effective ranking
Evaluation
- Testing and measuring
Information needs
- User interaction
Search Engines
Performance
- Efficient search and indexing
Incorporating new data
- Coverage and freshness
Scalability
- Growing with data and users
Adaptability
- Tuning for applications
Specific problems
- e.g. Spam
53
Search Engine Issues
Performance
◦ Measuring and improving the efficiency of search
◦ e.g., reducing response time, increasing query throughput, increasing indexing speed
◦ Indexes are data structures designed to improve search efficiency
◦ designing and implementing them are major issues for search engines
54
Search Engine Issues
Dynamic data
◦ The “collection” for most real applications is constantly changing in
terms of updates, additions, deletions
◦ e.g., web pages
◦ Acquiring or “crawling” the documents is a major task
◦ Typical measures are coverage (how much has been indexed) and freshness (how recently
was it indexed)
◦ Updating the indexes while processing queries is also a design
issue
55
Search Engine Issues
Scalability
◦ Making everything work with millions of users every day, and many
terabytes of documents
◦ Distributed processing is essential
Adaptability
◦ Changing and tuning search engine components such as ranking
algorithm, indexing strategy, interface for different applications
56
Challenge
57
Let's Define the Challenge Together
1. Cross Lingual IR
2. Big Data
3. Personalization
4. Domain Specific IR
5. Multi modal IR
6. ….
58
IR Research Direction
Latest Research at Google
http://research.google.com/pubs/InformationRetrievalandtheWeb.html
59
Acknowledgement
Thank you for Watching …
This presentation is prepared with
some content adapted from the
following sources.
a. Introduction to Information Retrieval, C. D. Manning, P. Raghavan and H. Schütze, 2009.
b. Introduction to Information Retrieval, IR Summer School 2001, Mounia Lalmas.
c. Search Engines: Information Retrieval in Practice, D. Metzler, T. Strohman and W. B. Croft, 2009.
60

More Related Content

What's hot

Information Retrieval Models
Information Retrieval ModelsInformation Retrieval Models
Information Retrieval ModelsNisha Arankandath
Ā 
Model of information retrieval (3)
Model  of information retrieval (3)Model  of information retrieval (3)
Model of information retrieval (3)9866825059
Ā 
Latest trends in AI and information Retrieval
Latest trends in AI and information Retrieval Latest trends in AI and information Retrieval
Latest trends in AI and information Retrieval Abhay Ratnaparkhi
Ā 
Chain indexing
Chain indexingChain indexing
Chain indexingsilambu111
Ā 
Information Retrieval
Information RetrievalInformation Retrieval
Information Retrievalssbd6985
Ā 
Probabilistic information retrieval models & systems
Probabilistic information retrieval models & systemsProbabilistic information retrieval models & systems
Probabilistic information retrieval models & systemsSelman Bozkır
Ā 
Interoperability Protocols and Standards in LIS
Interoperability Protocols and Standards in LISInteroperability Protocols and Standards in LIS
Interoperability Protocols and Standards in LISADINET Ahmedabad
Ā 
Technical services presentation
Technical services presentation Technical services presentation
Technical services presentation Ali Hassan Maken
Ā 
Probabilistic retrieval model
Probabilistic retrieval modelProbabilistic retrieval model
Probabilistic retrieval modelbaradhimarch81
Ā 
Text mining
Text miningText mining
Text miningAli A Jalil
Ā 
Z39.50: Information Retrieval protocol ppt
Z39.50: Information Retrieval protocol pptZ39.50: Information Retrieval protocol ppt
Z39.50: Information Retrieval protocol pptSUNILKUMARSINGH
Ā 
Introduction to the Semantic Web
Introduction to the Semantic WebIntroduction to the Semantic Web
Introduction to the Semantic WebTomek Pluskiewicz
Ā 
Research Methods in Library and Information Science: Trends and Tips for Rese...
Research Methods in Library and Information Science: Trends and Tips for Rese...Research Methods in Library and Information Science: Trends and Tips for Rese...
Research Methods in Library and Information Science: Trends and Tips for Rese...OCLC
Ā 
Information Retrieval Fundamentals - An introduction
Information Retrieval Fundamentals - An introduction Information Retrieval Fundamentals - An introduction
Information Retrieval Fundamentals - An introduction Grace Hui Yang
Ā 
CS6007 information retrieval - 5 units notes
CS6007   information retrieval - 5 units notesCS6007   information retrieval - 5 units notes
CS6007 information retrieval - 5 units notesAnandh Arumugakan
Ā 
Introduction to Information Retrieval & Models
Introduction to Information Retrieval & ModelsIntroduction to Information Retrieval & Models
Introduction to Information Retrieval & ModelsMounia Lalmas-Roelleke
Ā 
Ppt evaluation of information retrieval system
Ppt evaluation of information retrieval systemPpt evaluation of information retrieval system
Ppt evaluation of information retrieval systemsilambu111
Ā 
Natural Language Processing: L02 words
Natural Language Processing: L02 wordsNatural Language Processing: L02 words
Natural Language Processing: L02 wordsananth
Ā 

What's hot (20)

Information Retrieval Models
Information Retrieval ModelsInformation Retrieval Models
Information Retrieval Models
Ā 
Model of information retrieval (3)
Model  of information retrieval (3)Model  of information retrieval (3)
Model of information retrieval (3)
Ā 
Latest trends in AI and information Retrieval
Latest trends in AI and information Retrieval Latest trends in AI and information Retrieval
Latest trends in AI and information Retrieval
Ā 
Chain indexing
Chain indexingChain indexing
Chain indexing
Ā 
Information Retrieval
Information RetrievalInformation Retrieval
Information Retrieval
Ā 
Probabilistic information retrieval models & systems
Probabilistic information retrieval models & systemsProbabilistic information retrieval models & systems
Probabilistic information retrieval models & systems
Ā 
Interoperability Protocols and Standards in LIS
Interoperability Protocols and Standards in LISInteroperability Protocols and Standards in LIS
Interoperability Protocols and Standards in LIS
Ā 
Technical services presentation
Technical services presentation Technical services presentation
Technical services presentation
Ā 
Probabilistic retrieval model
Probabilistic retrieval modelProbabilistic retrieval model
Probabilistic retrieval model
Ā 
Text mining
Text miningText mining
Text mining
Ā 
DESIDOC
DESIDOC DESIDOC
DESIDOC
Ā 
Z39.50: Information Retrieval protocol ppt
Z39.50: Information Retrieval protocol pptZ39.50: Information Retrieval protocol ppt
Z39.50: Information Retrieval protocol ppt
Ā 
International Digital Library Initiatives
International Digital Library InitiativesInternational Digital Library Initiatives
International Digital Library Initiatives
Ā 
Introduction to the Semantic Web
Introduction to the Semantic WebIntroduction to the Semantic Web
Introduction to the Semantic Web
Ā 
Research Methods in Library and Information Science: Trends and Tips for Rese...
Research Methods in Library and Information Science: Trends and Tips for Rese...Research Methods in Library and Information Science: Trends and Tips for Rese...
Research Methods in Library and Information Science: Trends and Tips for Rese...
Ā 
Information Retrieval Fundamentals - An introduction
Information Retrieval Fundamentals - An introduction Information Retrieval Fundamentals - An introduction
Information Retrieval Fundamentals - An introduction
Ā 
CS6007 information retrieval - 5 units notes
CS6007   information retrieval - 5 units notesCS6007   information retrieval - 5 units notes
CS6007 information retrieval - 5 units notes
Ā 
Introduction to Information Retrieval & Models
Introduction to Information Retrieval & ModelsIntroduction to Information Retrieval & Models
Introduction to Information Retrieval & Models
Ā 
Ppt evaluation of information retrieval system
Ppt evaluation of information retrieval systemPpt evaluation of information retrieval system
Ppt evaluation of information retrieval system
Ā 
Natural Language Processing: L02 words
Natural Language Processing: L02 wordsNatural Language Processing: L02 words
Natural Language Processing: L02 words
Ā 

Similar to Information retrieval concept, practice and challenge

Concepts and Challenges of Text Retrieval for Search Engine
Concepts and Challenges of Text Retrieval for Search EngineConcepts and Challenges of Text Retrieval for Search Engine
Concepts and Challenges of Text Retrieval for Search EngineGan Keng Hoon
Ā 
Reflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data systemReflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data systemTrey Grainger
Ā 
Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...
Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...
Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...Lucidworks
Ā 
Semantic Search Component
Semantic Search ComponentSemantic Search Component
Semantic Search ComponentMario Flecha
Ā 
Progress Towards Leveraging Natural Language Processing for Collecting Experi...
Progress Towards Leveraging Natural Language Processing for Collecting Experi...Progress Towards Leveraging Natural Language Processing for Collecting Experi...
Progress Towards Leveraging Natural Language Processing for Collecting Experi...Anubhav Jain
Ā 
Text Analytics - JCC2014 Kimelfeld
Text Analytics - JCC2014 KimelfeldText Analytics - JCC2014 Kimelfeld
Text Analytics - JCC2014 KimelfeldPedro Contreras Flores
Ā 
Extracting and Making Use of Materials Data from Millions of Journal Articles...
Extracting and Making Use of Materials Data from Millions of Journal Articles...Extracting and Making Use of Materials Data from Millions of Journal Articles...
Extracting and Making Use of Materials Data from Millions of Journal Articles...Anubhav Jain
Ā 
A Spark-Based Intelligent Assistant: Making Data Exploration in Natural Langu...
A Spark-Based Intelligent Assistant: Making Data Exploration in Natural Langu...A Spark-Based Intelligent Assistant: Making Data Exploration in Natural Langu...
A Spark-Based Intelligent Assistant: Making Data Exploration in Natural Langu...Databricks
Ā 
DBLP-SSE: A DBLP Search Support Engine
DBLP-SSE: A DBLP Search Support EngineDBLP-SSE: A DBLP Search Support Engine
DBLP-SSE: A DBLP Search Support EngineYi Zeng
Ā 
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkThe Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkKrishna Sankar
Ā 
Structured Document Search and Retrieval
Structured Document Search and RetrievalStructured Document Search and Retrieval
Structured Document Search and RetrievalOptum
Ā 
Effective Semantics for Engineering NLP Systems
Effective Semantics for Engineering NLP SystemsEffective Semantics for Engineering NLP Systems
Effective Semantics for Engineering NLP SystemsAndre Freitas
Ā 
SLA Summer 2008
SLA Summer 2008SLA Summer 2008
SLA Summer 2008Joe Buzzanga
Ā 
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & OpportunitiesDeep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & OpportunitiesMatthew Lease
Ā 
Data Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsData Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsDerek Kane
Ā 
Text Mining
Text MiningText Mining
Text Miningsathish sak
Ā 
Search Analytics: Conversations with Your Customers
Search Analytics: Conversations with Your CustomersSearch Analytics: Conversations with Your Customers
Search Analytics: Conversations with Your Customersrichwig
Ā 
Topic detecton by clustering and text mining
Topic detecton by clustering and text miningTopic detecton by clustering and text mining
Topic detecton by clustering and text miningIRJET Journal
Ā 
Chapter 1 Introduction to ISR (1).pdf
Chapter 1 Introduction to ISR (1).pdfChapter 1 Introduction to ISR (1).pdf
Chapter 1 Introduction to ISR (1).pdfJemalNesre1
Ā 
BIAM 410 Final Paper - Beyond the Buzzwords: Big Data, Machine Learning, What...
BIAM 410 Final Paper - Beyond the Buzzwords: Big Data, Machine Learning, What...BIAM 410 Final Paper - Beyond the Buzzwords: Big Data, Machine Learning, What...
BIAM 410 Final Paper - Beyond the Buzzwords: Big Data, Machine Learning, What...Thomas Rones
Ā 

Similar to Information retrieval concept, practice and challenge (20)

Concepts and Challenges of Text Retrieval for Search Engine
Concepts and Challenges of Text Retrieval for Search EngineConcepts and Challenges of Text Retrieval for Search Engine
Concepts and Challenges of Text Retrieval for Search Engine
Ā 
Reflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data systemReflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data system
Ā 
Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...
Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...
Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...
Ā 
Semantic Search Component
Semantic Search ComponentSemantic Search Component
Semantic Search Component
Ā 
Progress Towards Leveraging Natural Language Processing for Collecting Experi...
Progress Towards Leveraging Natural Language Processing for Collecting Experi...Progress Towards Leveraging Natural Language Processing for Collecting Experi...
Progress Towards Leveraging Natural Language Processing for Collecting Experi...
Ā 
Text Analytics - JCC2014 Kimelfeld
Text Analytics - JCC2014 KimelfeldText Analytics - JCC2014 Kimelfeld
Text Analytics - JCC2014 Kimelfeld
Ā 
Extracting and Making Use of Materials Data from Millions of Journal Articles...
Extracting and Making Use of Materials Data from Millions of Journal Articles...Extracting and Making Use of Materials Data from Millions of Journal Articles...
Extracting and Making Use of Materials Data from Millions of Journal Articles...
Ā 
A Spark-Based Intelligent Assistant: Making Data Exploration in Natural Langu...
A Spark-Based Intelligent Assistant: Making Data Exploration in Natural Langu...A Spark-Based Intelligent Assistant: Making Data Exploration in Natural Langu...
A Spark-Based Intelligent Assistant: Making Data Exploration in Natural Langu...
Ā 
DBLP-SSE: A DBLP Search Support Engine
DBLP-SSE: A DBLP Search Support EngineDBLP-SSE: A DBLP Search Support Engine
DBLP-SSE: A DBLP Search Support Engine
Ā 
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkThe Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
Ā 
Structured Document Search and Retrieval
Structured Document Search and RetrievalStructured Document Search and Retrieval
Structured Document Search and Retrieval
Ā 
Effective Semantics for Engineering NLP Systems
Effective Semantics for Engineering NLP SystemsEffective Semantics for Engineering NLP Systems
Effective Semantics for Engineering NLP Systems
Ā 
SLA Summer 2008
SLA Summer 2008SLA Summer 2008
SLA Summer 2008
Ā 
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & OpportunitiesDeep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
Ā 
Data Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsData Science - Part XI - Text Analytics
Data Science - Part XI - Text Analytics
Ā 
Text Mining
Text MiningText Mining
Text Mining
Ā 
Search Analytics: Conversations with Your Customers
Search Analytics: Conversations with Your CustomersSearch Analytics: Conversations with Your Customers
Search Analytics: Conversations with Your Customers
Ā 
Topic detecton by clustering and text mining
Topic detecton by clustering and text miningTopic detecton by clustering and text mining
Topic detecton by clustering and text mining
Ā 
Chapter 1 Introduction to ISR (1).pdf
Chapter 1 Introduction to ISR (1).pdfChapter 1 Introduction to ISR (1).pdf
Chapter 1 Introduction to ISR (1).pdf
Ā 
BIAM 410 Final Paper - Beyond the Buzzwords: Big Data, Machine Learning, What...
BIAM 410 Final Paper - Beyond the Buzzwords: Big Data, Machine Learning, What...BIAM 410 Final Paper - Beyond the Buzzwords: Big Data, Machine Learning, What...
BIAM 410 Final Paper - Beyond the Buzzwords: Big Data, Machine Learning, What...
Ā 

Information retrieval concept, practice and challenge

  • 1. Information Retrieval: Concept, Practice and Challenge. Guest lecture by Gan Keng Hoon, 14 April 2016, School of Computer Sciences, Universiti Sains Malaysia.
  • 2. Outline
      Concept: 1. Conceptual Model 2. Retrieval Unit 3. Document Representation 4. Information Needs 5. Indexing 6. Retrieval Functions 7. Evaluation
      Practice: 1. Search Engine
      Challenge: 1. Cross-Lingual IR 2. Big Data 3. Personalization 4. Domain-Specific IR 5. …
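  • 4. A Conceptual Model for IR [Diagram: Documents are turned into a Document Representation by Indexing; Information Needs are turned into a Query by Formulation; the Retrieval Function matches the two to produce Retrieved Documents, with Relevance Feedback as a return path.]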
  • 5. Definitions of IR
      “Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.” (Salton, 1968)
      Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). (Manning, 2008)
  • 6. Document/Retrieval Unit
      ◦ Web pages, email, books, news stories, scholarly papers, text messages, Word™, PowerPoint™, PDF, forum postings, patents, etc.
      ◦ A retrieval unit can be
        ◦ Part of a document, e.g. a paragraph, a slide, a page, etc.
        ◦ In different structures, e.g. HTML, XML, plain text, etc.
        ◦ Of different sizes/lengths.
  • 7. Document Representation
      Full Text Representation
        ◦ Keeps everything. Complete.
        ◦ Requires huge resources. Too much may not be good.
      Reduced (partial) Content Representation
        ◦ Remove unimportant content, e.g. stopwords.
        ◦ Standardize to reduce overlapping content, e.g. stemming.
        ◦ Retain only important content, e.g. noun phrases, headers, etc.
  • 8. Document Representation Think of a representation as a way of storing the document.
      Bag of Words Model: store a document as the bag (multiset) of its words, disregarding grammar and even word order.
      Document 1: "The cat sat on the hat"
      Document 2: "The dog ate the cat and the hat"
      From these two documents, a word list is constructed: { the, cat, sat, on, hat, dog, ate, and }. The list has 8 distinct words.
      Document 1: { 2, 1, 1, 1, 1, 0, 0, 0 }
      Document 2: { 3, 1, 0, 0, 1, 1, 1, 1 }
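The bag-of-words example above can be reproduced with a few lines of Python. This is only a minimal sketch: the function names `build_vocabulary` and `bag_of_words` are illustrative, and the tokenization is a plain lowercase whitespace split, which is enough for the two sample documents but not for real text.

```python
from collections import Counter


def build_vocabulary(docs):
    """Return the distinct words of all documents, in order of first appearance."""
    vocab, seen = [], set()
    for doc in docs:
        for word in doc.lower().split():
            if word not in seen:
                seen.add(word)
                vocab.append(word)
    return vocab


def bag_of_words(doc, vocab):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(doc.lower().split())
    return [counts[word] for word in vocab]


docs = ["The cat sat on the hat", "The dog ate the cat and the hat"]
vocab = build_vocabulary(docs)
vectors = [bag_of_words(doc, vocab) for doc in docs]

print(vocab)
for vector in vectors:
    print(vector)
```

Word order is lost in the vectors: any permutation of the words in a document yields the same counts.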
  • 9. Information Needs The things you want Google to answer for you are information needs. Examples from my search history:
      ◦ Query: weka text classification
      ◦ Information Need: I want to find a tutorial about using Weka for text classification.
      ◦ Query: Dell i7 laptop
      ◦ Information Need: I want to find information about any Dell laptop that runs on an Intel i7 processor. Actually, I want to buy one, so an online store would be relevant.
  • 10. Information Needs Normally, you are required to formulate your information need as a set of keywords, known as a query.
      Simple Query
        ◦ A few keywords or more.
      Boolean Query
        ◦ ‘neural network AND speech recognition’
      Special Query
        ◦ 400 myr in usd
  • 11. Retrieved Documents From the original collection, a subset of documents is obtained. What determines which documents to return?
      Simple Term Matching Approach
      1. Compare the terms in a document and the query.
      2. Compute the “similarity” between each document in the collection and the query based on the terms they have in common.
      3. Sort the documents in order of decreasing similarity with the query.
      4. Output a ranked list and display it to the user; the top documents are the most relevant as judged by the system.
  • 12. Indexing Convert documents into a representation or data structure that improves the efficiency of retrieval, by generating a set of useful terms called indexes. Why?
      ◦ Texts use a wide variety of words, but not all of them are important.
      ◦ Among the important words, some are more contextually relevant.
      Some basic processes involved
      ◦ Tokenization
      ◦ Stop Words Removal
      ◦ Stemming
      ◦ Phrases
      ◦ Inverted File
  • 13. Indexing (Tokenization) Convert a sequence of characters into a sequence of tokens with some basic meaning.
      “The cat chases the mouse.” → the, cat, chases, the, mouse
      “Bigcorp's 2007 bi-annual report showed profits rose 10%.” → bigcorp, 2007, bi, annual, report, showed, profits, rose, 10%
  • 14. Indexing (Tokenization) A token can be a single term or multiple terms. “Samsung Galaxy S7 Edge, redefines what a phone can do.” samsung galaxy s7 edge redefines what a phone can do, or samsung galaxy s7 edge redefines what a …
  • 15. Indexing (Tokenization) Common Issues
      1. Capitalized words can have a different meaning from lower-case words
        ◦ Bush fires the officer. Query: Bush fire
        ◦ The bush fire lasted for 3 days. Query: bush fire
      2. Apostrophes can be a part of a word, a part of a possessive, or just a mistake
        ◦ rosie o'donnell, can't, don't, 80's, 1890's, men's straw hats, master's degree, england's ten largest cities, shriner's
  • 16. Indexing (Tokenization)
      3. Numbers can be important, including decimals
        ◦ nokia 3250, top 10 courses, united 93, quicktime 6.5 pro, 92.3 the beat, 288358
      4. Periods can occur in numbers, abbreviations, URLs, ends of sentences, and other situations
        ◦ I.B.M., Ph.D., cs.umass.edu, F.E.A.R.
      Note: the tokenizing steps for queries must be identical to the steps for documents.
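As a rough illustration of the tokenization slides, the sketch below reproduces the two examples from slide 13 with a small regular-expression tokenizer. The rules (lowercasing, dropping a possessive 's, splitting on hyphens, keeping a trailing % on numbers) are assumptions chosen to match those examples, not a general-purpose tokenizer; the function name `tokenize` is illustrative.

```python
import re

# A number (optionally with decimals and a trailing %) or a run of letters.
TOKEN_RE = re.compile(r"\d+(?:\.\d+)?%?|[a-z]+")


def tokenize(text):
    """Lowercase the text, drop possessive 's, and extract simple tokens."""
    text = text.lower()
    text = re.sub(r"['’]s\b", "", text)  # "bigcorp's" -> "bigcorp"
    return TOKEN_RE.findall(text)


doc_tokens = tokenize("Bigcorp's 2007 bi-annual report showed profits rose 10%.")
query_tokens = tokenize("Bi-Annual Report")  # same function for queries

print(doc_tokens)
print(query_tokens)
print(tokenize("I.B.M."))  # abbreviations fall apart into single letters
```

The same function is applied to documents and queries, as the note on slide 16 requires. The last call shows one of the listed issues: with these simple rules, a dotted abbreviation is split into single-letter tokens.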
  • 17. Indexing (Stopping) [Figure: top 50 words from the AP89 news collection] Recall that indexes should be useful terms that link to a document. Are the terms in the figure useful?
  • 18. Indexing (Stopping)
      A stopword list can be created from high-frequency words or based on a standard list.
      Lists are customized for applications, domains, and even parts of documents
        ◦ e.g., “click” is a good stopword for anchor text
      The best policy is to index all words in documents and decide which words to use at query time.
  • 19. Indexing (Stemming)
      Words have many morphological variations
        ◦ inflectional (plurals, tenses)
        ◦ derivational (making verbs into nouns, etc.)
      In most cases, these have the same or very similar meanings.
      Stemmers attempt to reduce morphological variations of words to a common stem
        ◦ this usually involves removing suffixes
      Stemming can be done at indexing time or as part of query processing (like stopwords).
  • 20. Indexing (Stemming) Porter Stemmer
      ◦ Algorithmic stemmer used in IR experiments since the 70s
      ◦ Consists of a series of rules designed to remove the longest possible suffix at each step
      ◦ Produces stems, not words
      ◦ Example: Step 1 [figure]
  • 22. Indexing (Phrases) Recall that meaningful tokens, e.g. phrases, make better indexes.
      Text processing issue: how are phrases recognized? Three possible approaches:
        ◦ Identify syntactic phrases using a part-of-speech (POS) tagger
        ◦ Use word n-grams
        ◦ Store word positions in indexes and use proximity operators in queries
  • 23. Indexing (Phrases) POS taggers use statistical models of text to predict syntactic tags of words
      ◦ Example tags:
        ◦ NN (singular noun), NNS (plural noun), VB (verb), VBD (verb, past tense), VBN (verb, past participle), IN (preposition), JJ (adjective), CC (conjunction, e.g., “and”, “or”), PRP (pronoun), and MD (modal auxiliary, e.g., “can”, “will”).
      Phrases can then be defined as simple noun groups, for example.
  • 25. Indexing (Phrases) Example noun phrases [figure]. * Other methods, such as n-grams, can also be used.
  • 26. Indexing (Inverted Index) Recall that indexes are designed to support search. Each index term is associated with an inverted list
      ◦ It contains lists of documents, or lists of word occurrences in documents, and other information.
      ◦ Each entry is called a posting.
      ◦ The part of the posting that refers to a specific document or location is called a pointer.
      ◦ Each document in the collection is given a unique number.
      ◦ Lists are usually document-ordered (sorted by document number).
  • 27. Indexing (Inverted Index) Sample collection: 4 sentences from the Wikipedia entry for Tropical Fish.
  • 28. Indexing (Inverted Index) Simple inverted index.
  • 29. Indexing (Inverted Index) Inverted index with counts. Supports better ranking algorithms.
  • 30. Indexing (Inverted Index) Inverted index with positions. Supports proximity matching.
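The inverted index slides rely on figures that did not survive in this transcript, so here is a minimal sketch of an inverted index with positions. It assumes the three short documents from the vector space slides (new straits times, new straits daily, north borneo times) rather than the Tropical Fish sentences, and the function names `build_positional_index` and `phrase_match` are illustrative.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to a document-ordered list of (doc_id, [positions]) postings."""
    index = defaultdict(list)
    for doc_id in sorted(docs):  # document-ordered inverted lists
        positions = defaultdict(list)
        for pos, term in enumerate(docs[doc_id].lower().split(), start=1):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            index[term].append((doc_id, pos_list))
    return dict(index)


def phrase_match(index, first, second):
    """Return ids of documents where `second` occurs immediately after `first`."""
    first_postings = dict(index.get(first, []))
    hits = []
    for doc_id, positions in index.get(second, []):
        previous = set(first_postings.get(doc_id, []))
        if any(pos - 1 in previous for pos in positions):
            hits.append(doc_id)
    return hits


docs = {1: "new straits times", 2: "new straits daily", 3: "north borneo times"}
index = build_positional_index(docs)

print(index["times"])                            # postings with positions
print(len(index["new"]))                         # document frequency of "new"
print(phrase_match(index, "straits", "times"))   # proximity (adjacent words)
```

Dropping the position lists gives the simple inverted index of slide 28, and replacing each list by its length gives the index with counts of slide 29.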
  • 31. Retrieval Function Ranking: documents are retrieved in sorted order according to a score computed using the document representation, the query, and a ranking algorithm.
  • 32. Retrieval Function (Boolean Retrieval)
      Advantages
        ◦ Results are predictable and relatively easy to explain
        ◦ Many different features can be incorporated
        ◦ Efficient processing, since many documents can be eliminated from the search
      Disadvantages
        ◦ Effectiveness depends entirely on the user
        ◦ Simple queries usually don’t work well
        ◦ Complex queries are difficult
  • 33. Retrieval Function (Boolean Retrieval) A sequence of queries driven by the number of retrieved documents
      ◦ e.g. a “lincoln” search of news articles
      ◦ president AND lincoln
      ◦ president AND lincoln AND NOT (automobile OR car)
      ◦ president AND lincoln AND biography AND life AND birthplace AND gettysburg AND NOT (automobile OR car)
      ◦ president AND lincoln AND (biography OR life OR birthplace OR gettysburg) AND NOT (automobile OR car)
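The query sequence above maps directly onto set operations over postings: AND is intersection, OR is union, and AND NOT is set difference. The sketch below assumes a tiny, entirely hypothetical collection of seven one-line "news articles" made up for illustration; only the query terms come from the slide.

```python
def build_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index


# Hypothetical documents, invented for this example.
docs = {
    1: "president lincoln biography",
    2: "lincoln car dealership",
    3: "president lincoln gettysburg address",
    4: "lincoln automobile review",
    5: "president kennedy biography",
    6: "president lincoln car museum",
    7: "president lincoln cabinet members",
}
index = build_index(docs)


def postings(term):
    """Return the posting set of a term (empty if the term is unseen)."""
    return index.get(term, set())


vehicles = postings("automobile") | postings("car")
topics = ["biography", "life", "birthplace", "gettysburg"]

q1 = postings("president") & postings("lincoln")
q2 = q1 - vehicles
q3 = q2.intersection(*(postings(t) for t in topics))   # all topics required
q4 = q2 & set().union(*(postings(t) for t in topics))  # any topic suffices

print(sorted(q1), sorted(q2), sorted(q3), sorted(q4))
```

On this toy collection the all-AND query returns nothing, while the OR group recovers the two biographical articles, which mirrors why the slide's user moves from the third query to the fourth.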
  • 34. Retrieval Function (Vector Space Model) A ranking-based method. Documents and the query are each represented by a vector of term weights; the collection is represented by a matrix of term weights.
  • 35. Retrieval Function (Vector Space Model) Vector of terms
      D1: new straits times
      D2: new straits daily
      D3: north borneo times

            borneo  daily  new  north  straits  times
      D1    0       0      1    0      1        1
      D2    0       1      1    0      1        0
      D3    1       0      0    1      0        1
  • 36. Retrieval Function (Vector Space Model) tf.idf weight
      The term frequency (tf) weight measures a term's importance in a document; the inverse document frequency (idf) measures its importance in the collection. With 3 documents in the collection and base-10 logarithms:
      idf(borneo) = log(3/1) = 0.477
      idf(daily) = log(3/1) = 0.477
      idf(new) = log(3/2) = 0.176
      idf(north) = log(3/1) = 0.477
      idf(straits) = log(3/2) = 0.176
      idf(times) = log(3/2) = 0.176
      Then multiply by tf:

            borneo  daily  new    north  straits  times
      D1    0       0      0.176  0      0.176    0.176
      D2    0       0.477  0.176  0      0.176    0
      D3    0.477   0      0      0.477  0        0.176
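The tf.idf table can be recomputed from the three documents. This is a minimal sketch that assumes raw term counts for tf and idf = log10(N / n), where N is the number of documents and n the number of documents containing the term; that choice is inferred from the numbers on the slide, and the names `idf` and `tfidf_vector` are illustrative.

```python
import math

docs = {
    "D1": "new straits times",
    "D2": "new straits daily",
    "D3": "north borneo times",
}

# Alphabetical vocabulary, matching the column order of the table.
vocab = sorted({term for text in docs.values() for term in text.split()})


def idf(term):
    """Inverse document frequency with a base-10 logarithm."""
    containing = sum(term in text.split() for text in docs.values())
    return math.log10(len(docs) / containing)


def tfidf_vector(text):
    """Weight each vocabulary term by raw term frequency times idf."""
    tokens = text.split()
    return [round(tokens.count(term) * idf(term), 3) for term in vocab]


weights = {name: tfidf_vector(text) for name, text in docs.items()}

print(vocab)
for name, vector in weights.items():
    print(name, vector)
```

Terms that occur in two of the three documents (new, straits, times) get the lower weight 0.176, while terms unique to one document get 0.477.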
  • 37. Retrieval Function (Vector Space Model) Documents are ranked by the distance between the points representing the query and the documents
      ◦ A similarity measure is more common than a distance or dissimilarity measure
      ◦ e.g. cosine correlation
  • 38. Retrieval Function (Vector Space Model) Consider two documents D1, D2 and a query Q.
      Q = “straits times”
      Compare against the collection vocabulary (borneo, daily, new, north, straits, times):
      Q = (0, 0, 0, 0, 0.176, 0.176)
      D1 = (0, 0, 0.176, 0, 0.176, 0.176)
      D2 = (0, 0.477, 0.176, 0, 0.176, 0)
      $$\mathrm{Cosine}(D_1, Q) = \frac{0 \cdot 0 + 0 \cdot 0 + 0.176 \cdot 0 + 0 \cdot 0 + 0.176 \cdot 0.176 + 0.176 \cdot 0.176}{\sqrt{(0.176^2 + 0.176^2 + 0.176^2)(0.176^2 + 0.176^2)}} = 0.816$$
      Find Cosine(D2, Q). Which document is more relevant?
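The cosine computation can be checked with a short script. This sketch uses the rounded weight vectors exactly as they appear on the slide; the function name `cosine` is illustrative.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Vocabulary order: borneo, daily, new, north, straits, times
Q = (0, 0, 0, 0, 0.176, 0.176)
D1 = (0, 0, 0.176, 0, 0.176, 0.176)
D2 = (0, 0.477, 0.176, 0, 0.176, 0)

scores = {"D1": cosine(D1, Q), "D2": cosine(D2, Q)}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(name, round(scores[name], 3))
```

D1 shares both query terms and ranks first; D2 shares only "straits" and also carries the heavily weighted term "daily", which lengthens its vector and lowers its cosine.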
  • 39. Evaluation It is a must to evaluate the retrieval function, preprocessing steps, etc.
      Standard Collection
        ◦ Task specific
        ◦ Human experts are used to judge relevant results.
      Performance Metric
        ◦ Precision
        ◦ Recall
  • 40. Evaluation (Collection) Test collections consist of documents, queries, and relevance judgments. [Figure: example test collections]
  • 42. Evaluation (Collection) Obtaining relevance judgments is an expensive, time-consuming process
      ◦ who does it?
      ◦ what are the instructions?
      ◦ what is the level of agreement?
  • 43. Evaluation (Collection) Exhaustive judgments for all documents in a collection are not practical. A pooling technique is used in TREC
      ◦ the top k results (for TREC, k varied between 50 and 200) from the rankings obtained by different search engines (or retrieval algorithms) are merged into a pool
      ◦ duplicates are removed
      ◦ documents are presented in some random order to the relevance judges
      This produces a large number of relevance judgments for each query, although they are still incomplete.
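The pooling steps above can be sketched in a few lines. The rankings and document ids below are made up for illustration, k is set to 2 only to keep the example small, and the function name `build_pool` and its `seed` parameter are assumptions of this sketch, not part of any TREC tooling.

```python
import random


def build_pool(rankings, k, seed=None):
    """Merge the top-k results of each ranking, drop duplicates, and shuffle."""
    pool = set()
    for ranking in rankings:
        pool.update(ranking[:k])        # top k from each system
    pooled = sorted(pool)               # fixed starting order before shuffling
    random.Random(seed).shuffle(pooled)  # random presentation order for judges
    return pooled


# Hypothetical rankings from three retrieval systems for one query.
rankings = [
    ["d3", "d1", "d7", "d2"],
    ["d1", "d5", "d3", "d9"],
    ["d8", "d3", "d1", "d4"],
]
pool = build_pool(rankings, k=2, seed=42)

print(pool)
```

Documents that no system ranked in its top k (here d2, d4, d7 and d9) are never judged, which is why the resulting judgments stay incomplete.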
  • 44. Evaluation (Effectiveness Measures) A is the set of relevant documents; B is the set of retrieved documents.
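With A and B defined as above, the standard set-based measures are precision = |A ∩ B| / |B| and recall = |A ∩ B| / |A|. The slide's own figure is missing from this transcript, so the sketch below assumes these textbook definitions; the document ids and the function name `precision_recall` are invented for illustration.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for a set of relevant and retrieved documents."""
    hits = relevant & retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


A = {"d1", "d2", "d3", "d4"}  # relevant documents (hypothetical judgments)
B = {"d2", "d4", "d7"}        # retrieved documents (hypothetical system output)

precision, recall = precision_recall(A, B)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Here two of the three retrieved documents are relevant (precision 2/3), but only two of the four relevant documents were found (recall 1/2).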
  • 47. Search Engine The most relevant application of Information Retrieval. Do you agree? Searching the Web is a daily activity for many people throughout the world.
  • 48. What about Database Records Database records (or tuples in relational databases) are typically made up of well-defined fields (or attributes)
      ◦ e.g., bank records with account numbers, balances, names, addresses, social security numbers, dates of birth, etc.
      It is easy to compare fields with well-defined semantics to queries in order to find matches. Text is more difficult.
  • 49. Search Query vs DB Query
      Example bank database query
        ◦ Find records with balance > $50,000 in branches located in Amherst, MA.
        ◦ Matches are easily found by comparison with the field values of records
      Example search engine query
        ◦ bank scandals in western mass
        ◦ This text must be compared to the text of entire news stories
  • 50. Dimensions of IR IR is more than just text, and more than just web search
      ◦ although these are central
      People doing IR work with different media, different types of search applications, and different tasks.
  • 51. Other Media New applications increasingly involve new media
      ◦ e.g., video, photos, music, speech
      Like text, the content is difficult to describe and compare
      ◦ text may be used to represent it (e.g. tags)
      IR approaches to search and evaluation are appropriate.
  • 52. Dimensions of IR
      Content: Text, Images, Video, Scanned docs, Audio, Music
      Applications: Web search, Vertical search, Enterprise search, Desktop search, Forum search, P2P search, Literature search
      Tasks: Ad hoc search, Filtering, Classification, Question answering
  • 53. IR and Search Engines
      Information Retrieval
        ◦ Relevance: effective ranking
        ◦ Evaluation: testing and measuring
        ◦ Information needs: user interaction
      Search Engines
        ◦ Performance: efficient search and indexing
        ◦ Incorporating new data: coverage and freshness
        ◦ Scalability: growing with data and users
        ◦ Adaptability: tuning for applications
        ◦ Specific problems: e.g. spam
  • 54. Search Engine Issues Performance
      ◦ Measuring and improving the efficiency of search
        ◦ e.g., reducing response time, increasing query throughput, increasing indexing speed
      ◦ Indexes are data structures designed to improve search efficiency
        ◦ designing and implementing them are major issues for search engines
  • 55. Search Engine Issues Dynamic data
      ◦ The “collection” for most real applications is constantly changing in terms of updates, additions, deletions
        ◦ e.g., web pages
      ◦ Acquiring or “crawling” the documents is a major task
        ◦ Typical measures are coverage (how much has been indexed) and freshness (how recently it was indexed)
      ◦ Updating the indexes while processing queries is also a design issue
  • 56. Search Engine Issues
      Scalability
        ◦ Making everything work with millions of users every day, and many terabytes of documents
        ◦ Distributed processing is essential
      Adaptability
        ◦ Changing and tuning search engine components such as the ranking algorithm, indexing strategy, and interface for different applications
  • 58. Let’s Define the Challenge Together 1. Cross-Lingual IR 2. Big Data 3. Personalization 4. Domain-Specific IR 5. Multimodal IR 6. …
  • 59. IR Research Direction Latest Research at Google: http://research.google.com/pubs/InformationRetrievalandtheWeb.html
  • 60. Acknowledgement Thank you for watching. This presentation was prepared with some content adapted from the following sources.
      a. Introduction to Information Retrieval, C. D. Manning, P. Raghavan and H. Schutze, 2009.
      b. Introduction to Information Retrieval, IR Summer School 2001, Mounia Lalmas.
      c. Search Engines: Information Retrieval in Practice, D. Metzler, T. Strohman and W. B. Croft, 2009.