Query Expansion
Cluster based using N Grams
UMA K L (201305514)
SPANDAN VEGGALAM (201307674)
MAHAVER CHOPRA (201101011)
AKSHAT KANDELWAL (201001095)
INTERNATIONAL INSTITUTE OF INFORMATION TECHNOLOGY -HYDERABAD
Query Expansion
• A key feature of a search engine
• In many cases it is difficult to determine the user's search intent
• Users do not always formulate their query in the best way
• Query recommendation helps users formulate queries, to a certain extent
• Improves retrieval performance: from the suggestions, the user selects an alternate query
that is relevant to their intent
• Increases the recall of the information retrieval system
Expanding Queries
The following techniques are used for expanding queries:
1. Spelling correction
2. Finding and searching with “Synonyms” of the input query terms
3. Augmenting the query with additional terms
In our approach we focus on augmenting queries and searching with synonyms.
Our Approach
Two phases:
1. Offline Phase
   1. Add synonyms to the documents
   2. Cluster the documents, so that similar documents are grouped into the same cluster
   3. Index and label the clusters
   Only nouns are indexed.
2. Online Phase
   1. Search for clusters using a phrase query
   2. Predict words for query augmentation
   3. Re-weight the query and suggest the top queries as query recommendations
   Only nouns are considered as augmenting words.
Our Approach – Offline Phase
Why Clustering?
1. Clustering widens the scope of query suggestions to cover different contexts
2. Documents are clustered together and the clusters are indexed
3. Search is performed on the cluster index
4. Relevant clusters are used to find the augmenting terms
5. The top N query suggestions from each cluster are considered
Clustering Parameters Used
Algorithm: K-Means
Number of Clusters: 150
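To make the clustering step concrete, here is a minimal sketch of k-means over bag-of-words vectors. It is not the CLUTO configuration used in this work: the cosine-similarity (spherical) variant, the whitespace tokenizer, the "first k documents" initialization and the function names are all assumptions chosen to keep the example small and deterministic.

```python
import math
from collections import Counter

def tf_vector(text):
    """Unit-length term-frequency vector, stored as a dict (assumed representation)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {term: c / norm for term, c in counts.items()}

def cosine(a, b):
    """Dot product of two unit-length sparse vectors."""
    return sum(w * b.get(term, 0.0) for term, w in a.items())

def centroid(vectors):
    """Normalized sum of the member vectors."""
    total = Counter()
    for vec in vectors:
        for term, w in vec.items():
            total[term] += w
    norm = math.sqrt(sum(w * w for w in total.values())) or 1.0
    return {term: w / norm for term, w in total.items()}

def kmeans(docs, k, iterations=10):
    """Return one cluster id per document (spherical k-means sketch)."""
    vectors = [tf_vector(d) for d in docs]
    centroids = vectors[:k]  # assumption: seed with the first k documents
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assign every document to its most similar centroid.
        assignment = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                      for v in vectors]
        # Recompute each centroid from its current members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = centroid(members)
    return assignment

docs = ["cricket match score team", "stock market shares price",
        "cricket team wins match", "market price falls shares"]
print(kmeans(docs, k=2))
```

With the full data set, k would be set to 150 as listed above; the toy call uses k=2 only so the grouping is easy to inspect.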
Our Approach – Offline Phase
Adding Synonyms
1. Allows the user to search with synonyms as well
2. Ideally, the system should accept synonyms and retrieve the same relevant documents
3. The top 5% of words from each document are selected, and synonyms of these words are added to the document
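The synonym step can be sketched as follows. The slides do not say how the "top 5%" of words is ranked, so this example assumes ranking by term frequency; the small `TOY_SYNONYMS` dictionary is a hypothetical stand-in for WordNet lookups, and `add_synonyms` is an illustrative name.

```python
import math
from collections import Counter

# Hypothetical toy thesaurus; the actual system looks synonyms up in WordNet.
TOY_SYNONYMS = {"car": ["automobile"], "price": ["cost"]}

def add_synonyms(text, synonyms=TOY_SYNONYMS, fraction=0.05):
    """Append synonyms of the most frequent words to the document's word list."""
    words = text.lower().split()
    counts = Counter(words)
    # Assumption: "top 5%" means the most frequent distinct words;
    # keep at least one so that short documents are still expanded.
    n_top = max(1, math.ceil(len(counts) * fraction))
    top_words = [w for w, _ in counts.most_common(n_top)]
    extra = [s for w in top_words for s in synonyms.get(w, [])]
    return words + extra

print(add_synonyms("car price car dealer car"))
```

Because the synonyms are appended to the document before indexing, a query that uses "automobile" can match a story that only ever says "car".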
Labeling Clusters
1. Clusters are tagged with their most relevant terms
2. A label contains a set of terms that distinguish the cluster from other clusters
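One way to pick distinguishing label terms is to favor words that are frequent inside a cluster but occur in few other clusters. The scoring below (term count times the log of inverse cluster frequency) is an assumption for illustration, not the labeling method reported in the slides; `label_clusters` and its parameters are hypothetical names.

```python
import math
from collections import Counter

def label_clusters(clusters, n_terms=2):
    """clusters: dict mapping cluster id -> list of document strings."""
    term_counts = {
        cid: Counter(w for doc in docs for w in doc.lower().split())
        for cid, docs in clusters.items()
    }
    # In how many clusters does each term occur?
    cluster_freq = Counter(t for counts in term_counts.values() for t in counts)
    labels = {}
    for cid, counts in term_counts.items():
        # Terms shared by every cluster score 0 and sink to the bottom.
        scores = {t: c * math.log(len(clusters) / cluster_freq[t])
                  for t, c in counts.items()}
        # Sort by score, breaking ties alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        labels[cid] = ranked[:n_terms]
    return labels

clusters = {
    0: ["cricket match today", "cricket team wins"],
    1: ["market shares today", "market price falls"],
}
print(label_clusters(clusters, n_terms=1))
```

In this toy input "today" appears in both clusters, so it never becomes a label even though it is as frequent as several other words.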
Our Approach – Online Phase
1. Retrieve the relevant clusters for the given input query.
2. Select the top ‘N’ clusters.
3. If the given query can be represented as an N-gram (its words occur in sequence in a document):
   1. Because the words are sequential and come from the same document, the user's intent is clear. The next word in
   the document can be suggested as an augmenting word.
   2. Retrieve the next sequential word from the cluster, which is a set of documents.
   3. Augment the query with these predicted words, retrieve the top queries, and present them to the user as query
   recommendations.
4. Else, if the query terms are separated by some distance:
   1. Predict the next word for each input term and add these words to a list.
   2. Identify the tags of the clusters and add them to the list. Tags are words that give information about the input
   words taken together.
   3. Here the user's intent is not clear, so tags that give the category/topic/context of the document are also
   considered for augmentation.
   4. Augment the query with these predicted words, retrieve the top queries, and present them to the user as query
   recommendations.
5. If the given words are far from each other, it is very difficult to correlate them:
   1. Sequential words cannot be used as augmenting words: the input words may come from different
   contexts, which makes it hard to retrieve relevant documents and thereby reduces recall.
   2. Cluster tags, which give some information about the input words taken together, are
   used as the augmenting terms.
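The online cases above can be sketched in a few lines. This is a simplified illustration under several assumptions: documents are tokenized on whitespace, no part-of-speech filter is applied (the actual approach keeps only nouns), and the `far_apart` flag is a hypothetical switch, since the slides do not say how the distance between query terms is measured.

```python
from collections import Counter

def next_words(docs, phrase, top_n=3):
    """Most frequent words that directly follow `phrase` in the documents."""
    target = phrase.lower().split()
    if not target:
        return []
    followers = Counter()
    for doc in docs:
        words = doc.lower().split()
        # Stop early enough that a following word always exists.
        for i in range(len(words) - len(target)):
            if words[i:i + len(target)] == target:
                followers[words[i + len(target)]] += 1
    return [w for w, _ in followers.most_common(top_n)]

def augment_terms(query, cluster_docs, cluster_tags, far_apart=False, top_n=3):
    """Candidate augmenting terms for a query against one retrieved cluster."""
    if far_apart:
        # Case 5: terms cannot be correlated, so rely on cluster tags only.
        return list(cluster_tags)
    # Case 3: the whole query occurs as an n-gram; suggest what follows it.
    sequential = next_words(cluster_docs, query, top_n)
    if sequential:
        return sequential
    # Case 4: predict a next word per term, then add the cluster tags.
    candidates = []
    for term in query.split():
        candidates.extend(next_words(cluster_docs, term, top_n))
    candidates.extend(cluster_tags)
    return list(dict.fromkeys(candidates))  # drop duplicates, keep order

docs = ["the prime minister visited delhi",
        "prime minister announced budget",
        "budget session prime time"]
print(augment_terms("prime minister", docs, ["politics"]))
print(augment_terms("minister budget", docs, ["politics"]))
```

Each returned term would then be appended to the original query, and the resulting candidate queries re-weighted and ranked before being shown as recommendations.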
Architecture – Offline Phase
Architecture – Online Phase
Tools & Data Set Used
Tools
◦ WordNet: used to identify synonyms
◦ CLUTO: clusters the documents
◦ doc2mat: converts the documents into matrix format
◦ Apache Lucene: used for indexing and querying
◦ Stanford NLP POS tagger: identifies the part of speech of each word
Data Set
◦ The data set consists of news stories from The Telegraph (Calcutta) newspaper
◦ Stories are categorized into
◦ “FrontPage”, “Nation”, “Calcutta”, “Bengal”, “Foreign”, “Business”, “Sports”, “Opinion”, “Metro”
◦ The format of each story in the data set is
◦ <DocNo>*</DocNo><Text>*</Text>
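A story in this format can be read with a small regular expression. The sketch assumes that `*` stands for arbitrary content, that several stories may be concatenated in one file, and that document numbers are unique; the sample input is made up for illustration.

```python
import re

# Non-greedy groups so that consecutive stories are matched one at a time.
STORY_PATTERN = re.compile(r"<DocNo>(.*?)</DocNo>\s*<Text>(.*?)</Text>", re.DOTALL)

def parse_stories(raw):
    """Return a dict mapping document number to story text."""
    return {doc_no.strip(): text.strip()
            for doc_no, text in STORY_PATTERN.findall(raw)}

# Made-up sample in the stated format.
sample = ("<DocNo>1</DocNo><Text>Cricket match today</Text>\n"
          "<DocNo>2</DocNo><Text>Market falls</Text>")
print(parse_stories(sample))
```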
Evaluation
We ran the augmented queries over the data set to retrieve relevant documents and found
a considerable increase in recall and precision.
The following bar diagram shows the change in precision and recall for random augmented queries
formulated from 30 input queries.
Evaluation
Open this file for evaluation results of all augmented queries for 30 input queries.
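For reference, precision and recall for a single query can be computed as below. These are the standard set-based definitions; the document ids in the example are invented, and how relevance judgments were obtained for the 30 input queries is not described in the slides.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Guard against empty sets to avoid division by zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Invented ids: 4 documents retrieved, 3 relevant, 2 in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"])
print(f"precision={p:.2f} recall={r:.2f}")
```

Comparing these two numbers for the original query and for its augmented version gives the per-query change that the bar diagram summarizes.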
Future Work
1. The approach can be extended to use query logs
◦ Query logs can serve as a knowledge base for suggesting queries, and also help in suggesting queries
asynchronously
2. User preferences can be used to filter documents by relevance
3. Different mixed versions of Markov models can be used to achieve the best balance between accuracy
and coverage, in terms of both data-centric (objective) and user-centric (subjective) evaluation metrics
4. Different N-gram variations can be used to make the approach better suited to a real-time search engine
Conclusion
1. We have explained our approach to query expansion, which uses clustering to
draw suggestions from various contexts, and N-gram and Markov models to determine the augmenting
terms
2. We applied a sequential probabilistic model because it is suitable for the task of online query
recommendation
3. We achieved accuracy and coverage in terms of data.
4. We measured the time and memory complexity of our application and found it suitable for a
real-time search engine

Query expansion

  • 1. Query Expansion: Cluster-Based Using N-Grams
    UMA K L (201305514), SPANDAN VEGGALAM (201307674), MAHAVER CHOPRA (201101011), AKSHAT KANDELWAL (201001095)
    INTERNATIONAL INSTITUTE OF INFORMATION TECHNOLOGY, HYDERABAD
  • 2. Query Expansion
    • A key feature of a search engine.
    • In many cases it is difficult to determine the user's search intent.
    • Users do not always formulate the query in the best way.
    • Query recommendation helps users formulate queries, to a certain extent.
    • It improves retrieval performance: the user selects, from the suggestions, the alternate query that is relevant to their intent.
    • It increases the recall of the information retrieval system.
  • 3. Expanding Queries
    The following techniques are used for expanding queries:
    1. Spelling correction
    2. Finding and searching with synonyms of the input query terms
    3. Augmenting the query with additional terms
    Our approach focuses on augmenting queries and searching with synonyms.
  • 4. Our Approach: Two Phases
    1. Offline Phase
       1. Add synonyms to the documents.
       2. Cluster the documents so that similar documents are grouped together.
       3. Index and label the clusters. Only nouns are indexed.
    2. Online Phase
       1. Search the clusters with the input as a phrase query.
       2. Predict words for query augmentation.
       3. Re-weight the query and suggest the top queries as query recommendations. Only nouns are considered as augmented words.
  • 5. Our Approach – Offline Phase
    Why clustering?
    1. Clustering widens the scope for suggesting queries from different contexts.
    2. Documents are clustered together and indexed.
    3. Search is performed on the cluster index.
    4. Relevant clusters are used to find augmentation terms.
    5. The top N query suggestions from each cluster are considered.
    Clustering parameters used
    Algorithm: K-Means
    Number of clusters: 150
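    As a rough illustration of the clustering step, the sketch below runs K-Means over term-frequency vectors in plain Python. It is not the CLUTO setup used in the project: the cosine similarity measure, the deterministic seeding with the first k documents, and all function names are assumptions made for this example, and k is kept tiny instead of 150.

```python
import math
from collections import Counter

def tf_vector(text):
    """Bag-of-words term-frequency vector for one document."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse vectors (dict-like)."""
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {t: c / len(vectors) for t, c in total.items()}

def kmeans(vectors, k, iterations=10):
    """Return one cluster index per vector."""
    # Assumption for this sketch: seed with the first k documents
    # so that the result is deterministic.
    centroids = [dict(v) for v in vectors[:k]]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each document to its most similar centroid.
        assignment = [
            max(range(k), key=lambda c: cosine(v, centroids[c]))
            for v in vectors
        ]
        # Recompute each centroid from its current members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = centroid(members)
    return assignment

# Hypothetical toy documents, not taken from the project's data set.
docs = [
    "cricket match score india",
    "stock market shares fall",
    "india cricket team wins match",
    "market shares rise stock exchange",
]
clusters = kmeans([tf_vector(d) for d in docs], k=2)
```

    With these four toy documents the two cricket stories end up in one cluster and the two market stories in the other.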
  • 6. Our Approach – Offline Phase
    Adding synonyms
    1. Allows the user to search with synonyms as well.
    2. Ideally, the system should accept synonyms and retrieve the same relevant documents.
    3. The top 5% of words from each document are considered, and synonyms are added for these words.
    Labeling clusters
    1. Clusters are tagged with their most relevant terms.
    2. A label contains a set of terms that distinguishes the cluster from other clusters.
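    The two offline steps above can be sketched as follows. The slides do not say how the "top 5%" of words is ranked or how label terms are scored, so this example assumes ranking by frequency and a TF-IDF-style score across clusters. The `SYNONYMS` dictionary is a hypothetical stand-in for a WordNet lookup, and all names are illustrative.

```python
import math
from collections import Counter

# Hypothetical stand-in for a WordNet synonym lookup.
SYNONYMS = {"car": ["automobile"], "film": ["movie"]}

def top_terms(tokens, fraction=0.05):
    """Most frequent `fraction` of a document's distinct words (at least one)."""
    counts = Counter(tokens)
    n = max(1, math.ceil(len(counts) * fraction))
    return [word for word, _ in counts.most_common(n)]

def add_synonyms(tokens, fraction=0.05):
    """Append synonyms of the top terms so the document matches either form."""
    extra = []
    for word in top_terms(tokens, fraction):
        extra.extend(SYNONYMS.get(word, []))
    return tokens + extra

def label_clusters(clusters, n_terms=3):
    """clusters: one token list per cluster. Returns one label (term list) each."""
    # Number of clusters in which each term occurs.
    df = Counter()
    for tokens in clusters:
        df.update(set(tokens))
    labels = []
    for tokens in clusters:
        tf = Counter(tokens)
        # Terms frequent here but rare in other clusters score highest.
        score = {t: tf[t] * math.log(len(clusters) / df[t]) for t in tf}
        ranked = sorted(score, key=lambda t: (-score[t], t))
        labels.append(ranked[:n_terms])
    return labels

doc = ["car", "car", "car", "road", "city"]
expanded = add_synonyms(doc)  # "automobile" is appended for the top term "car"

labels = label_clusters(
    [["cricket", "cricket", "match", "news"],
     ["stock", "stock", "market", "news"]],
    n_terms=2,
)
```

    A term such as "news" that appears in every cluster gets a score of zero, which is what keeps labels distinctive.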
  • 7. Our Approach – Online Phase
    1. Retrieve the relevant clusters for the given input query.
    2. Select the top N clusters.
    3. If the given query can be represented as an N-gram:
       1. Because the words are sequential and come from the same document, the user's intent is clear, so the next word in the document can be suggested as an augmentation word.
       2. Retrieve the next sequential word from the cluster, which is a set of documents.
       3. Augment the query with these predicted words, retrieve the top queries, and present them to the user as query recommendations.
  • 8. Our Approach – Online Phase
    4. Else, if the query terms are separated by some distance:
       1. Predict the next word for each input term and add these words to a list.
       2. Identify the tags of the clusters and add them to the list. Tags are words that give information about the input words taken together.
       3. Here the user's intent is not clear, so tags that give the category, topic, or context of the document are also considered for augmentation.
       4. Augment the query with these predicted words, retrieve the top queries, and present them to the user as query recommendations.
    5. If the given words are far from each other, it is very difficult to correlate them:
       1. Sequential words cannot be used as augmentation words: the input words may come from different contexts, which makes it hard to retrieve relevant documents and reduces recall.
       2. Cluster tags are words that give some information about the input words taken together, and they are used as the augmentation terms.
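    The online-phase branching described in steps 3 and 4 can be sketched as below for a single retrieved cluster. This is a simplified illustration under stated assumptions: documents are pre-tokenized lists, candidates are ranked by raw follower counts, and the noun-only filter and query re-weighting from the slides are omitted. The function names are hypothetical.

```python
from collections import Counter

def find_phrase_followers(docs, query):
    """Count the words that immediately follow `query` as a contiguous phrase."""
    n = len(query)
    followers = Counter()
    for tokens in docs:
        for i in range(len(tokens) - n):
            if tokens[i:i + n] == query:
                followers[tokens[i + n]] += 1
    return followers

def suggest(docs, tags, query, top_n=3):
    """docs: token lists of one cluster; tags: that cluster's label terms."""
    followers = find_phrase_followers(docs, query)
    if followers:
        # Contiguous case: the query occurs as an n-gram in the cluster,
        # so the next sequential words are the augmentation candidates.
        candidates = [w for w, _ in followers.most_common(top_n)]
    else:
        # Non-contiguous case: best next word for each term, then cluster tags.
        candidates = []
        for term in query:
            per_term = find_phrase_followers(docs, [term])
            candidates.extend(w for w, _ in per_term.most_common(1))
        candidates.extend(tags)
    # Drop duplicates and words already in the query, keeping order.
    seen, unique = set(query), []
    for word in candidates:
        if word not in seen:
            seen.add(word)
            unique.append(word)
    return [" ".join(query + [word]) for word in unique[:top_n]]

# Hypothetical cluster contents and tag.
cluster_docs = [
    ["world", "cup", "final", "match"],
    ["world", "cup", "final", "score"],
    ["cricket", "team", "wins"],
]
cluster_tags = ["sports"]

contiguous = suggest(cluster_docs, cluster_tags, ["world", "cup"])
# → ["world cup final"]
scattered = suggest(cluster_docs, cluster_tags, ["cricket", "cup"])
# → ["cricket cup team", "cricket cup final", "cricket cup sports"]
```

    In the scattered case the tag "sports" is appended after the per-term predictions, mirroring the idea that tags supply context when the user's intent is unclear.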
  • 11. Tools & Data Set Used
    Tools
    ◦ WordNet: used to identify synonyms
    ◦ CLUTO: clusters the documents
    ◦ doc2mat: represents the documents in matrix format
    ◦ Apache Lucene: used for indexing and querying
    ◦ Stanford NLP POS tagger: identifies the part of speech of each word
    Data Set
    ◦ The data set consists of newspaper stories from the Telegraph, Calcutta.
    ◦ Stories are categorized into “FrontPage”, “Nation”, “Calcutta”, “Bengal”, “Foreign”, “Business”, “Sports”, “Opinion”, “Metro”.
    ◦ The format of each story in the data set is <DocNo>*</DocNo><Text>*</Text>
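    A minimal sketch of reading stories in the `<DocNo>*</DocNo><Text>*</Text>` format is shown below. It assumes the tags are not nested and that several stories may sit in one string; the sample document numbers and text are invented for the example, and `parse_stories` is a hypothetical helper name.

```python
import re

# Non-greedy groups so that each story is matched separately;
# DOTALL lets the story text span several lines.
STORY = re.compile(r"<DocNo>(.*?)</DocNo>\s*<Text>(.*?)</Text>", re.DOTALL)

def parse_stories(raw):
    """Return a mapping from document number to story text."""
    return {doc_no.strip(): text.strip() for doc_no, text in STORY.findall(raw)}

# Hypothetical sample input.
raw = (
    "<DocNo>doc-1</DocNo><Text>First story.</Text>\n"
    "<DocNo>doc-2</DocNo><Text>Second\nstory.</Text>"
)
stories = parse_stories(raw)
```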
  • 12. Evaluation
    We ran the augmented queries over the data set to retrieve relevant documents and found a considerable increase in recall and precision values. The following bar diagram gives the change in precision and recall values for random augmented queries formulated over 30 input queries.
  • 13. Evaluation Open this file for evaluation results of all augmented queries for 30 input queries.
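    For reference, precision and recall for a single query can be computed as in the sketch below. The document identifiers are invented for illustration and are not results from the project; `precision_recall` is a hypothetical helper name.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant, for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical judgments: 3 relevant documents, 4 retrieved, 2 of them relevant.
precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d2", "d5"],
)
# precision == 0.5, recall == 2/3
```

    Running the same function on the result list of the original query and of each augmented query gives the per-query change that the bar diagram summarizes.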
  • 14. Future Work
    1. The approach can be extended to use query logs.
       ◦ Query logs can serve as a knowledge base for suggesting queries, and also allow queries to be suggested asynchronously.
    2. User preferences can be used to filter the documents by relevance.
    3. Different mixed versions of Markov models can be used to achieve the best balance between accuracy and coverage, in terms of both data-centric (objective) and user-centric (subjective) evaluation metrics.
    4. Different N-gram variations can be used to make the approach better suited to a real-time search engine.
  • 15. Conclusion
    1. We have explained our approach to query expansion, which uses clustering to draw suggestions from various contexts, and an N-gram and Markov model to determine augmentation terms.
    2. We applied a sequential probabilistic model because it is suitable for the task of online query recommendation.
    3. We achieved accuracy and coverage in terms of data.
    4. We measured the time and memory complexity of our application and found it suitable for a real-time search engine.