Building a Meta-Search Engine
Information Retrieval
CS60092
Mentor:
Suman Kalyan Maity
Project Members:
Ayan Chandra, CS, 16CS72P02
Sandeep Sharma, MI, 13MI31025
Ankita Saha, AT, 16AT72P01
Vineet Jain, ME, 15ME30044
Indrasekhar Sengupta, RJ, 16RJ72P01
Sudeshna Das, ET, 16ET91R01
Github: https://github.com/metasearchengine/metarank
Introduction
● A meta-search engine (MSE) is an aggregator search service that uses the results of a set
of search engines to produce its own ranked results from the internet, given a query from
the user interface.
● It takes a query from the user, simultaneously sends it to third-party search engine APIs,
and, on receiving sufficient data, re-ranks the combined results with its re-ranker and
presents them to the user.
Objective
To build an experimental meta-search engine
Key areas:
● Meta-search infrastructure.
● Meta-ranking or rank aggregation.
System Module & Methodology
Infrastructure
Query Set
A set of 100 queries is
selected as the benchmark
query set.
1. Ten queries for distinct keywords search
2. Ten queries for phrase word search
3. Ten queries for appended keywords search
4. Ten queries for words related to named entities,
e.g. persons
5. Ten queries for keywords related to trending
topics
6. Ten queries for keywords related to news
7. Ten queries for video keywords, specifically
YouTube
8. Ten queries for product search
9. Ten queries for rare search
10. Ten queries for keywords related to weather
Query Type: example query phrases/words
Appended Keyword: Java, Java programming, Java programming tutorial ...
Distinct Keyword: Cricket score, CM of UP, Latest Hollywood movies ...
Phrase word: The bewildered tourist, Knowing what I know now ...
Named entity: Sachin Tendulkar, Cormen, Coorg, Elon Musk ...
Trending keywords: IPL, Donald Trump, ISIS, Yogi Adityanath, Space-X ...
News: Indian News, Delhi MCD Election, Dalai Lama visit ...
Video keyword: Latest songs, DBMS Lectures, Latest Movie Trailers ...
Product: Phone charger, Earphones, Books, iPad, Watch ...
Rare keyword: Philanthropists, Anthropology, Serendipity, Gynecologist ...
Weather query: Today's weather, Weather on 1st January, Temperature ...
Query Pre-Processing
On-demand module
• Search engines impose word limits on queries
• Ensures that important words
are not lost
• Module is triggered only for long
queries (more than 10 words)
• Avoids unnecessary pre-processing
• Terminological noun phrase
extraction using a large
corpus
Algo: Keyphrase Extraction
Input: Query q
Output: Keyphrases
1. Perform POS tagging on query q
2. Extract terminological noun phrases by
using regular expression patterns
3. Filter noun phrases by using a large web
corpus
4. Return keyphrases
Assumption: If the length of q is above a certain
threshold, it is likely to be one or more
well-formed sentences.
● POS tagging: NLTK toolkit
● Regular expressions:
○ P1 = C*N
○ P2 = (C*NP)?(C*N)
○ P3 = A*N+
■ N = noun,
■ P = preposition,
■ A = adjective, C = A|N
Example
Query where can i find a real example of a very long search engine query
POS tagging where/WRB can/MD i/VB find/VB a/DT real/JJ example/NN of/IN a/DT
very/RB long/JJ search/NN engine/NN query/NN
Regular expression
filtering
real example, long search engine query
Web corpus filtering real example, long search engine query
Output real example long search engine query
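The pattern matching in the worked example above can be sketched in Python. This is a minimal illustration of pattern P1 = C*N applied to an already POS-tagged query (in the system, NLTK performs the tagging and a large web corpus does the final filtering, both omitted here); the mapping of Penn Treebank tags to the pattern symbols is an assumption, folding all noun tags (NN*) to N and adjective tags (JJ*) to A:

```python
import re

# Map Penn Treebank POS tags to the pattern symbols used on the slide:
# N = noun, A = adjective, P = preposition; everything else is O.
def symbol(tag):
    if tag.startswith("NN"):
        return "N"
    if tag.startswith("JJ"):
        return "A"
    if tag == "IN":
        return "P"
    return "O"

def extract_keyphrases(tagged):
    """tagged: list of (word, POS) pairs. Returns phrases matching C*N, C = A|N."""
    symbols = "".join(symbol(t) for _, t in tagged)
    phrases = []
    for m in re.finditer(r"[AN]*N", symbols):  # pattern P1 = C*N
        phrases.append(" ".join(w for w, _ in tagged[m.start():m.end()]))
    return phrases

# The slide's example query, already POS-tagged:
tagged = [("where", "WRB"), ("can", "MD"), ("i", "VB"), ("find", "VB"),
          ("a", "DT"), ("real", "JJ"), ("example", "NN"), ("of", "IN"),
          ("a", "DT"), ("very", "RB"), ("long", "JJ"), ("search", "NN"),
          ("engine", "NN"), ("query", "NN")]
print(extract_keyphrases(tagged))  # ['real example', 'long search engine query']
```

This reproduces the regular-expression filtering step of the example; patterns P2 and P3 would be additional regexes over the same symbol string.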
Caching
• Issues:
• Freshness of the results returned for an
identical query
• How long to keep queries in the
cache, and how many
• Benefits?
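A minimal sketch of how the freshness and size concerns above can be handled together: a time-to-live per entry bounds staleness, and a maximum size with LRU eviction bounds how much is kept (the TTL value, capacity, and eviction policy here are illustrative assumptions, not the project's actual settings):

```python
import time
from collections import OrderedDict

class QueryCache:
    """Cache query results with a TTL (freshness) and a max size (how much to keep)."""
    def __init__(self, ttl_seconds=300, max_entries=1000):
        self.ttl = ttl_seconds
        self.max_entries = max_entries
        self.entries = OrderedDict()  # query -> (timestamp, results)

    def get(self, query):
        hit = self.entries.get(query)
        if hit is None:
            return None
        ts, results = hit
        if time.time() - ts > self.ttl:   # stale entry: drop it and report a miss
            del self.entries[query]
            return None
        self.entries.move_to_end(query)   # LRU touch
        return results

    def put(self, query, results):
        self.entries[query] = (time.time(), results)
        self.entries.move_to_end(query)
        if len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)  # evict the least recently used entry
```

The benefit is that an identical query inside the TTL window is served without spending an API call; after the TTL it is re-fetched, which bounds how stale a cached result can be.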
Pre-threading
management
• Has the new query been
tagged or identified as
belonging to a particular
topic or genre?
• Handles the case where the
system fails to receive
responses from all the
considered search engine
APIs within a threshold
time limit
Query limit and API
Key management
• A single API key allows
1000 queries free of charge
• A pool of API keys is
generated for each
search engine API
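The key-pool idea can be sketched as below, assuming (as on the slide) a fixed free quota of queries per key; the class and its counters are illustrative, not the project's actual code:

```python
class KeyPool:
    """Rotate through a pool of API keys, respecting a per-key query quota."""
    def __init__(self, keys, quota=1000):
        self.quota = quota
        self.usage = {k: 0 for k in keys}  # queries spent per key so far

    def next_key(self):
        # Hand out the first key that still has quota left.
        for key, used in self.usage.items():
            if used < self.quota:
                self.usage[key] += 1
                return key
        raise RuntimeError("all API keys in the pool are exhausted")

pool = KeyPool(["KEY_A", "KEY_B"], quota=2)
print([pool.next_key() for _ in range(4)])  # ['KEY_A', 'KEY_A', 'KEY_B', 'KEY_B']
```

Each outgoing API request asks the pool for a key, so the free quotas of all keys are pooled transparently.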
Threading Module
Multithreading is the ability of a central processing unit (CPU) or a
single core in a multicore processor to execute multiple processes or
threads concurrently.
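In this system, threading is what lets one query fan out to several engine APIs at once and keep only the responses that arrive within the threshold time from the pre-threading step. A minimal sketch (engine names, delays, and the timeout are illustrative stand-ins; real calls would hit the HTTP APIs):

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

def query_engine(name, delay):
    """Stand-in for a search engine API call; delay simulates network latency."""
    time.sleep(delay)
    return name, [f"{name}-result-1", f"{name}-result-2"]

engines = {"engine_fast": 0.01, "engine_slow": 0.5}
TIMEOUT = 0.1  # threshold time limit for collecting responses

with ThreadPoolExecutor(max_workers=len(engines)) as pool:
    futures = [pool.submit(query_engine, n, d) for n, d in engines.items()]
    done, not_done = wait(futures, timeout=TIMEOUT)
    # Only engines that answered within the threshold contribute results.
    responses = dict(f.result() for f in done)

print(sorted(responses))  # ['engine_fast']
```

Queries run in parallel rather than one after another, so total latency is bounded by the timeout instead of the sum of the engines' response times.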
Meta rank module
1. For a query Q, the sets of results returned
by the different search engine APIs are to be
re-ranked into a single list.
2. If most of the search engines vote that result i
has a better rank (result index) than result j,
then result i is assumed to be better than j.
3. The concept of alpha-majority is a better approach
when we have a large number of search engines.
X = {0, 1} → the set of possible opinions on an item pair. n_x → the number of rankings that give opinion x ∈ X.
N → the total number of rankings. 0 ≤ alpha ≤ 0.5, 0 ≤ beta ≤ 1.
Ranking k has disagreed with the alpha-majority iff both of the following conditions are satisfied:
1. n_0 + n_1 ≥ ceil(beta * N) ... eq (1)
2. n_x(k) < alpha * (n_0 + n_1) ... eq (2)
Weight assignment rule:
W_l = 1 - (Σ over item pairs of delta) / C(|S|, 2) ... eq (3)
W_l → the fraction of item pairs for which input ranking R_l agrees with the alpha-majority,
where, for each pair (i, j):
delta = 0, if R_l does not disagree with the alpha-majority for (i, j)
      = 1, if R_l disagrees with the alpha-majority for (i, j)
      = 0.5, if R_l does not rank both i and j
|S| → the number of distinct items that appear in the input rankings, so C(|S|, 2) is the number of item pairs.
[Alpha-majority] The opinion of a ranker on a pair is incorrect if it fails to agree with a fraction alpha of the rankers that rank both items.
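The weight rule of eqs. (1)-(3) can be sketched as follows. Rankings are maps from item to rank position (1 = best); the alpha and beta values and the example rankings at the bottom are illustrative, not taken from the project:

```python
import math
from itertools import combinations

def alpha_majority_weights(rankings, alpha=0.4, beta=1.0):
    """rankings: list of dicts item -> rank (1 = best). Returns one weight per ranking."""
    items = sorted({i for r in rankings for i in r})   # S, the distinct items
    n_pairs = math.comb(len(items), 2)                 # C(|S|, 2)
    N = len(rankings)
    deltas = [0.0] * N
    for i, j in combinations(items, 2):
        # Opinion on pair (i, j): 1 if i is ranked above j, else 0.
        opinions = [(k, 1 if r[i] < r[j] else 0)
                    for k, r in enumerate(rankings) if i in r and j in r]
        n1 = sum(x for _, x in opinions)
        n0 = len(opinions) - n1
        voters = dict(opinions)
        for k in range(N):
            if k not in voters:
                deltas[k] += 0.5                  # R_k does not rank both i and j
            else:
                n_x = n1 if voters[k] == 1 else n0
                if n0 + n1 >= math.ceil(beta * N) and n_x < alpha * (n0 + n1):
                    deltas[k] += 1                # eqs (1) and (2): disagreement
    return [1 - d / n_pairs for d in deltas]      # eq (3)

# Two rankings agree (a > b > c); the third swaps a and b, so it is penalised
# on the pair (a, b) and keeps full credit on the other two pairs.
rankings = [{"a": 1, "b": 2, "c": 3},
            {"a": 1, "b": 2, "c": 3},
            {"b": 1, "a": 2, "c": 3}]
print(alpha_majority_weights(rankings))  # [1.0, 1.0, 0.666...]
```

The resulting weights can then scale each engine's votes in the pairwise re-ranking step, so engines that often disagree with the alpha-majority count for less.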
Query & Response Log
Analysis
Three phases:
1. Collection
2. Preparation
3. Analysis
• Collection: query responses in JSON
format
• Preparation:
a. Importing log data into a NoSQL store
b. Cleaning
c. Log format: JSON, CSV
d. Log database: MongoDB
• Analysis:
a. Term-level analysis
b. Query-level analysis
c. Search-engine-specific analysis
IR Evaluation
• Mean Average Precision
• Recall
• Precision-Recall Ratio
Query Type Mean Average Precision Recall Precision/Recall Ratio
Appended Keyword 3.67 6 0.61
Distinct Keyword 4.49 7 0.64
Phrase word 4.33 7 0.62
Named entity 5.67 8 0.71
Trending keywords 6.11 6 1.01
News 6.69 6 1.12
Video keyword 6.54 6 1.09
Product 4.14 6 0.69
Rare keyword 6.76 8 0.85
Weather query 7.14 5 1.43
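For reference, the standard (0 to 1) definition of average precision over a single ranked result list can be sketched as below. Note that the table above appears to report scores on a different scale, so this is only the textbook formula, not a reconstruction of the project's scoring:

```python
def average_precision(relevances):
    """relevances: 0/1 relevance judgement for each result, in ranked order."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(relevances, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at each relevant result
    return precision_sum / hits if hits else 0.0

def mean_average_precision(runs):
    """runs: one relevance list per query; MAP is the mean of the per-query APs."""
    return sum(average_precision(r) for r in runs) / len(runs)

# Relevant results at ranks 1 and 3: AP = (1/1 + 2/3) / 2
print(round(average_precision([1, 0, 1]), 3))  # 0.833
```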
Recall vs MAP
Metric for overall performance
1. Core technology, weightage: 5%
2. Scalability, weightage: 10%
3. Search time, weightage: 20%
4. Query functionality, weightage: 10%
5. Search relevance, weightage: 50%
Our rating as per the system: 4 + 7 + 12 + 7 + 38 = 68 out of 100
User Interface
Query result view: example 1
Query result view: example 2
Thank You
References:
M. S. Desarkar, S. Sarkar, P. Mitra. 2016. Preference
relations based unsupervised rank aggregation for
meta-search. Expert Systems With Applications 49,
86-98.
C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel,
S. J. Bethard, D. McClosky. 2014. The Stanford CoreNLP
Natural Language Processing Toolkit. In Proceedings of
the 52nd Annual Meeting of the Association for
Computational Linguistics: System Demonstrations,
pp. 55-60.
