Building a Meta-Search Engine
Information Retrieval
CS60092
Mentor:
Suman Kalyan Maity
Project Members:
Ayan Chandra, CS, 16CS72P02
Sandeep Sharma, MI, 13MI31025
Ankita Saha, AT, 16AT72P01
Vineet Jain, ME, 15ME30044
Indrasekhar Sengupta, RJ, 16RJ72P01
Sudeshna Das, ET, 16ET91R01
Github: https://github.com/metasearchengine/metarank
Introduction
● A meta-search engine (MSE) is an aggregator search service that uses the results of a set
of search engines to produce its own ranked results from the internet, given a query from
the user interface.
● It takes a query from the user, simultaneously sends it to third-party search engine APIs,
and, on receiving sufficient data, re-ranks the combined results with its re-ranker and
presents them to the user.
Objective
To build an experimental meta-search engine
Key areas:
● Meta-search infrastructure.
● Meta-ranking or rank aggregation.
System Module & Methodology
Infrastructure
Query Set
A set of 100 queries is
selected as the benchmark
query set.
1. Ten queries for distinct keywords search
2. Ten queries for phrase word search
3. Ten queries for appended keywords search
4. Ten queries for words related to named entities,
e.g. persons
5. Ten queries for keywords related to trending
topics
6. Ten queries for keywords related to news
7. Ten queries for video keywords, specifically
YouTube
8. Ten queries for product search
9. Ten queries for rare search
10. Ten queries for keywords related to weather
Query Type: example query phrases/words
Appended Keyword: Java, Java programming, Java programming tutorial ...
Distinct Keyword: Cricket score, CM of UP, Latest Hollywood movies ...
Phrase word: The bewildered tourist, Knowing what I know now ...
Named entity: Sachin Tendulkar, Cormen, Coorg, Elon Musk ...
Trending keywords: IPL, Donald Trump, ISIS, Yogi Adityanath, Space-X ...
News: Indian News, Delhi MCD Election, Dalai Lama visit ...
Video keyword: Latest songs, DBMS Lectures, Latest Movie Trailers ...
Product: Phone charger, Earphones, Books, iPad, Watch ...
Rare keyword: Philanthropists, Anthropology, Serendipity, Gynecologist ...
Weather query: Today's weather, Weather on 1st January, Temperature ...
Query Pre-Processing
On-demand module
• Search engines impose word limits on queries
• Ensures that important words
are not lost
• Module is triggered only for long
queries (more than 10 words)
• Avoids unnecessary pre-processing
• Terminological noun phrase
extraction using a large
corpus
Algo: Keyphrase Extraction
Input: Query q
Output: Keyphrases
1. Perform POS tagging on query q
2. Extract terminological noun phrases by
using regular expression patterns
3. Filter noun phrases by using a large web
corpus
4. Return keyphrases
Assumption: If the length of q is above a certain
threshold, it is likely to be one or more
well-formed sentences.
● POS tagging: NLTK toolkit
● Regular expressions:
○ P1 = C*N
○ P2 = (C*NP)?(C*N)
○ P3 = A*N+
■ N = noun,
■ P = preposition,
■ A = adjective, C = A|N
Example
Query where can i find a real example of a very long search engine query
POS tagging where/WRB can/MD i/VB find/VB a/DT real/JJ example/NN of/IN a/DT
very/RB long/JJ search/NN engine/NN query/NN
Regular expression
filtering
real example, long search engine query
Web corpus filtering real example, long search engine query
Output real example long search engine query
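The pattern matching in the worked example above can be sketched in Python. This is a minimal illustration of pattern P1 = C*N applied to an already POS-tagged query (in the system, NLTK performs the tagging and a large web corpus does the final filtering, both omitted here); the mapping of Penn Treebank tags to the pattern symbols is an assumption, folding all noun tags (NN*) to N and adjective tags (JJ*) to A:

```python
import re

# Map Penn Treebank POS tags to the pattern symbols used on the slide:
# N = noun, A = adjective, P = preposition; everything else is O.
def symbol(tag):
    if tag.startswith("NN"):
        return "N"
    if tag.startswith("JJ"):
        return "A"
    if tag == "IN":
        return "P"
    return "O"

def extract_keyphrases(tagged):
    """tagged: list of (word, POS) pairs. Returns phrases matching C*N, C = A|N."""
    symbols = "".join(symbol(t) for _, t in tagged)
    phrases = []
    for m in re.finditer(r"[AN]*N", symbols):  # pattern P1 = C*N
        phrases.append(" ".join(w for w, _ in tagged[m.start():m.end()]))
    return phrases

# The slide's example query, already POS-tagged:
tagged = [("where", "WRB"), ("can", "MD"), ("i", "VB"), ("find", "VB"),
          ("a", "DT"), ("real", "JJ"), ("example", "NN"), ("of", "IN"),
          ("a", "DT"), ("very", "RB"), ("long", "JJ"), ("search", "NN"),
          ("engine", "NN"), ("query", "NN")]
print(extract_keyphrases(tagged))  # ['real example', 'long search engine query']
```

This reproduces the regular-expression filtering step of the example; patterns P2 and P3 would be additional regexes over the same symbol string.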
Caching
• Issues:
• Freshness of the results returned for an
identical query
• How long to keep queries in the
cache, and how many
• Benefits?
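A minimal sketch of how the freshness and size concerns above can be handled together: a time-to-live per entry bounds staleness, and a maximum size with LRU eviction bounds how much is kept (the TTL value, capacity, and eviction policy here are illustrative assumptions, not the project's actual settings):

```python
import time
from collections import OrderedDict

class QueryCache:
    """Cache query results with a TTL (freshness) and a max size (how much to keep)."""
    def __init__(self, ttl_seconds=300, max_entries=1000):
        self.ttl = ttl_seconds
        self.max_entries = max_entries
        self.entries = OrderedDict()  # query -> (timestamp, results)

    def get(self, query):
        hit = self.entries.get(query)
        if hit is None:
            return None
        ts, results = hit
        if time.time() - ts > self.ttl:   # stale entry: drop it and report a miss
            del self.entries[query]
            return None
        self.entries.move_to_end(query)   # LRU touch
        return results

    def put(self, query, results):
        self.entries[query] = (time.time(), results)
        self.entries.move_to_end(query)
        if len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)  # evict the least recently used entry
```

The benefit is that an identical query inside the TTL window is served without spending an API call; after the TTL it is re-fetched, which bounds how stale a cached result can be.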
Pre-threading
management
• Has the new query been
tagged or identified as
belonging to a particular
topic or genre?
• Handles the case where the
system fails to receive
responses from all the
considered search engine
APIs within a threshold
time limit
Query limit and API
Key management
• A single API key allows
1000 queries free of charge
• A pool of API keys is
generated for each
search engine API
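The key-pool idea can be sketched as below, assuming (as on the slide) a fixed free quota of queries per key; the class and its counters are illustrative, not the project's actual code:

```python
class KeyPool:
    """Rotate through a pool of API keys, respecting a per-key query quota."""
    def __init__(self, keys, quota=1000):
        self.quota = quota
        self.usage = {k: 0 for k in keys}  # queries spent per key so far

    def next_key(self):
        # Hand out the first key that still has quota left.
        for key, used in self.usage.items():
            if used < self.quota:
                self.usage[key] += 1
                return key
        raise RuntimeError("all API keys in the pool are exhausted")

pool = KeyPool(["KEY_A", "KEY_B"], quota=2)
print([pool.next_key() for _ in range(4)])  # ['KEY_A', 'KEY_A', 'KEY_B', 'KEY_B']
```

Each outgoing API request asks the pool for a key, so the free quotas of all keys are pooled transparently.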
Threading Module
Multithreading is the ability of a central processing unit (CPU) or a
single core in a multicore processor to execute multiple processes or
threads concurrently.
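In this system, threading is what lets one query fan out to several engine APIs at once and keep only the responses that arrive within the threshold time from the pre-threading step. A minimal sketch (engine names, delays, and the timeout are illustrative stand-ins; real calls would hit the HTTP APIs):

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

def query_engine(name, delay):
    """Stand-in for a search engine API call; delay simulates network latency."""
    time.sleep(delay)
    return name, [f"{name}-result-1", f"{name}-result-2"]

engines = {"engine_fast": 0.01, "engine_slow": 0.5}
TIMEOUT = 0.1  # threshold time limit for collecting responses

with ThreadPoolExecutor(max_workers=len(engines)) as pool:
    futures = [pool.submit(query_engine, n, d) for n, d in engines.items()]
    done, not_done = wait(futures, timeout=TIMEOUT)
    # Only engines that answered within the threshold contribute results.
    responses = dict(f.result() for f in done)

print(sorted(responses))  # ['engine_fast']
```

Queries run in parallel rather than one after another, so total latency is bounded by the timeout instead of the sum of the engines' response times.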
Meta rank module
1. For a query Q, the sets of results returned
by the different search engine APIs are to be
re-ranked into a single list.
2. If most of the search engines vote that result i
has a better rank (result index) than result j,
then result i is assumed to be better than j.
3. The concept of alpha-majority is a better approach
when we have a large number of search engines.
X = {0, 1} → the set of possible opinions on an item pair. n_x → the number of rankings that give opinion x ∈ X.
N → the total number of rankings. 0 ≤ alpha ≤ 0.5, 0 ≤ beta ≤ 1.
Ranking k has disagreed with the alpha-majority iff both of the following conditions are satisfied:
1. n_0 + n_1 ≥ ceil(beta * N) ... eq (1)
2. n_x(k) < alpha * (n_0 + n_1) ... eq (2)
Weight assignment rule:
W_l = 1 - (Σ over item pairs of delta) / C(|S|, 2) ... eq (3)
W_l → the fraction of item pairs for which input ranking R_l agrees with the alpha-majority,
where, for each pair (i, j):
delta = 0, if R_l does not disagree with the alpha-majority for (i, j)
      = 1, if R_l disagrees with the alpha-majority for (i, j)
      = 0.5, if R_l does not rank both i and j
|S| → the number of distinct items that appear in the input rankings, so C(|S|, 2) is the number of item pairs.
[Alpha-majority] The opinion of a ranker on a pair is incorrect if it fails to agree with a fraction alpha of the rankers that rank both items.
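The weight rule of eqs. (1)-(3) can be sketched as follows. Rankings are maps from item to rank position (1 = best); the alpha and beta values and the example rankings at the bottom are illustrative, not taken from the project:

```python
import math
from itertools import combinations

def alpha_majority_weights(rankings, alpha=0.4, beta=1.0):
    """rankings: list of dicts item -> rank (1 = best). Returns one weight per ranking."""
    items = sorted({i for r in rankings for i in r})   # S, the distinct items
    n_pairs = math.comb(len(items), 2)                 # C(|S|, 2)
    N = len(rankings)
    deltas = [0.0] * N
    for i, j in combinations(items, 2):
        # Opinion on pair (i, j): 1 if i is ranked above j, else 0.
        opinions = [(k, 1 if r[i] < r[j] else 0)
                    for k, r in enumerate(rankings) if i in r and j in r]
        n1 = sum(x for _, x in opinions)
        n0 = len(opinions) - n1
        voters = dict(opinions)
        for k in range(N):
            if k not in voters:
                deltas[k] += 0.5                  # R_k does not rank both i and j
            else:
                n_x = n1 if voters[k] == 1 else n0
                if n0 + n1 >= math.ceil(beta * N) and n_x < alpha * (n0 + n1):
                    deltas[k] += 1                # eqs (1) and (2): disagreement
    return [1 - d / n_pairs for d in deltas]      # eq (3)

# Two rankings agree (a > b > c); the third swaps a and b, so it is penalised
# on the pair (a, b) and keeps full credit on the other two pairs.
rankings = [{"a": 1, "b": 2, "c": 3},
            {"a": 1, "b": 2, "c": 3},
            {"b": 1, "a": 2, "c": 3}]
print(alpha_majority_weights(rankings))  # [1.0, 1.0, 0.666...]
```

The resulting weights can then scale each engine's votes in the pairwise re-ranking step, so engines that often disagree with the alpha-majority count for less.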
Query & Response Log
Analysis
Three phases:
1. Collection
2. Preparation
3. Analysis
• Collection: query responses in JSON
format
• Preparation:
a. Importing log data into a NoSQL store
b. Cleaning
c. Log format: JSON, CSV
d. Log database: MongoDB
• Analysis:
a. Term-level analysis
b. Query-level analysis
c. Search-engine-specific analysis
IR Evaluation
• Mean Average Precision
• Recall
• Precision-Recall Ratio
Query Type Mean Average Precision Recall Precision/Recall Ratio
Appended Keyword 3.67 6 0.61
Distinct Keyword 4.49 7 0.64
Phrase word 4.33 7 0.62
Named entity 5.67 8 0.71
Trending keywords 6.11 6 1.01
News 6.69 6 1.12
Video keyword 6.54 6 1.09
Product 4.14 6 0.69
Rare keyword 6.76 8 0.85
Weather query 7.14 5 1.43
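For reference, the standard (0 to 1) definition of average precision over a single ranked result list can be sketched as below. Note that the table above appears to report scores on a different scale, so this is only the textbook formula, not a reconstruction of the project's scoring:

```python
def average_precision(relevances):
    """relevances: 0/1 relevance judgement for each result, in ranked order."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(relevances, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at each relevant result
    return precision_sum / hits if hits else 0.0

def mean_average_precision(runs):
    """runs: one relevance list per query; MAP is the mean of the per-query APs."""
    return sum(average_precision(r) for r in runs) / len(runs)

# Relevant results at ranks 1 and 3: AP = (1/1 + 2/3) / 2
print(round(average_precision([1, 0, 1]), 3))  # 0.833
```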
Recall vs MAP
Metric for overall performance
1. Core technology, weightage: 5%
2. Scalability, weightage: 10%
3. Search time, weightage: 20%
4. Query functionality, weightage: 10%
5. Search relevance, weightage: 50%
Our rating as per the system: 4 + 7 + 12 + 7 + 38 = 68 out of 100
User Interface
Query result view: example 1
Query result view: example 2
Thank You
References:
M. S. Desarkar, S. Sarkar, P. Mitra. 2016. Preference
relations based unsupervised rank aggregation for
meta-search. Expert Systems With Applications 49,
86-98.
C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel,
S. J. Bethard, D. McClosky. 2014. The Stanford CoreNLP
Natural Language Processing Toolkit. In Proceedings of
the 52nd Annual Meeting of the Association for
Computational Linguistics: System Demonstrations,
pp. 55-60.
