Reduce Query Time Up to 60% with Selective Search


Rajani Maski
Lucidworks
Professional Services
ABSTRACT
This talk presents a technique to improve search relevance and query performance by
dividing collections into topic shards and executing each search only across the
top-ranked shards. The concept is based on the cluster hypothesis, which states that
documents in the same cluster behave similarly with respect to relevance to information
needs. It has been studied in academic research by Kulkarni A. and Callan J. as
Selective Search.
Takeaway
This talk outlines the latest search techniques, presents the experimental setup, and
concludes with empirical results.
Intended Audience
Experience in Search and Machine Learning.
Github page at https://github.com/rajanim/selective-search

Agenda
● Brief on large dataset search applications
● Current implementation and shortfalls
● Researched implementation
● Experimental setup & details
● Results & References

Brief on large dataset search applications
Datasets with sizes in the terabytes
Search applications that deliver swift, interactive search and
meet high standards for result quality.

Current implementation
Distributed Information Retrieval (DIR) Architecture - Exhaustive Search
Each computing resource is assigned a partition of the dataset, and this
partition is known as a “shard”. At query time, each shard is expected
to handle the query independently.
This architecture has been handling a very large volume of search queries,
on the order of a few billion per day.
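
A minimal sketch of the exhaustive (scatter-gather) pattern described above, assuming toy in-memory shards and a term-count score in place of a real search engine. The shard contents, `SHARDS`, `score`, `search_shard` and `exhaustive_search` are hypothetical names used only for illustration.

```python
import heapq
from typing import Dict, List, Tuple

# Hypothetical toy shards: shard name -> {doc id -> text}.
SHARDS: Dict[str, Dict[str, str]] = {
    "shard1": {"d1": "solr search engine", "d2": "spark cluster computing"},
    "shard2": {"d3": "search relevance ranking", "d4": "job posting for cook"},
    "shard3": {"d5": "distributed search over shards"},
}


def score(query: str, text: str) -> int:
    """Toy relevance score: number of query-term occurrences in the text."""
    tokens = text.lower().split()
    return sum(tokens.count(term) for term in query.lower().split())


def search_shard(shard: Dict[str, str], query: str, k: int) -> List[Tuple[int, str]]:
    """Each shard answers independently with its own local top-k hits."""
    hits = [(score(query, text), doc_id) for doc_id, text in shard.items()]
    return heapq.nlargest(k, [hit for hit in hits if hit[0] > 0])


def exhaustive_search(query: str, k: int = 3) -> List[Tuple[int, str]]:
    """Scatter the query to every shard, then merge the partial results."""
    partial: List[Tuple[int, str]] = []
    for shard in SHARDS.values():  # every shard is touched on every query
        partial.extend(search_shard(shard, query, k))
    return heapq.nlargest(k, partial)
```

Note that the loop visits all shards regardless of the query, which is the cost that the shortfall slide refers to.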

Shortfalls
High computation costs are incurred by searching exhaustively across every
partition (shard) of such a large collection.

Researched implementation
Divide a large collection into topic-based shards and search
only across the top-ranked shards.
Motive
Avoid exhaustive search, that is, searching across every shard.
Concept
The idea is based on the cluster hypothesis, which states that documents in
the same cluster behave similarly with respect to relevance to information
needs.
Literature Review
Studied in academic research by Kulkarni A. and Callan J. as Selective
Search[1][3].

Clustering results
Word cloud view of each shard
20newsgroup dataset
Job Posts dataset

Researched implementation details
Generate a clustering ML model based on a percentage of the dataset
Route documents to shards based on content similarity (topic-based
partition), yielding topic shards
Search only across the top-ranked shards
Selective Search algorithms (ReDDe[3], CORI[4] and LTR[5])
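
The first two steps above (train a clustering model on a sample, then route every document to its nearest cluster) can be sketched as follows. This is a small pure-Python stand-in, not the Spark MLlib pipeline used in the talk: it assumes unit-length term-frequency vectors instead of TF-IDF, cosine similarity, and a deterministic farthest-first initialization. All function names and the `shardN` naming are hypothetical.

```python
import math
import random
from collections import defaultdict
from typing import Dict, List

Vector = Dict[str, float]


def normalize(vec: Dict[str, float]) -> Vector:
    """Scale a sparse vector to unit length (empty vectors stay empty)."""
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {term: w / norm for term, w in vec.items()}


def vectorize(text: str) -> Vector:
    """Unit-length term-frequency vector (a stand-in for TF-IDF features)."""
    counts: Dict[str, float] = defaultdict(float)
    for token in text.lower().split():
        counts[token] += 1.0
    return normalize(counts)


def dot(a: Vector, b: Vector) -> float:
    """Cosine similarity, since both vectors are unit length."""
    return sum(w * b.get(term, 0.0) for term, w in a.items())


def nearest(vec: Vector, centroids: List[Vector]) -> int:
    """Index of the most similar centroid."""
    return max(range(len(centroids)), key=lambda i: dot(vec, centroids[i]))


def centroid_of(vectors: List[Vector]) -> Vector:
    """Normalized sum of the member vectors."""
    total: Dict[str, float] = defaultdict(float)
    for vec in vectors:
        for term, w in vec.items():
            total[term] += w
    return normalize(total)


def train_centroids(texts: List[str], k: int, sample_fraction: float = 1.0,
                    iterations: int = 5, seed: int = 7) -> List[Vector]:
    """Fit k centroids on a uniform random sample of the collection."""
    rng = random.Random(seed)
    n = min(len(texts), max(k, int(len(texts) * sample_fraction)))
    sample = [vectorize(t) for t in rng.sample(texts, n)]
    # Farthest-first initialization (an assumption, chosen for determinism).
    centroids = [sample[0]]
    while len(centroids) < k:
        centroids.append(
            min(sample, key=lambda v: max(dot(v, c) for c in centroids)))
    for _ in range(iterations):
        groups: List[List[Vector]] = [[] for _ in range(k)]
        for vec in sample:
            groups[nearest(vec, centroids)].append(vec)
        centroids = [centroid_of(g) if g else centroids[i]
                     for i, g in enumerate(groups)]
    return centroids


def route_document(text: str, centroids: List[Vector]) -> str:
    """Name of the topic shard a document would be routed to."""
    return f"shard{nearest(vectorize(text), centroids) + 1}"
```

Training on a sample keeps the clustering step cheap; routing the full collection afterwards needs only one nearest-centroid lookup per document.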

Researched implementation details
Clustering algorithm(s)
KMeans with uniform random sampling
KMeans with vocabulary-based rejection
Selective Search algorithms
ReDDe (Relevant Document Distribution Estimation[3]): Build a central sample index of documents
chosen by uniformly sampling documents from each clustered shard, then query this index to
decide on the top-ranked shards.
CORI (Collection Retrieval Inference Network[4]): Build an index of unique terms with their
shard association, plus the TF and DF of terms per shard, and calculate a score per shard for
each query to rank the shards.
LTR (Learning to Rank[5]): Build an LTR model based on TF, DF, TF-IDF and BM25 feature vectors
and use the model to rank shards for a given query.
Apache Solr “implicit” routing to distribute documents to their respective cluster shards
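
A simplified ReDDe-style sketch of the shard-ranking idea described above: sample each shard uniformly into a central sample index (CSI), query the CSI, and let every top-ranked sampled document vote for its shard, weighted by how many shard documents it stands for. Term overlap is used here as a stand-in for a real retrieval score, and `build_csi`, `redde_rank` and the data layout are hypothetical names, not the talk's implementation.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def build_csi(shards: Dict[str, List[str]], sample_size: int,
              seed: int = 13) -> Tuple[List[Tuple[str, str]], Dict[str, float]]:
    """Uniformly sample documents per shard into a central sample index."""
    rng = random.Random(seed)
    csi: List[Tuple[str, str]] = []   # (shard name, document text)
    scale: Dict[str, float] = {}      # shard size / sample size
    for name, docs in shards.items():
        if not docs:
            continue
        picked = rng.sample(docs, min(sample_size, len(docs)))
        csi.extend((name, text) for text in picked)
        scale[name] = len(docs) / len(picked)
    return csi, scale


def redde_rank(query: str, csi: List[Tuple[str, str]],
               scale: Dict[str, float], top_n: int = 10) -> List[str]:
    """Rank shards by the estimated number of relevant documents they hold."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(text.lower().split())), name)
              for name, text in csi]
    scored = [hit for hit in scored if hit[0] > 0]
    scored.sort(key=lambda hit: hit[0], reverse=True)
    votes: Dict[str, float] = defaultdict(float)
    for _, name in scored[:top_n]:
        # Each sampled hit represents `scale[name]` documents of its shard.
        votes[name] += scale[name]
    return sorted(votes, key=lambda name: votes[name], reverse=True)
```

The query would then be sent only to the first few shards of the returned ranking instead of to all of them.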

Experimental Setup
● Apache Spark and its MLlib for parallel computing and for generating topical shards.
● Apache Solr libraries for Search and Information Retrieval. Using the “implicit”
router type, documents are routed to shards based on content similarity.
● Spark-Solr library contributed by Lucidworks to read from and write to Solr
● Selective Search algorithms (ReDDe[3], CORI[4] and LTR[5])
● Experimental datasets: 20newsgroup[7], BBC[8], Clueweb[6] and Jobs[9]
https://github.com/rajanim/selective-search#implementation-architecture
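
For the CORI shard-ranking step named in the setup above, a rough sketch under stated assumptions: per-shard document frequencies are collected in memory, and the belief formula uses the commonly cited CORI constants (50, 150, default belief 0.4). The helper names `build_stats` and `cori_rank` and the data layout are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def build_stats(shards: Dict[str, List[str]]
                ) -> Tuple[Dict[str, Counter], Dict[str, int]]:
    """Per-shard document frequency of each term, and shard size in words."""
    df: Dict[str, Counter] = {name: Counter() for name in shards}
    cw: Dict[str, int] = {}
    for name, docs in shards.items():
        cw[name] = 0
        for text in docs:
            tokens = text.lower().split()
            cw[name] += len(tokens)
            df[name].update(set(tokens))  # count each term once per document
    return df, cw


def cori_rank(query: str, df: Dict[str, Counter], cw: Dict[str, int],
              b: float = 0.4) -> List[Tuple[str, float]]:
    """Score every shard for the query and return them best first."""
    terms = query.lower().split()
    if not terms or not df:
        return []
    num_shards = len(df)
    avg_cw = sum(cw.values()) / num_shards
    scores: Dict[str, float] = {}
    for name in df:
        beliefs = []
        for term in terms:
            # cf: number of shards that contain the term at all.
            cf = sum(1 for other in df if df[other][term] > 0)
            if cf == 0:
                beliefs.append(b)  # unseen term: default belief only
                continue
            t = df[name][term] / (df[name][term] + 50 + 150 * cw[name] / avg_cw)
            i = math.log((num_shards + 0.5) / cf) / math.log(num_shards + 1.0)
            beliefs.append(b + (1 - b) * t * i)
        scores[name] = sum(beliefs) / len(beliefs)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Unlike the sample-based approach, this variant needs only term statistics per shard, so the ranking index stays small even for large collections.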

Experimental Setup
● Hardware Specs
○ 32 GB RAM, 250 GB flash storage, 2.2 GHz Intel Core i7
● Solr Cloud
○ Version 6.2.1, 7.x (2 nodes, 50 shards)
● Spark Cluster Version 2.0.0
○ Standalone cluster, 1 master (2g), 2 workers (6g)
● Scala Version 2.11.8, Java 1.8.0_31
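
As an illustration of how a 50-shard collection with “implicit” routing might be set up and queried selectively, the sketch below only builds request URLs and sends nothing. It assumes a local SolrCloud node, a hypothetical collection name, and a hypothetical `cluster_shard` field that carries the target shard name; `router.name`, `router.field`, `shards` and `maxShardsPerNode` are Collections API and query parameters as documented for Solr 6.x/7.x.

```python
from typing import List
from urllib.parse import urlencode

SOLR = "http://localhost:8983/solr"  # assumed local SolrCloud node


def create_collection_url(name: str, num_shards: int, nodes: int = 2) -> str:
    """Collections API CREATE request for explicitly named shards."""
    shard_names = [f"shard{i}" for i in range(1, num_shards + 1)]
    params = {
        "action": "CREATE",
        "name": name,
        "router.name": "implicit",         # documents name their own shard
        "shards": ",".join(shard_names),
        "router.field": "cluster_shard",   # hypothetical routing field
        # Solr 6.x/7.x: needed when a node hosts more than one shard.
        "maxShardsPerNode": -(-num_shards // nodes),  # ceiling division
    }
    return f"{SOLR}/admin/collections?{urlencode(params)}"


def selective_query_url(collection: str, query: str,
                        ranked_shards: List[str], top_t: int = 3) -> str:
    """Query restricted to the top-T ranked shards via the shards parameter."""
    params = {"q": query, "shards": ",".join(ranked_shards[:top_t])}
    return f"{SOLR}/{collection}/select?{urlencode(params)}"
```

For example, `selective_query_url("news", "search", ["shard3", "shard17", "shard8", "shard1"])` limits the request to the first three shards of the ranking.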

Experimental Results
Quantitative Results
● Part of Clueweb dataset[6] (5 million in total)
● Pre-train model
○ 500k news articles, number of shards 50
○ N=32k dimensions (feature terms)
○ K=5 iterations
○ Min document frequency 100
● Time taken to pre-train the model: 1 hour 12 minutes

Query Time Results

Qualitative Analysis
TREC evaluation on the Clueweb dataset[6]

Experimental Results
Cluster distribution of the 20newsgroup dataset[7], which was collapsed into a single
directory for the test
https://github.com/rajanim/selective-search/blob/master/benchmarks/20newsgroup/output/20Newsgroup_dataset_kmeans_cluster_allocation_results.pdf
Word cloud of each cluster (shard)
https://github.com/rajanim/selective-search/blob/master/benchmarks/20newsgroup/output/20newsgroup_word_cloud_clusters.pdf

References
[1] Anagha Kulkarni. 2015. Selective Search: Efficient and Effective Large scale Search. ACM Transactions on Information Systems, 33(4). ACM.
[2] Anagha Kulkarni. 2010. Topic-based Index Partitions for Efficient and Effective Selective Search. 8th Workshop on Large-Scale Distributed Systems for Information Retrieval.
[3] Luo Si and Jamie Callan. 2003. Relevant document distribution estimation method for resource selection. In Proceedings of the SIGIR Conference, pages 298–305. ACM.
[4] James Callan, Zhihong Lu, and Bruce Croft. 1995. Searching distributed collections with inference networks. In Proceedings of the SIGIR Conference, pages 21–28. ACM.
[5] Chuang M. and Kulkarni A. 2017. Improving Shard Selection for Selective Search. In Proceedings of the Asia Information Retrieval Societies Conference. November 2017. Jeju, Korea.
[6] Clueweb09 dataset. Lemur Project.
[7] 20Newsgroups. Jrennie. qwone.com/~jason/20Newsgroups
[8] BBC dataset. http://mlg.ucd.ie/datasets/bbc.html
[9] Job Posts dataset. https://www.kaggle.com/madhab/jobposts/data
https://github.com/rajanim/selective-search/
