Reduce Query Time up to 60% with
Selective Search
Rajani Maski
Lucidworks
Professional Services
ABSTRACT
This talk presents a technique to improve search relevance and query performance by
dividing a collection into topical shards and executing each search only across the
top-ranked shards. The concept is based on the cluster hypothesis, which states that
documents in the same cluster behave similarly with respect to relevance to information
needs. It has been researched in academia by Kulkarni A. and Callan J. as Selective Search.
Takeaway
This talk outlines the latest search techniques, presents the experimental setup, and
concludes with empirical results.
Intended Audience
Experience in Search and Machine Learning.
Github page at https://github.com/rajanim/selective-search
Agenda
● Brief on large dataset search applications
● Current implementation and shortfalls
● Researched implementation
● Experimental setup & details
● Results & References
Brief on large dataset search applications
Datasets are terabytes in size.
Search applications must deliver fast, interactive search and
meet high standards for result quality.
Current implementation
Distributed Information Retrieval (DIR) Architecture: Exhaustive Search
Each computing resource is allocated a partition of the dataset, and
this partition is known as a “shard”. At query time, each shard
processes the query independently.
This architecture handles a very large volume of search queries,
on the order of a few billion per day.
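The scatter-gather pattern described above can be sketched in a few lines. This is a minimal illustration, not the talk's implementation: the shard contents, the term-count `score` function (a stand-in for a real ranking function such as BM25), and all function names are assumptions made for the example.

```python
import heapq

# Toy shards: each maps a document id to its text (hypothetical data).
SHARDS = [
    {"d1": "solr search engine", "d2": "spark cluster computing"},
    {"d3": "search relevance ranking", "d4": "job posting engineer"},
    {"d5": "distributed search shards", "d6": "bbc news sport"},
]

def score(query, text):
    """Count query-term occurrences in the text (a stand-in for BM25)."""
    words = text.split()
    return sum(words.count(term) for term in query.split())

def search_shard(shard, query, k):
    """Each shard answers the query independently and returns its local top-k."""
    hits = [(score(query, text), doc_id) for doc_id, text in shard.items()]
    return heapq.nlargest(k, [hit for hit in hits if hit[0] > 0])

def exhaustive_search(shards, query, k=3):
    """Scatter the query to every shard, then merge the partial results."""
    partial = [hit for shard in shards for hit in search_shard(shard, query, k)]
    return heapq.nlargest(k, partial), len(shards)

top, shards_searched = exhaustive_search(SHARDS, "search")
```

Here every query touches all three shards, whether or not a shard holds any matching document. That per-query cost is what selective search aims to reduce.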
Shortfalls
High computation cost: every query is searched exhaustively across every
partition (shard) of a very large collection.
Researched implementation
Divide a large collection into topical shards and search
exclusively across the top-ranked shards.
Motive
Avoid exhaustive search, that is, searching across every shard.
Concept
The idea is based on the cluster hypothesis, which states that documents in
the same cluster behave similarly with respect to relevance to information
needs.
Literature Review
Researched in academia by Kulkarni A. and Callan J. as Selective
Search[1][2].
Clustering results
Word cloud view of each shard (linked from the original slides):
20newsgroup dataset
Job Posts dataset
Researched implementation details
Generate a clustering ML model based on a sample (some percentage) of the
dataset.
Route documents to shards based on content similarity (topic-based
partitioning), yielding topical shards.
Search exclusively across the top-ranked shards, using Selective Search
algorithms (ReDDe[3], CORI[4] and LTR[5]).
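The first two steps above (fit a clustering model on a sample, then route every document to the shard of its nearest cluster) can be sketched as follows. The talk uses Spark MLlib for this; the pure-Python K-means below is only a small stand-in, and the sample texts, bag-of-words features, cosine similarity, and seeding with the first `k` sample vectors are all assumptions chosen to keep the example short and deterministic.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts (a stand-in for TF-IDF features)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def nearest(vec, centroids):
    """Index of the most similar centroid, used as the target shard number."""
    return max(range(len(centroids)), key=lambda i: cosine(vec, centroids[i]))

def kmeans(vectors, k, iterations=5):
    # Seed with the first k vectors (an assumption, to keep the sketch deterministic).
    centroids = [Counter(v) for v in vectors[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for vec in vectors:
            groups[nearest(vec, centroids)].append(vec)
        for i, group in enumerate(groups):
            if group:  # keep the previous centroid if a cluster is empty
                total = Counter()
                for vec in group:
                    total.update(vec)
                centroids[i] = Counter({t: c / len(group) for t, c in total.items()})
    return centroids

# Step 1: fit the model on a small sample of the collection (hypothetical texts).
SAMPLE = [
    "solr search index query",
    "hockey game team score",
    "search query ranking index",
    "team player game season",
]
centroids = kmeans([vectorize(text) for text in SAMPLE], k=2)

# Step 2: route every document to the shard of its nearest centroid.
COLLECTION = {"a": "lucene index search", "b": "hockey season score"}
shard_of = {doc_id: nearest(vectorize(text), centroids) for doc_id, text in COLLECTION.items()}
```

Only the sample is clustered. The rest of the collection is assigned with a single nearest-centroid lookup per document, which keeps routing cheap once the model exists.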
Researched implementation details
Clustering algorithm(s)
KMeans with uniform random sampling
KMeans with vocabulary-based rejection
Selective Search algorithms
ReDDe (Relevant Document Distribution Estimation[3]): Build a central sample index from
documents chosen by uniformly sampling each clustered shard, then query this index to
decide on the top-ranked shards.
CORI (Collection Retrieval Inference network[4]): Build an index of unique terms with their
shard association and the TF and DF of each term per shard, then calculate a score for each
shard per query to rank the shards.
LTR (Learning to Rank[5]): Build an LTR model based on TF, DF, TF-IDF and BM25 vectors, and
use the model to rank shards for a given query.
Apache Solr “implicit” routing distributes documents to their respective cluster shards.
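The ReDDe and CORI shard-ranking ideas described above can be sketched as follows. This is an illustrative reading of the two published methods, not the code from the talk's repository. The shard contents are hypothetical, taking every second document stands in for uniform random sampling, and the CORI constants (50, 150, and the default belief `b = 0.4`) are the commonly cited defaults rather than values taken from the slides.

```python
import math
from collections import Counter

# Hypothetical topical shards: shard name -> document texts.
SHARDS = {
    "tech": ["solr search index", "search query ranking",
             "spark cluster search", "index shard replica"],
    "sports": ["hockey team game", "baseball game score",
               "team season player", "game search highlights"],
}

def redde_rank(shards, query, sample_step=2, top_n=10):
    """ReDDe-style ranking: query a central sample index (CSI) and let each
    sampled hit vote for its shard, weighted by shard size / sample size."""
    csi = []
    for name, docs in shards.items():
        sample = docs[::sample_step]      # stand-in for uniform random sampling
        scale = len(docs) / len(sample)   # one sampled doc represents `scale` docs
        csi.extend((name, text, scale) for text in sample)
    terms = query.split()
    hits = [(sum(text.split().count(t) for t in terms), name, scale)
            for name, text, scale in csi]
    hits = sorted((h for h in hits if h[0] > 0), reverse=True)[:top_n]
    votes = Counter()
    for _, name, scale in hits:
        votes[name] += scale
    return [name for name, _ in votes.most_common()]  # shards without votes are omitted

def cori_rank(shards, query, b=0.4):
    """CORI-style ranking: per-term belief b + (1 - b) * T * I, averaged per shard."""
    n = len(shards)
    sizes = {name: sum(len(d.split()) for d in docs) for name, docs in shards.items()}
    avg_size = sum(sizes.values()) / n
    # df[name][term] = number of documents in the shard that contain the term
    df = {name: Counter(t for d in docs for t in set(d.split()))
          for name, docs in shards.items()}
    terms = query.split()
    scores = {}
    for name in shards:
        total = 0.0
        for term in terms:
            cf = sum(1 for other in shards if df[other][term] > 0)  # shards containing term
            if cf == 0:
                total += b  # unseen term: default belief only
                continue
            t_part = df[name][term] / (df[name][term] + 50 + 150 * sizes[name] / avg_size)
            i_part = math.log((n + 0.5) / cf) / math.log(n + 1.0)
            total += b + (1 - b) * t_part * i_part
        scores[name] = total / max(len(terms), 1)
    return sorted(scores, key=scores.get, reverse=True)

redde_order = redde_rank(SHARDS, "search")
cori_order = cori_rank(SHARDS, "hockey game")
```

ReDDe needs a sampled document index but no term statistics for the full shards, while CORI needs only per-shard term statistics. A search would then be sent to the first few shards of either ranking.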
Experimental Setup
● Apache Spark and its MLlib to generate
topical shards and for parallel computing.
● Apache Solr libraries for search and
information retrieval. Using the
“implicit” router type, documents are routed to
shards based on content similarity.
● Spark-Solr library contributed by Lucidworks to
read from and write to Solr
● Selective Search algorithms (ReDDe[3],
CORI[4] and LTR[5])
● Experimental datasets: 20newsgroup[7],
BBC[8], ClueWeb[6] and Jobs[9]
https://github.com/rajanim/selective-search#implementation-architecture
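As a rough sketch of how Solr's “implicit” router fits this setup, the snippet below only builds request URLs and sends nothing. The `router.name`, `router.field`, and `shards` parameters are standard Solr parameters. The host, collection name, shard names, and the `cluster_id` field are hypothetical, and other CREATE parameters such as the config set and replication settings are left out.

```python
from urllib.parse import urlencode

SOLR = "http://localhost:8983/solr"  # hypothetical host
COLLECTION = "topical"               # hypothetical collection name

def create_collection_url(shard_names):
    """Collections API CREATE call that names one shard per topic cluster."""
    params = {
        "action": "CREATE",
        "name": COLLECTION,
        "router.name": "implicit",
        "shards": ",".join(shard_names),
        # Hypothetical field whose value names the target shard of each document.
        "router.field": "cluster_id",
    }
    return f"{SOLR}/admin/collections?{urlencode(params)}"

def selective_query_url(query, ranked_shards, top_t=2):
    """Restrict a query to the top-ranked shards with the `shards` parameter."""
    params = {"q": query, "shards": ",".join(ranked_shards[:top_t])}
    return f"{SOLR}/{COLLECTION}/select?{urlencode(params)}"

create_url = create_collection_url(["shard1", "shard2", "shard3"])
query_url = selective_query_url("hockey scores", ["shard3", "shard1", "shard2"])
```

With this router, the indexing side decides where each document lives, so the clustering model's shard assignment can be written into the routing field before a document is sent to Solr.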
Experimental Setup
● Hardware Specs
○ 32 GB RAM, 250 GB flash storage, 2.2 GHz
Intel Core i7
● Solr Cloud
○ Version 6.2.1, 7.x (2 nodes, 50 shards)
● Spark Cluster Version 2.0.0
○ Standalone cluster: 1 master (2 GB), 2
workers (6 GB)
● Scala Version 2.11.8, Java 1.8.0_31
Experimental Results
Quantitative Results
● Subset of the ClueWeb dataset[6] (5 million documents in total)
● Pre-trained model
○ 500k news articles, number of shards: 50
○ N=32k dimensions (feature terms)
○ K=5 iterations
○ Minimum document frequency: 100
● Time taken to pre-train the model: 1 hour 12 minutes
Experimental Results
Query Time Results [figure not included]
Experimental Results
Qualitative Analysis
TREC evaluation on the ClueWeb dataset[6] [figure not included]
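For readers unfamiliar with TREC-style evaluation, one of its standard measures, precision at k (P@k), can be sketched as below. The slides do not say which measures were reported, so the choice of P@k is an assumption. The rankings and relevance judgments are toy values made up for illustration and are not results from the talk.

```python
def precision_at_k(ranked_doc_ids, relevant_ids, k):
    """Fraction of the top-k ranked documents that are judged relevant."""
    top = ranked_doc_ids[:k]
    return sum(1 for doc_id in top if doc_id in relevant_ids) / k

# Toy values only: hypothetical judgments and rankings for a single query.
relevant = {"d1", "d3", "d4"}
exhaustive_ranking = ["d1", "d2", "d3", "d4"]
selective_ranking = ["d1", "d3", "d5", "d4"]

p_exhaustive = precision_at_k(exhaustive_ranking, relevant, k=4)
p_selective = precision_at_k(selective_ranking, relevant, k=4)
```

Running the same measure over both the exhaustive and the selective ranking for each query is one way to check whether searching fewer shards costs any result quality.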
Experimental Results
Cluster distribution of the 20newsgroup
dataset[7], which was collapsed to a single
directory for the test
https://github.com/rajanim/selective-search/blob/master/benchmarks/20newsgroup/output/20Newsgroup_dataset_kmeans_cluster_allocation_results.pdf
Word cloud of each cluster (shard)
https://github.com/rajanim/selective-search/blob/master/benchmarks/20newsgroup/output/20newsgroup_word_cloud_clusters.pdf
References
[1] Anagha Kulkarni. 2015. Selective Search: Efficient and Effective Large scale Search. ACM
Transactions on Information Systems, 33(4). ACM.
[2] Anagha Kulkarni. 2010. Topic-based Index Partitions for Efficient and Effective Selective
Search. 8th Workshop on Large-Scale Distributed Systems for Information Retrieval.
[3] Luo Si and Jamie Callan. 2003. Relevant document distribution estimation method for resource
selection. In Proceedings of the SIGIR Conference, pages 298–305. ACM.
[4] James Callan, Zhihong Lu, and Bruce Croft. 1995. Searching distributed collections with inference
networks. In Proceedings of the SIGIR Conference, pages 21–28. ACM.
[5] Chuang M. and Kulkarni A. 2017. Improving Shard Selection for Selective Search. In Proceedings of the
Asia Information Retrieval Societies Conference. November 2017. Jeju, Korea.
[6] Clueweb09 dataset. Lemur Project.
[7] 20Newsgroups. Jrennie. qwone.com/~jason/20Newsgroups
[8] BBC dataset. http://mlg.ucd.ie/datasets/bbc.html
[9] Job Posts dataset. https://www.kaggle.com/madhab/jobposts/data
https://github.com/rajanim/selective-search/
THANK YOU
