Efficient Query Processing in
Distributed Search Engines
Simon Jonassen
Department of Computer and Information Science
Norwegian University of Science and Technology
Outline
•  Introduction
–  Motivation
–  Research questions
–  Research process
•  Part I – Partitioned query processing
•  Part II – Skipping and pruning
•  Part III – Caching
•  Conclusions
Motivation
Popular Science, August 1937.
Motivation
•  Web search: rapid information growth and
tight performance constraints.
•  Search is used in most Web applications.
Example: Google
•  Estimated Web index size in 2012: 35-50 billion pages.
•  More than a billion visitors in 2012.
•  36 data centers in 2008.
•  Infrastructure cost in 4Q 2011: ~ 951 million USD.
techcrunch.com
Research questions
v What are the main challenges of existing methods for
efficient query processing in distributed search engines
and how can these be resolved?
² How to improve the efficiency of
Ø  partitioned query processing
Ø  query processing and pruning
Ø  caching
Research process
•  Iterative, exploratory research:
Study/
observation
Problem/
hypothesis
Solution
proposal
Quantitative
Evaluation
Publication
of results
Outline
•  Introduction
•  Part I – Partitioned query processing
–  Introduction
–  Motivation
–  Combined semi-pipelined query processing
–  Efficient compressed skipping for disjunctive text-queries
–  Pipelined query processing with skipping
–  Intra-query concurrent pipelined processing
–  Further work
•  Part II – Skipping and pruning
•  Part III – Caching
•  Conclusions
Part I: Introduction.
Inverted index approach to IR
Part I: Introduction.
Document-wise partitioning
•  Each query is processed by all of the nodes in parallel.
•  One of the nodes merges the results.
•  Advantages:
–  Simple, fast and scalable.
–  Intra-query concurrency.
•  Challenges:
–  Each query is processed by
all of the nodes.
–  Number of disk seeks.
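The merge step under document-wise partitioning can be sketched as a simple scatter-gather: each node scores its own partition and the broker merges the local top-k lists. A minimal toy, assuming each node returns (score, doc_id) pairs; all names are illustrative:

```python
import heapq

def merge_top_k(per_node_results, k):
    """Merge the local top-k lists returned by each node into a global top-k.

    per_node_results: list of lists of (score, doc_id) pairs, one per node.
    Returns the k highest-scoring pairs overall.
    """
    all_hits = (hit for node in per_node_results for hit in node)
    return heapq.nlargest(k, all_hits)

# Example: three nodes each report their local top-2.
node_a = [(0.9, 17), (0.4, 3)]
node_b = [(0.8, 42), (0.7, 8)]
node_c = [(0.6, 5), (0.2, 11)]
print(merge_top_k([node_a, node_b, node_c], k=3))
```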
Part I: Introduction.
Term-wise partitioning
•  Each node fetches posting lists and sends them to one of
the nodes, which receives and processes all of them.
•  Advantages:
–  A query involves only a few nodes.
–  Smaller number of disk seeks.
–  Inter-query concurrency.
•  Challenges:
–  A single node does all processing.
–  Other nodes act merely as advanced
network disks.
–  High network load.
–  Load balancing.
Part I: Introduction.
Pipelined query processing [Moffat et al. 2007]
•  A query bundle is routed from one node to the next.
•  Each node contributes with its own postings.
The last node extracts the top results.
•  The maximum accumulator set size
is constrained. [Lester et al. 2005]
•  Advantages:
–  The work is divided between the nodes.
–  Reduced overhead on the last node.
–  Reduced network load.
•  Challenges:
–  Long query latency.
–  Load balancing.
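One hop of this pipelined scheme can be sketched as follows: the node merges its postings into the accumulator set carried by the query bundle, then keeps the set within a size limit. This is a toy with a crude size cap standing in for Lester et al.'s adaptive pruning; names and numbers are hypothetical:

```python
def process_on_node(accumulators, postings, acc_limit):
    """One hop of pipelined processing: merge this node's postings for its
    query terms into the accumulator set, then prune if the set exceeds
    the limit (a crude stand-in for Lester et al.'s adaptive pruning).

    accumulators: dict doc_id -> partial score (the query bundle).
    postings: iterable of (doc_id, score_contribution) for this node's terms.
    """
    for doc_id, contrib in postings:
        accumulators[doc_id] = accumulators.get(doc_id, 0.0) + contrib
    if len(accumulators) > acc_limit:
        # keep only the acc_limit highest partial scores
        kept = sorted(accumulators.items(), key=lambda kv: kv[1], reverse=True)[:acc_limit]
        accumulators = dict(kept)
    return accumulators

# Two hops of a query bundle over two nodes.
acc = {}
acc = process_on_node(acc, [(1, 0.5), (2, 0.1), (3, 0.3)], acc_limit=2)
acc = process_on_node(acc, [(2, 0.9), (4, 0.2)], acc_limit=2)
print(sorted(acc.items()))
```

The last node would extract the top results from the surviving accumulators.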
Part I: Motivation.
Pipelined query processing (TP/PL)
•  Several publications find TP/PL advantageous under
certain conditions.
–  Expensive disk access, fast network, efficient pruning, high
multiprogramming level (query load), small main memory, etc.
•  TP/PL has several advantages.
–  Number of involved nodes, inter-query concurrency, etc.
•  TP/PL may outperform DP in peak throughput.
–  Requires high multiprogramming level and results in long query latency.
•  Practical results are more pessimistic.
Ø How can we improve the method?
•  The main challenges of TP/PL [Büttcher et al. 2010]:
–  Scalability, load balancing, term/node-at-a-time processing
and lack of intra-query concurrency.
Part I: Combined semi-pipelined query processing.
Motivation
•  Pipelined – higher throughput, but longer latency.
•  Non-pipelined – shorter latency, but lower throughput.
•  We wanted to combine the advantages of both methods:
short latency AND high throughput.
Part I: Combined semi-pipelined query processing.
Contributions
ü  Semi-pipelined query processing:
Part I: Combined semi-pipelined query processing.
Contributions and evaluation
ü  Decision heuristic and alternative routing strategy.
ü  Evaluated using a modified version of Terrier v2.2
(http://terrier.org):
Ø  426GB TREC GOV2 Corpus – 25 million documents.
Ø  20 000 queries from the TREC Terabyte Track Efficiency Topics
2005 (first 10 000 queries were used as a warm-up).
Ø  8 nodes with 2x2.0GHz Intel Quad-Core, 8GB RAM and a SATA
hard-disk on each node. Gigabit network.
Part I: Combined semi-pipelined query processing.
Main results
(Figure: throughput (qps) vs. latency (ms) for non-pl, plcomp and altroute+comb (α=0.2).)
Part I: Combined semi-pipelined query processing.
Post-note and motivation for other papers
•  “Inefficient” compression methods:
–  Gamma for doc ID gaps and Unary for frequencies.
•  Posting lists have to be fetched and decompressed
completely, and they remain in main memory until the
sub-query has been fully processed.
•  Can we do this better?
–  Modern “super-fast” index compression.
•  NewPFoR [Yan, Ding and Suel 2009].
–  Skipping.
–  Partial evaluation (pruning).
Part I: Efficient compressed skipping for disjunctive text-queries.
Contributions
ü  A self-skipping inverted index designed specifically for
NewPFoR compression.
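The skipping idea can be illustrated with plain Python lists standing in for NewPFoR-compressed blocks: a skip entry per block (its last doc ID) lets whole blocks be bypassed without decompression. Block size and names are made up:

```python
class SkippingList:
    """Toy self-skipping posting list: doc IDs are stored in fixed-size
    blocks (standing in for NewPFoR-compressed chunks), and a skip entry
    records the last doc ID of each block so whole blocks can be bypassed."""

    def __init__(self, doc_ids, block_size=4):
        self.blocks = [doc_ids[i:i + block_size]
                       for i in range(0, len(doc_ids), block_size)]
        self.skips = [block[-1] for block in self.blocks]  # last ID per block
        self.decoded = 0  # blocks "decompressed" so far (for illustration)

    def next_geq(self, target):
        """Return the first doc ID >= target, decoding as few blocks as possible."""
        for last_id, block in zip(self.skips, self.blocks):
            if last_id < target:
                continue           # bypass the whole block without decoding it
            self.decoded += 1      # only now would we decompress the block
            for doc_id in block:
                if doc_id >= target:
                    return doc_id
        return None

plist = SkippingList([2, 5, 9, 14, 21, 30, 33, 40, 52, 60], block_size=4)
print(plist.next_geq(31), plist.decoded)
```

Only one block is decoded to answer the lookup; the first block is skipped via its skip entry.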
Part I: Efficient compressed skipping for disjunctive text-queries.
Contributions
ü  Query processing methods for disjunctive (OR) queries:
Ø A skipping version of the space-limited
pruning method by Lester et al. 2007.
Ø A description of Max-Score that combines
the advantage of the descriptions by Turtle
and Flood 1995 and Strohman, Turtle and
Croft 2005.
Ø We also explained how to apply skipping and
presented updated performance results.
(Figure: AND vs. OR query latency.)
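The Max-Score idea above can be sketched as a toy document-at-a-time evaluation for OR queries: terms are ordered by score upper bound, and a document matching only "non-essential" low-bound terms cannot enter the top-k and is skipped. This is an in-memory illustration (no skipping or compression); all names are illustrative, not the thesis implementation:

```python
import heapq

def max_score_or(postings, ubounds, k):
    """Toy document-at-a-time Max-Score for an OR query.

    postings: dict term -> dict doc_id -> score (stand-in for posting lists).
    ubounds:  dict term -> upper bound on that term's score.
    """
    # order terms by increasing upper bound; prefix sums give the best
    # possible score of the "non-essential" prefix
    terms = sorted(postings, key=lambda t: ubounds[t])
    prefix, s = [], 0.0
    for t in terms:
        s += ubounds[t]
        prefix.append(s)

    heap = []  # min-heap of (score, doc) holding the current top-k
    docs = sorted({d for plist in postings.values() for d in plist})
    for doc in docs:
        threshold = heap[0][0] if len(heap) == k else 0.0
        # first term index whose prefix of bounds can still beat the threshold
        ess = next((i for i, p in enumerate(prefix) if p > threshold), len(terms))
        if not any(doc in postings[t] for t in terms[ess:]):
            continue  # matches only non-essential terms: cannot enter top-k
        score = sum(postings[t].get(doc, 0.0) for t in terms)
        if len(heap) < k:
            heapq.heappush(heap, (score, doc))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc))
    return sorted(heap, reverse=True)

postings = {"a": {1: 0.2, 2: 0.2}, "b": {2: 1.0, 3: 0.9}}
ubounds = {"a": 0.2, "b": 1.0}
print(max_score_or(postings, ubounds, k=1))
```

With k=1, once document 2 scores 1.2 the summed upper bounds (1.2) can no longer beat the threshold, so document 3 is skipped without being scored.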
Part I: Efficient compressed skipping for disjunctive text-queries.
Evaluation
ü  Evaluated with a real implementation:
Ø  TREC GOV2.
•  15.4 million terms, 25.2 million docs and 4.7 billion postings.
•  Index w/o skipping is 5.977GB. Skipping adds 87.1MB.
Ø  TREC Terabyte Track Efficiency Topics 2005.
•  10 000 queries with 2+ query terms.
Ø  Workstation: 2.66GHz Intel Quad-Core 2, 8GB RAM,
1TB 7200RPM SATA2. 16KB blocks by default.
Part I: Efficient compressed skipping for disjunctive text-queries.
Main results
(Figure: average query latency per method, ranging from 348 ms down to 42.2 ms.)
Part I: Pipelined query processing with skipping.
Contributions
ü  Pipelined query processing with skipping.
ü  Pipelined Max-Score with DAAT sub-query processing.
ü  Posting list assignment by
maximum scores.
Part I: Pipelined query processing with skipping.
Evaluation
•  Evaluated with a real implementation:
Ø  Same index design as in the last paper, but now extended to a
distributed system.
Ø  TREC GOV2 and Terabyte Track Efficiency Topics 2005/2006:
•  5 000 queries – warm-up, 15 000 – evaluation.
Ø  9 node cluster – same as in the first paper, plus a broker.
Part I: Pipelined query processing with skipping.
Main results
Part I: Intra-query concurrent pipelined processing.
Motivation
•  At low query loads there is available CPU/network
capacity.
•  In this case, the performance
can be improved by intra-query
concurrency.
•  How can we provide intra-query
concurrency with TP/PL?
Part I: Intra-query concurrent pipelined processing.
Contributions
ü  Parallel sub-query execution on different nodes.
Ø  Fragment-based processing:
Ø Each fragment covers a
document ID range.
Ø The number of fragments in
each query depends on the
estimated processing cost.
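Fragment creation can be sketched as splitting the document-ID space into equal ranges, with the fragment count scaling with the estimated processing cost; function and parameter names are made up:

```python
import math

def make_fragments(max_doc_id, estimated_cost, target_cost_per_fragment):
    """Split the document ID space [0, max_doc_id) into equal ranges.
    The fragment count grows with the estimated processing cost, so
    expensive queries get more units of intra-query parallelism."""
    n = max(1, math.ceil(estimated_cost / target_cost_per_fragment))
    step = math.ceil(max_doc_id / n)
    return [(lo, min(lo + step, max_doc_id)) for lo in range(0, max_doc_id, step)]

print(make_fragments(max_doc_id=1000, estimated_cost=37.0, target_cost_per_fragment=10.0))
```

A cheap query yields a single fragment (no intra-query concurrency), while an expensive one is split across several ranges that can proceed in parallel.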
Part I: Intra-query concurrent pipelined processing.
Contributions
ü  Concurrent sub-query execution on the same node.
Ø When the query load is low, different
fragments of the same query can be
processed concurrently.
Ø Concurrency depends on the number
of fragments in the query and the load
on the node.
Part I: Intra-query concurrent pipelined processing.
Evaluation and main results
ü  Evaluated with TREC GOV2, TREC Terabyte Track Efficiency
Topics 2005 (2 000 warm-up and 8 000 test queries).
Part I.
Further work
•  Evaluation of scalability:
–  So far: 8 nodes and 25 mil documents.
•  Further extension of the methods:
–  Dynamic load balancing.
–  Caching.
•  Open questions:
–  Term- vs document-wise partitioning.
–  Document- vs impact-ordered indexes.
Outline
•  Introduction
•  Part I – Partitioned query processing
•  Part II – Skipping and pruning
–  Contributions presented within Part I
–  Improving pruning efficiency via linear programming
•  Part III – Caching
•  Conclusions
Part II.
Contributions presented within Part I
ü  Efficient processing of disjunctive (OR) queries
(monolithic index).
ü  Pipelined query processing with skipping
(term-wise partitioned index).
Part II: Improving dynamic index pruning via linear programming.
Motivation and contributions
•  Pruning methods such as Max-Score and WAND rely
on tight score upper-bound estimates.
•  The upper-bound for a combination of several
terms is smaller than the sum of the individual
upper-bounds.
•  Terms and their combinations repeat.
ü  LP approach to improving the performance of
Max-Score/WAND.
Ø  Relies on LP estimation of score bounds for query
subsets.
Ø  In particular, uses upper bounds of single terms and
the most frequent term-pairs.
(Figure: example single-term score upper bounds for “anna”, “banana” and “nana”, and upper bounds for their pairwise combinations.)
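The gist can be shown with a hand-rolled special case for three terms: every candidate below follows from the LP constraints x_t ≤ u_t and x_s + x_t ≤ p_st, so their minimum is a valid upper bound. The paper solves a general LP; this is only a sketch, and the numbers are hypothetical:

```python
def upper_bound_3(u, p):
    """Upper bound on the combined score of a 3-term query {a, b, c},
    tightened with pair bounds. Each candidate is implied by the LP
    constraints x_t <= u[t] and x_s + x_t <= p[{s, t}]; their minimum
    is no looser than the plain sum of single-term bounds."""
    a, b, c = u
    ab = p[frozenset((a, b))]
    ac = p[frozenset((a, c))]
    bc = p[frozenset((b, c))]
    return min(
        u[a] + u[b] + u[c],   # sum of single-term bounds
        ab + u[c],            # one pair bound plus the remaining term
        ac + u[b],
        bc + u[a],
        (ab + ac + bc) / 2,   # half the sum of all three pair constraints
    )

# Hypothetical bounds (not measured values):
u = {"anna": 0.5, "banana": 0.6, "nana": 0.7}
p = {frozenset(("anna", "banana")): 0.6,
     frozenset(("anna", "nana")): 1.0,
     frozenset(("banana", "nana")): 1.3}
print(upper_bound_3(u, p))
```

Here the pair-tightened bound (1.3) is well below the naive sum 1.8, which is what lets Max-Score/WAND prune earlier.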
Part II: Improving dynamic index pruning via linear programming.
Evaluation and extended results
ü  Evaluated with a real implementation.
Ø  20 million Web pages, 100 000 queries from Yahoo, Max-Score.
ü  The results indicate significant I/O, decompression and
computational savings for queries of 3 to 6 terms.
LP solver latency:

  |q|   AND (ms)   OR (ms)
   3      0.002      0.002
   4      0.01       0.016
   5      0.082      0.136
   6      0.928      1.483
   7     14.794     22.676
   8    300.728    437.005

Latency improvements:

         |q|   T (ms)     TI       TI**
  AND     3    10.759    1.59%    1.64%
 (k=10)   4    10.589    1.30%    1.43%
          5     9.085    0.09%    0.97%
          6     6.256  -14.86%   -0.95%
  OR      3    18.448    3.74%    3.78%
 (k=10)   4    24.590    6.85%    6.93%
          5    33.148    7.85%    8.24%
          6    42.766    6.04%    9.08%

Query length distribution:

  |q|      %
   1    14.91%
   2    36.80%
   3    29.75%
   4    12.35%
   5     4.08%
   6     1.30%
   7     0.42%
   8+    0.39%

** 10x faster LP solver.
Part II: Improving dynamic index pruning via linear programming.
Further work
•  Larger document collection.
•  Other types of queries.
•  Reuse of previously computed scores.
•  Application to pipelined query processing.
Outline
•  Introduction
•  Part I – Partitioned query processing
•  Part II – Skipping and pruning
•  Part III – Caching
–  Modeling static caching in Web search engines
–  Prefetching query results and its impact on search engines
•  Conclusions
Part III: Modeling static caching in Web search engines.
Motivation
(Figure: two-level cache: a query is first looked up in the result cache; on a hit, post-processing returns the results; on a miss, query processing fetches posting lists through the list cache, which falls back to the index.)
•  Result cache:
•  Lower hit rate.
•  Less space consuming.
•  Significant time savings.
•  List cache:
•  Higher hit rate.
•  More space consuming.
•  Less significant time savings.
•  Finding the optimal split in a static two-level
cache. [Baeza-Yates et al. 2007]
•  Scenarios: local, LAN, WAN.
Part III: Modeling static caching in Web search engines.
Contributions, evaluation and further work
ü  A modeling approach that allows to find a nearly optimal
split in a static two-level cache.
Ø  Requires estimation of only six parameters.
ü  Evaluated with a cache-simulator:
Ø  37.9 million documents from UK domain (2006)
Ø  80.7 million queries (April 2006 - April 2007)
•  Further work:
–  Larger query log and document collection.
–  Improved model.
–  Other types of cache.
Part III: Prefetching query results and its impact on search engines.
Background and related work
•  Result caching:
–  Reuse of previously computed results to answer future queries.
•  Our data: About 60% of queries repeat.
–  Limited vs infinite-capacity caches.
•  Limited: Baeza-Yates et al. 2007, Gan and Suel 2009.
•  Infinite: Cambazoglu et al. 2010.
–  Cache staleness problem.
•  Results become stale after a while.
•  Time-to-live (TTL) invalidation.
–  Cambazoglu et al. 2010, Alici et al. 2012.
•  Result freshness.
–  Cambazoglu et al. 2010.
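The TTL invalidation scheme referenced above can be sketched as follows; this is a toy with explicit clocks instead of wall time, and the names are illustrative:

```python
class TTLResultCache:
    """Toy result cache with time-to-live invalidation: an entry older
    than ttl is treated as a miss (stale), mirroring the TTL scheme
    discussed in the related work."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.entries = {}  # query -> (results, stored_at)

    def get(self, query, now):
        hit = self.entries.get(query)
        if hit is None:
            return None            # compulsory miss
        results, stored_at = hit
        if now - stored_at > self.ttl:
            return None            # stale: expired by TTL
        return results

    def put(self, query, results, now):
        self.entries[query] = (results, now)

cache = TTLResultCache(ttl=4)
cache.put("norway", ["r1", "r2"], now=0)
print(cache.get("norway", now=3), cache.get("norway", now=10))
```

The expired lookup is what generates the backend traffic spikes that prefetching tries to smooth out.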
Part III: Prefetching query results and its impact on search engines.
Motivation
(Figure: backend query traffic (query/sec) by hour of the day for TTL = 1 hour, 4 hours, 16 hours and infinite, plotted against the peak query throughput sustainable by the backend; annotations mark the opportunity for prefetching during off-peak hours, the risk of overflow at peak hours, and the potential for load reduction.)
Part III: Prefetching query results and its impact on search engines.
Contributions
ü  Three-stage approach:
1.  Selection – Which query results to prefetch?
2.  Ordering – Which query results to prefetch first?
3.  Scheduling – When to prefetch?
ü  Two selection and ordering strategies:
–  Offline – queries issued 24 hours before.
–  Online – machine learned prediction of the next arrival.
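The three stages can be sketched as one function that selects expiring cache entries, orders them by yesterday's frequency, and caps the batch by idle capacity. The entry fields and the frequency heuristic are hypothetical simplifications of the offline strategy:

```python
def select_and_order(cache_entries, now, horizon, idle_capacity):
    """Offline-style prefetching sketch: pick cached queries whose TTL
    expires within the next `horizon` seconds, order them by how often
    they were seen the day before, and refresh as many as the idle
    backend capacity allows.

    cache_entries: list of (query, expires_at, yesterday_frequency).
    """
    expiring = [e for e in cache_entries if now <= e[1] <= now + horizon]
    expiring.sort(key=lambda e: e[2], reverse=True)  # most frequent first
    return [query for query, _, _ in expiring[:idle_capacity]]

# Hypothetical cache state: (query, expiry time, frequency yesterday).
entries = [("a", 100, 7), ("b", 50, 90), ("c", 400, 3), ("d", 120, 20)]
print(select_and_order(entries, now=40, horizon=100, idle_capacity=2))
```

Scheduling then amounts to running such batches during the off-peak hours shown in the traffic figure.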
(Figure: the user sends queries to the search engine frontend, which holds the result cache and the query prefetching module in front of the backend search clusters.)
Part III: Prefetching query results and its impact on search engines.
Evaluation and main results
ü  Evaluated with an event-driven search engine simulator and
Web queries sampled Yahoo! Logs.
Ø  Millions of queries, billions of documents, thousands of nodes.
Ø  Methods: no-ttl, no-prefetch, age-dg, online, offline.
ü  The results show that prefetching can improve the hit rate,
query response time, result freshness and query degradation
rate relative to the state-of-the-art baseline.
ü  The key aspect is the prediction methodology.
(Figure: average query response time (ms) by hour of the day for no-prefetch, no-ttl, on-agedg (R=T/4) and oracle-agedg (R=T/4).)
Average query response time (1020 nodes).
(Figure: average degradation rate (x10^-4) by hour of the day for the same methods.)
Average degradation (850 nodes).
Part III: Prefetching query results and its impact on search engines.
Further work
•  Improved prediction model.
•  Speculative prefetching.
•  Economic evaluation.
Outline
•  Introduction
•  Part I – Partitioned query processing
•  Part II – Skipping and pruning
•  Part III – Caching
•  Conclusions
Conclusions
•  We have contributed to resolving some of the most
important challenges in distributed query processing.
•  Our results indicate significant improvements over
the state-of-the-art methods.
•  Practical applicability and implications, however, need
further evaluation in real-life settings.
•  There are several opportunities for future work.
Key references
•  Alici et al.: “Adaptive Time-to-Live Strategies for Query Result Caching in Web Search Engines”. In
Proc. ECIR 2012, pp.401-412.
•  Baeza-Yates et al.: “The impact of caching on search engines”. In Proc. SIGIR 2007, pp.183-190.
•  Büttcher, Clarke and Cormack: “Information Retrieval – Implementing and Evaluating Search Engines”,
2010.
•  Cambazoglu et al.: “A refreshing perspective of search engine caching”. In Proc. WWW 2010, pp.
181-190.
•  Gan and Suel: “Improved techniques for result caching in web search engines”. In Proc. WWW 2009,
pp. 431-440.
•  Lester et al.: “Space-Limited Ranked Query Evaluation Using Adaptive Pruning”. In Proc. WISE 2005,
pp. 470-477.
•  Moffat et al.: “A pipelined architecture for distributed text query evaluation”. Inf. Retr. 10(3) 2007, pp.
205-231.
•  Strohman, Turtle and Croft: “Optimization strategies for complex queries”. In Proc. SIGIR 2005, pp.
175-182.
•  Turtle and Flood: “Query Evaluation: Strategies and Optimizations”. Inf. Process. Manage. 31(6) 1995,
pp. 838-849.
•  Yan, Ding and Suel: “Inverted index compression and query processing with optimized document
ordering”. In Proc. WWW 2009, pp. 401-410.
Publications
1.  Baeza-Yates and Jonassen: “Modeling Static Caching in Web Search Engines”. In Proc. ECIR 2012,
pp. 436-446.
2.  Jonassen and Bratsberg: “Improving the Performance of Pipelined Query Processing with Skipping”. In
Proc. WISE 2012, pp. 1-15.
3.  Jonassen and Bratsberg: “Intra-Query Concurrent Pipelined Processing for Distributed Full-Text
Retrieval”. In Proc. ECIR 2012, pp. 413-425.
4.  Jonassen and Bratsberg: “Efficient Compressed Inverted Index Skipping for Disjunctive Text-Queries”.
In Proc. ECIR 2011, pp. 530-542.
5.  Jonassen and Bratsberg: “A Combined Semi-Pipelined Query Processing Architecture for Distributed
Full-Text Retrieval”. In Proc. WISE 2010, pp. 587-601.
6.  Jonassen and Cambazoglu: “Improving Dynamic Index Pruning via Linear Programming”. In progress.
7.  Jonassen, Cambazoglu and Silvestri: “Prefetching Query Results and its Impact on Search Engines”.
In Proc. SIGIR 2012, pp. 631-640.
Secondary (Not included):
1.  Cambazoglu et al.: “A Term-Based Inverted Index Partitioning Model for Efficient Query Processing”.
Under submission.
2.  Rocha-Junior et al.: “Efficient Processing of Top-k Spatial Keyword Queries”. In Proc. SSTD 2011, pp.
205-222.
3.  Jonassen: “Scalable Search Platform: Improving Pipelined Query Processing for Distributed Full-Text
Retrieval”. In Proc. WWW (Comp. Vol.) 2012, pp. 145-150.
4.  Jonassen and Bratsberg: “Impact of the Query Model and System Settings on Performance of
Distributed Inverted Indexes”. In Proc. NIK 2009, pp. 143-151.
Thanks!!!
HyperLogLog and friends
Simon Lia-Jonassen
 
No more bad news!
No more bad news!No more bad news!
No more bad news!
Simon Lia-Jonassen
 
Xgboost: A Scalable Tree Boosting System - Explained
Xgboost: A Scalable Tree Boosting System - ExplainedXgboost: A Scalable Tree Boosting System - Explained
Xgboost: A Scalable Tree Boosting System - Explained
Simon Lia-Jonassen
 
Chatbots are coming!
Chatbots are coming!Chatbots are coming!
Chatbots are coming!
Simon Lia-Jonassen
 
Large-Scale Real-Time Data Management for Engagement and Monetization
Large-Scale Real-Time Data Management for Engagement and MonetizationLarge-Scale Real-Time Data Management for Engagement and Monetization
Large-Scale Real-Time Data Management for Engagement and Monetization
Simon Lia-Jonassen
 
Leveraging Big Data and Real-Time Analytics at Cxense
Leveraging Big Data and Real-Time Analytics at CxenseLeveraging Big Data and Real-Time Analytics at Cxense
Leveraging Big Data and Real-Time Analytics at Cxense
Simon Lia-Jonassen
 
Yet another intro to Apache Spark
Yet another intro to Apache SparkYet another intro to Apache Spark
Yet another intro to Apache Spark
Simon Lia-Jonassen
 

More from Simon Lia-Jonassen (7)

HyperLogLog and friends
HyperLogLog and friendsHyperLogLog and friends
HyperLogLog and friends
 
No more bad news!
No more bad news!No more bad news!
No more bad news!
 
Xgboost: A Scalable Tree Boosting System - Explained
Xgboost: A Scalable Tree Boosting System - ExplainedXgboost: A Scalable Tree Boosting System - Explained
Xgboost: A Scalable Tree Boosting System - Explained
 
Chatbots are coming!
Chatbots are coming!Chatbots are coming!
Chatbots are coming!
 
Large-Scale Real-Time Data Management for Engagement and Monetization
Large-Scale Real-Time Data Management for Engagement and MonetizationLarge-Scale Real-Time Data Management for Engagement and Monetization
Large-Scale Real-Time Data Management for Engagement and Monetization
 
Leveraging Big Data and Real-Time Analytics at Cxense
Leveraging Big Data and Real-Time Analytics at CxenseLeveraging Big Data and Real-Time Analytics at Cxense
Leveraging Big Data and Real-Time Analytics at Cxense
 
Yet another intro to Apache Spark
Yet another intro to Apache SparkYet another intro to Apache Spark
Yet another intro to Apache Spark
 

Recently uploaded

Operating System Used by Users in day-to-day life.pptx
Operating System Used by Users in day-to-day life.pptxOperating System Used by Users in day-to-day life.pptx
Operating System Used by Users in day-to-day life.pptx
Pravash Chandra Das
 
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Jeffrey Haguewood
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Alpen-Adria-Universität
 
dbms calicut university B. sc Cs 4th sem.pdf
dbms  calicut university B. sc Cs 4th sem.pdfdbms  calicut university B. sc Cs 4th sem.pdf
dbms calicut university B. sc Cs 4th sem.pdf
Shinana2
 
UI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentationUI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentation
Wouter Lemaire
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Tosin Akinosho
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
A Comprehensive Guide to DeFi Development Services in 2024
A Comprehensive Guide to DeFi Development Services in 2024A Comprehensive Guide to DeFi Development Services in 2024
A Comprehensive Guide to DeFi Development Services in 2024
Intelisync
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
ssuserfac0301
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
Chart Kalyan
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
 
Trusted Execution Environment for Decentralized Process Mining
Trusted Execution Environment for Decentralized Process MiningTrusted Execution Environment for Decentralized Process Mining
Trusted Execution Environment for Decentralized Process Mining
LucaBarbaro3
 
Recommendation System using RAG Architecture
Recommendation System using RAG ArchitectureRecommendation System using RAG Architecture
Recommendation System using RAG Architecture
fredae14
 
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
alexjohnson7307
 
Azure API Management to expose backend services securely
Azure API Management to expose backend services securelyAzure API Management to expose backend services securely
Azure API Management to expose backend services securely
Dinusha Kumarasiri
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
Pixlogix Infotech
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
Zilliz
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
DanBrown980551
 

Recently uploaded (20)

Operating System Used by Users in day-to-day life.pptx
Operating System Used by Users in day-to-day life.pptxOperating System Used by Users in day-to-day life.pptx
Operating System Used by Users in day-to-day life.pptx
 
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
 
dbms calicut university B. sc Cs 4th sem.pdf
dbms  calicut university B. sc Cs 4th sem.pdfdbms  calicut university B. sc Cs 4th sem.pdf
dbms calicut university B. sc Cs 4th sem.pdf
 
UI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentationUI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentation
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
A Comprehensive Guide to DeFi Development Services in 2024
A Comprehensive Guide to DeFi Development Services in 2024A Comprehensive Guide to DeFi Development Services in 2024
A Comprehensive Guide to DeFi Development Services in 2024
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
Trusted Execution Environment for Decentralized Process Mining
Trusted Execution Environment for Decentralized Process MiningTrusted Execution Environment for Decentralized Process Mining
Trusted Execution Environment for Decentralized Process Mining
 
Recommendation System using RAG Architecture
Recommendation System using RAG ArchitectureRecommendation System using RAG Architecture
Recommendation System using RAG Architecture
 
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
 
Azure API Management to expose backend services securely
Azure API Management to expose backend services securelyAzure API Management to expose backend services securely
Azure API Management to expose backend services securely
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
 

Efficient Query Processing in Distributed Search Engines

  • 11. Part I: Introduction. Term-wise partitioning
    •  Each node fetches posting lists and sends them to one of the nodes, which receives and processes all of them.
    •  Advantages:
       –  A query involves only a few nodes.
       –  Smaller number of disk seeks.
       –  Inter-query concurrency.
    •  Challenges:
       –  A single node does all the processing.
       –  Other nodes act just like advanced network disks.
       –  High network load.
       –  Load balancing.
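The idea above can be sketched in a few lines of Python: each term's full posting list lives on exactly one node, so a query only has to contact the nodes that hold its terms. The hash-based assignment and the node count are illustrative assumptions, not the thesis' actual assignment policy.

```python
import zlib

NUM_NODES = 4  # assumed cluster size for illustration

def node_for_term(term: str) -> int:
    # Deterministic assignment of each term's full posting list to one node.
    return zlib.crc32(term.encode()) % NUM_NODES

def nodes_for_query(terms):
    # A query involves only the nodes holding its terms -- often just a few.
    return {node_for_term(t) for t in terms}
```

This is where the advantage (few nodes, few disk seeks per query) and the challenge (one node ends up doing all the processing) both come from.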
  • 12. Part I: Introduction. Pipelined query processing [Moffat et al. 2007]
    •  A query bundle is routed from one node to the next.
    •  Each node contributes its own postings. The last node extracts the top results.
    •  The maximum accumulator set size is constrained. [Lester et al. 2005]
    •  Advantages:
       –  The work is divided between the nodes.
       –  Reduced overhead on the last node.
       –  Reduced network load.
    •  Challenges:
       –  Long query latency.
       –  Load balancing.
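The routing just described can be sketched as follows. The bundle structure, the scoring, and the crude trimming policy used to constrain the accumulator set are illustrative stand-ins, not the exact algorithms of Moffat et al. or Lester et al.

```python
import heapq

def pipelined_query(route, postings_by_node, terms, k=10, max_accs=4):
    """Toy pipelined (TP/PL) evaluation: the accumulator set travels along
    `route`; each node adds the postings of the query terms it holds; the
    last node extracts the top-k results."""
    accumulators = {}                       # doc_id -> partial score
    for node in route:
        # this node contributes its own postings
        for term in terms:
            for doc_id, score in postings_by_node[node].get(term, []):
                accumulators[doc_id] = accumulators.get(doc_id, 0.0) + score
        # constrain the accumulator set size (crude stand-in for
        # space-limited adaptive pruning: keep only the best entries)
        if len(accumulators) > max_accs:
            keep = heapq.nlargest(max_accs, accumulators.items(),
                                  key=lambda kv: kv[1])
            accumulators = dict(keep)
    # the last node on the route extracts the top-k results
    return heapq.nlargest(k, accumulators.items(), key=lambda kv: kv[1])
```

The sequential walk along `route` is also where the long query latency comes from: no node can start before the previous one finishes.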
  • 13. Part I: Motivation. Pipelined query processing (TP/PL)
    •  Several publications find TP/PL advantageous under certain conditions:
       –  Expensive disk access, fast network, efficient pruning, high multiprogramming level (query load), small main memory, etc.
    •  TP/PL has several advantages:
       –  Number of involved nodes, inter-query concurrency, etc.
    •  TP/PL may outperform DP in peak throughput.
       –  Requires a high multiprogramming level and results in long query latency.
    •  Practical results are more pessimistic.
    Ø  How can we improve the method?
    •  The main challenges of TP/PL [Büttcher et al. 2010]:
       –  Scalability, load balancing, term/node-at-a-time processing and lack of intra-query concurrency.
  • 14. Part I: Combined semi-pipelined query processing. Motivation
    •  Pipelined – higher throughput, but longer latency.
    •  Non-pipelined – shorter latency, but lower throughput.
    •  We wanted to combine the advantages of both methods: short latency AND high throughput.
  • 15. Part I: Combined semi-pipelined query processing. Contributions
    ü  Semi-pipelined query processing. (diagram omitted)
  • 16. Part I: Combined semi-pipelined query processing. Contributions and evaluation
    ü  Decision heuristic and alternative routing strategy.
    ü  Evaluated using a modified version of Terrier v2.2 (http://terrier.org):
       Ø  426GB TREC GOV2 corpus – 25 million documents.
       Ø  20 000 queries from the TREC Terabyte Track Efficiency Topics 2005 (the first 10 000 queries were used as a warm-up).
       Ø  8 nodes, each with 2x2.0GHz Intel Quad-Core CPUs, 8GB RAM and a SATA hard disk. Gigabit network.
  • 17. Part I: Combined semi-pipelined query processing. Main results
    (Plot omitted: throughput (qps) vs. latency (ms) for non-pl, plcomp and altroute+comb α=0.2.)
  • 18. Part I: Combined semi-pipelined query processing. Post-note and motivation for other papers
    •  “Inefficient” compression methods:
       –  Gamma for doc ID gaps and Unary for frequencies.
    •  Posting lists have to be fetched and decompressed completely and remain in main memory until the sub-query has been fully processed.
    •  Can we do this better?
       –  Modern “super-fast” index compression: NewPFoR [Yan, Ding and Suel 2009].
       –  Skipping.
       –  Partial evaluation (pruning).
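For reference, the Gamma coding of doc-ID gaps mentioned above can be sketched like this, using bit strings instead of real bit packing for readability. NewPFoR, by contrast, packs blocks of gaps into word-aligned frames with exception handling and is considerably more involved.

```python
def gamma_encode(n: int) -> str:
    """Elias gamma code for a positive integer: a unary prefix giving the
    binary length, followed by the binary representation itself."""
    b = bin(n)[2:]                     # binary, no leading zeros
    return "0" * (len(b) - 1) + b

def gamma_decode(bits: str):
    """Decode a concatenation of gamma codes back into integers."""
    out, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":          # count the unary length prefix
            zeros += 1
            i += 1
        out.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return out
```

Small gaps cost few bits (a gap of 1 is a single bit), which is why the doc IDs are gap-encoded first; the price is strictly sequential, bit-by-bit decoding.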
  • 19. Part I: Efficient compressed skipping for disjunctive text-queries. Contributions
    ü  A self-skipping inverted index designed specifically for NewPFoR compression.
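A toy illustration of the self-skipping idea: postings are grouped into blocks and a skip entry records each block's last doc ID, so a lookup can jump past whole blocks without touching their contents. The block size and the plain-list stand-in for compressed blocks are illustrative assumptions; the actual index interleaves skip data with NewPFoR-compressed chunks.

```python
from bisect import bisect_left

class SkippingPostingList:
    def __init__(self, doc_ids, block_size=4):
        # group sorted doc IDs into fixed-size blocks ("compressed chunks")
        self.blocks = [doc_ids[i:i + block_size]
                       for i in range(0, len(doc_ids), block_size)]
        # skip entry: the last doc ID of each block
        self.skips = [b[-1] for b in self.blocks]

    def skip_to(self, target):
        """Return the first doc ID >= target, or None if none exists."""
        bi = bisect_left(self.skips, target)   # jump over earlier blocks
        if bi == len(self.blocks):
            return None
        block = self.blocks[bi]                # only this block is "decoded"
        return block[bisect_left(block, target)]
```

With compressed blocks, every block skipped is a block that never has to be decompressed, which is what makes skipping pay off for conjunctive and pruned disjunctive evaluation.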
  • 20. Part I: Efficient compressed skipping for disjunctive text-queries. Contributions
    ü  Query processing methods for disjunctive (OR) queries:
       Ø  A skipping version of the space-limited pruning method by Lester et al. 2007.
       Ø  A description of Max-Score that combines the advantages of the descriptions by Turtle and Flood 1995 and Strohman, Turtle and Croft 2005.
       Ø  We also explained how to apply skipping and presented updated performance results.
    (AND/OR latency chart omitted.)
  • 21. Part I: Efficient compressed skipping for disjunctive text-queries. Evaluation
    ü  Evaluated with a real implementation:
       Ø  TREC GOV2:
          •  15.4 million terms, 25.2 million docs and 4.7 billion postings.
          •  Index w/o skipping is 5.977GB; skipping adds 87.1MB.
       Ø  TREC Terabyte Track Efficiency Topics 2005:
          •  10 000 queries with 2+ query terms.
       Ø  Workstation: 2.66GHz Intel Quad-Core 2, 8GB RAM, 1TB 7200RPM SATA2. 16KB blocks by default.
  • 22. Part I: Efficient compressed skipping for disjunctive text-queries. Main results
    (Bar chart omitted; reported values: 348, 260, 93.6, 51.8, 44.4 and 283, 90.7, 120, 47.8, 42.2.)
  • 23. Part I: Pipelined query processing with skipping. Contributions
    ü  Pipelined query processing with skipping.
    ü  Pipelined Max-Score with DAAT sub-query processing.
    ü  Posting list assignment by maximum scores.
  • 24. Part I: Pipelined query processing with skipping. Evaluation
    •  Evaluated with a real implementation:
       Ø  Same index design as in the last paper, but now extended to a distributed system.
       Ø  TREC GOV2 and Terabyte Track Efficiency Topics 2005/2006:
          •  5 000 queries – warm-up, 15 000 – evaluation.
       Ø  9-node cluster – same as in the first paper, plus a broker.
  • 25. Part I: Pipelined query processing with skipping. Main results
  • 26. Part I: Intra-query concurrent pipelined processing. Motivation
    •  At low query loads there is available CPU/network capacity.
    •  In this case, the performance can be improved by intra-query concurrency.
    •  How can we provide intra-query concurrency with TP/PL?
  • 27. Part I: Intra-query concurrent pipelined processing. Contributions
    ü  Parallel sub-query execution on different nodes:
       Ø  Fragment-based processing:
          –  Each fragment covers a document ID range.
          –  The number of fragments in each query depends on the estimated processing cost.
  • 28. Part I: Intra-query concurrent pipelined processing. Contributions
    ü  Concurrent sub-query execution on the same node:
       Ø  When the query load is low, different fragments of the same query can be processed concurrently.
       Ø  Concurrency depends on the number of fragments in the query and the load on the node.
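The fragment mechanism from the last two slides can be sketched as follows: the doc-ID space is cut into ranges, the fragment count scales with the estimated processing cost, and fragments are then processed concurrently. The cost constants, the cap of 8 fragments, and the trivial per-fragment work are illustrative assumptions.

```python
import concurrent.futures

def make_fragments(max_doc_id, estimated_cost, cost_per_fragment=1000):
    """Split the doc-ID space [0, max_doc_id) into ranges; the fragment
    count grows with the estimated processing cost."""
    n = max(1, min(8, estimated_cost // cost_per_fragment))
    step = -(-max_doc_id // n)                     # ceiling division
    return [(lo, min(lo + step, max_doc_id))
            for lo in range(0, max_doc_id, step)]

def process_fragment(postings, lo, hi):
    # each fragment covers a document-ID range, so different fragments
    # can be processed independently (here: trivial range filtering)
    return {d: s for d, s in postings.items() if lo <= d < hi}

def concurrent_query(postings, max_doc_id, estimated_cost):
    fragments = make_fragments(max_doc_id, estimated_cost)
    merged = {}
    with concurrent.futures.ThreadPoolExecutor() as pool:
        for part in pool.map(lambda f: process_fragment(postings, *f),
                             fragments):
            merged.update(part)
    return merged
```

Cheap queries get a single fragment (no concurrency overhead); expensive queries get more fragments, which a lightly loaded node can run in parallel.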
  • 29. Part I: Intra-query concurrent pipelined processing. Evaluation and main results
    ü  Evaluated with TREC GOV2 and TREC Terabyte Track Efficiency Topics 2005 (2 000 warm-up and 8 000 test queries).
  • 30. Part I. Further work
    •  Evaluation of scalability:
       –  So far: 8 nodes and 25 million documents.
    •  Further extension of the methods:
       –  Dynamic load balancing.
       –  Caching.
    •  Open questions:
       –  Term- vs. document-wise partitioning.
       –  Document- vs. impact-ordered indexes.
  • 31. Outline
    •  Introduction
    •  Part I – Partitioned query processing
    •  Part II – Skipping and pruning
       –  Contributions presented within Part I
       –  Improving pruning efficiency via linear programming
    •  Part III – Caching
    •  Conclusions
  • 32. Part II. Contributions presented within Part I
    ü  Efficient processing of disjunctive (OR) queries (monolithic index).
    ü  Pipelined query processing with skipping (term-wise partitioned index).
  • 33. Part II: Improving dynamic index pruning via linear programming. Motivation and contributions
    •  Pruning methods such as Max-Score and WAND rely on tight score upper-bound estimates.
    •  The upper bound for a combination of several terms is smaller than the sum of the individual upper bounds.
    •  Terms and their combinations repeat.
    ü  An LP approach to improving the performance of Max-Score/WAND:
       Ø  Relies on LP estimation of score bounds for query subsets.
       Ø  In particular, uses upper bounds of single terms and the most frequent term pairs.
    (Example: single-term bounds anna 0.5, banana 0.6, nana 0.7; pair bounds 0.6, 1.0 and 1.3.)
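The slide's example can be checked numerically: maximizing the summed per-term scores subject to the single-term and pair bounds gives 1.3, noticeably tighter than the naive 0.5 + 0.6 + 0.7 = 1.8. Here a coarse grid search stands in for a real LP solver; the pair-to-bound mapping is an assumption about which pair each number belongs to.

```python
from itertools import product

# Score upper bounds from the slide's example.
SINGLE = {"anna": 0.5, "banana": 0.6, "nana": 0.7}
PAIR = {("anna", "banana"): 0.6, ("anna", "nana"): 1.0, ("banana", "nana"): 1.3}

def lp_bound(step=0.05):
    """Approximate the LP optimum by grid search (a real system would call
    an LP solver): maximize the sum of per-term scores subject to the
    single-term bounds and the pairwise bounds x_a + x_b <= PAIR[(a, b)]."""
    terms = sorted(SINGLE)
    axes = [[i * step for i in range(round(SINGLE[t] / step) + 1)]
            for t in terms]
    best = 0.0
    for xs in product(*axes):
        x = dict(zip(terms, xs))
        if all(x[a] + x[b] <= bound + 1e-9
               for (a, b), bound in PAIR.items()):
            best = max(best, sum(xs))
    return best
```

A tighter bound lets Max-Score/WAND declare more term subsets non-essential, so more postings are skipped.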
  • 34. Part II: Improving dynamic index pruning via linear programming. Evaluation and extended results
    ü  Evaluated with a real implementation:
       Ø  20 million Web pages, 100 000 queries from Yahoo, Max-Score.
    ü  The results indicate significant I/O, decompression and computational savings for queries of 3 to 6 terms.

    LP solver latency:
      |q|   AND (ms)   OR (ms)
       3      0.002      0.002
       4      0.01       0.016
       5      0.082      0.136
       6      0.928      1.483
       7     14.794     22.676
       8    300.728    437.005

    Latency improvements (k=10):
             |q|   T (ms)      TI       TI**
      AND     3    10.759    1.59%     1.64%
              4    10.589    1.30%     1.43%
              5     9.085    0.09%     0.97%
              6     6.256  -14.86%    -0.95%
      OR      3    18.448    3.74%     3.78%
              4    24.590    6.85%     6.93%
              5    33.148    7.85%     8.24%
              6    42.766    6.04%     9.08%
      ** With a 10x faster LP solver.

    Query length distribution:
      |q|      %
       1    14.91%
       2    36.80%
       3    29.75%
       4    12.35%
       5     4.08%
       6     1.30%
       7     0.42%
       8+    0.39%
  • 35. Part II: Improving dynamic index pruning via linear programming. Further work
    •  Larger document collection.
    •  Other types of queries.
    •  Reuse of previously computed scores.
    •  Application to pipelined query processing.
  • 36. Outline
    •  Introduction
    •  Part I – Partitioned query processing
    •  Part II – Skipping and pruning
    •  Part III – Caching
       –  Modeling static caching in Web search engines
       –  Prefetching query results and its impact on search engines
    •  Conclusions
  • 37. Part III: Modeling static caching in Web search engines. Motivation
    (Diagram omitted: query pre-processing → result cache lookup (hit/miss) → query processing with list cache and index lookups → query post-processing.)
    •  Result cache:
       –  Lower hit rate.
       –  Less space consuming.
       –  Significant time savings.
    •  List cache:
       –  Higher hit rate.
       –  More space consuming.
       –  Less significant time savings.
    •  Finding the optimal split in a static two-level cache. [Baeza-Yates et al. 2007]
    •  Scenarios: local, LAN, WAN.
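The split the model searches for can be illustrated with a toy static fill: a fraction of the space budget caches the most frequent query results (assumed to cost one unit each), and the rest caches the most frequent posting lists (costing their list size). Unit costs and the frequency-ordered fill are simplifying assumptions for illustration.

```python
def fill_static_caches(query_freq, term_freq, list_size, budget, result_share):
    """Fill a static two-level cache for a given split: `result_share` of
    the budget goes to result-cache entries, the rest to posting lists."""
    result_budget = int(budget * result_share)
    # result cache: the most frequent queries, one space unit each
    results = set()
    for q, _ in sorted(query_freq.items(), key=lambda kv: -kv[1]):
        if len(results) < result_budget:
            results.add(q)
    # list cache: the most frequent terms that still fit in the remainder
    lists, used = set(), 0
    for t, _ in sorted(term_freq.items(), key=lambda kv: -kv[1]):
        if used + list_size[t] <= budget - result_budget:
            lists.add(t)
            used += list_size[t]
    return results, lists
```

Sweeping `result_share` from 0 to 1 and measuring the combined time savings is what a simulator does exhaustively; the modeling approach on the next slide estimates a near-optimal split analytically instead.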
  • 38. Part III: Modeling static caching in Web search engines. Contributions, evaluation and further work
    ü  A modeling approach that makes it possible to find a nearly optimal split in a static two-level cache:
       Ø  Requires estimation of only six parameters.
    ü  Evaluated with a cache simulator:
       Ø  37.9 million documents from the UK domain (2006).
       Ø  80.7 million queries (April 2006 – April 2007).
    •  Further work:
       –  Larger query log and document collection.
       –  Improved model.
       –  Other types of cache.
  • 39. Part III: Prefetching query results and its impact on search engines. Background and related work
    •  Result caching:
       –  Reuse of previously computed results to answer future queries.
          •  Our data: about 60% of queries repeat.
       –  Limited vs. infinite-capacity caches.
          •  Limited: Baeza-Yates et al. 2007, Gan and Suel 2009.
          •  Infinite: Cambazoglu et al. 2010.
       –  Cache staleness problem.
          •  Results become stale after a while.
          •  Time-to-live (TTL) invalidation: Cambazoglu et al. 2010, Alici et al. 2012.
          •  Result freshness: Cambazoglu et al. 2010.
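TTL invalidation as described above can be sketched in a few lines; the clock is passed in explicitly to keep the sketch deterministic, and the flat dictionary store is an illustrative simplification.

```python
class TTLResultCache:
    """Toy result cache with time-to-live invalidation: an entry older
    than `ttl` is treated as stale (a miss), bounding staleness at the
    price of extra backend traffic."""
    def __init__(self, ttl):
        self.ttl = ttl
        self.store = {}                  # query -> (results, cached_at)

    def get(self, query, now):
        entry = self.store.get(query)
        if entry is None or now - entry[1] > self.ttl:
            return None                  # cold miss or stale entry
        return entry[0]

    def put(self, query, results, now):
        self.store[query] = (results, now)
```

The TTL trade-off on the next slide falls straight out of this: a short TTL keeps results fresh but re-expires popular queries constantly, inflating backend traffic; a long TTL does the opposite.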
  • 40. Part III: Prefetching query results and its impact on search engines. Motivation
    (Plot omitted: backend query traffic (queries/sec) over the hour of the day for TTL = 1 hour, 4 hours, 16 hours and infinite, annotated with the opportunity for prefetching, the risk of overflow, the potential for load reduction, and the peak query throughput sustainable by the backend.)
  • 41. Part III: Prefetching query results and its impact on search engines. Contributions
    ü  Three-stage approach:
       1.  Selection – which query results to prefetch?
       2.  Ordering – which query results to prefetch first?
       3.  Scheduling – when to prefetch?
    ü  Two selection and ordering strategies:
       –  Offline – queries issued 24 hours before.
       –  Online – machine-learned prediction of the next arrival.
    (Diagram omitted: user → search engine frontend with result cache and query prefetching module → backend search clusters.)
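The scheduling stage can be sketched as filling idle backend capacity with the already-selected and ordered prefetch candidates. Hourly load buckets and a uniform per-query cost are simplifying assumptions.

```python
def schedule_prefetches(hourly_load, capacity, candidates):
    """Toy scheduling stage: spend only idle backend capacity on
    prefetching, consuming candidates in the order produced by the
    selection and ordering stages. Returns hour -> prefetched queries."""
    plan, queue = {}, list(candidates)
    for hour, load in enumerate(hourly_load):
        idle = max(0, capacity - load)   # capacity left under the peak
        plan[hour], queue = queue[:idle], queue[idle:]
    return plan
```

This directly mirrors the previous slide's annotations: prefetching happens in the low-traffic trough, and nothing is scheduled in hours where the backend is already at its sustainable peak.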
  • 42. Part III: Prefetching query results and its impact on search engines. Evaluation and main results
    ü  Evaluated with an event-driven search engine simulator and Web queries sampled from Yahoo! logs:
       Ø  Millions of queries, billions of documents, thousands of nodes.
       Ø  Methods: no-ttl, no-prefetch, age-dg, online, offline.
    ü  The results show that prefetching can improve the hit rate, query response time, result freshness and query degradation rate relative to the state-of-the-art baseline.
    ü  The key aspect is the prediction methodology.
  • 43. (Plot omitted: average query response time (ms) over the hour of the day on 1020 nodes, for no-prefetch, no-ttl, on-agedg (R=T/4) and oracle-agedg (R=T/4).)
  • 44. (Plot omitted: average degradation rate (x10^-4) over the hour of the day on 850 nodes, for the same methods.)
  • 45. Part III: Prefetching query results and its impact on search engines. Further work
    •  Improved prediction model.
    •  Speculative prefetching.
    •  Economic evaluation.
  • 46. Outline
    •  Introduction
    •  Part I – Partitioned query processing
    •  Part II – Skipping and pruning
    •  Part III – Caching
    •  Conclusions
  • 47. Conclusions
    •  We have contributed to resolving some of the most important challenges in distributed query processing.
    •  Our results indicate significant improvements over state-of-the-art methods.
    •  Practical applicability and implications, however, need further evaluation in real-life settings.
    •  There are several opportunities for future work.
  • 48. Key references
    •  Alici et al.: “Adaptive Time-to-Live Strategies for Query Result Caching in Web Search Engines”. In Proc. ECIR 2012, pp. 401-412.
    •  Baeza-Yates et al.: “The impact of caching on search engines”. In Proc. SIGIR 2007, pp. 183-190.
    •  Büttcher, Clarke and Cormack: “Information Retrieval – Implementing and Evaluating Search Engines”, 2010.
    •  Cambazoglu et al.: “A refreshing perspective of search engine caching”. In Proc. WWW 2010, pp. 181-190.
    •  Gan and Suel: “Improved techniques for result caching in web search engines”. In Proc. WWW 2009, pp. 431-440.
    •  Lester et al.: “Space-Limited Ranked Query Evaluation Using Adaptive Pruning”. In Proc. WISE 2005, pp. 470-477.
    •  Moffat et al.: “A pipelined architecture for distributed text query evaluation”. Inf. Retr. 10(3), 2007, pp. 205-231.
    •  Strohman, Turtle and Croft: “Optimization strategies for complex queries”. In Proc. SIGIR 2005, pp. 175-182.
    •  Turtle and Flood: “Query Evaluation: Strategies and Optimizations”. Inf. Process. Manage. 31(6), 1995, pp. 838-849.
    •  Yan, Ding and Suel: “Inverted index compression and query processing with optimized document ordering”. In Proc. WWW 2009, pp. 401-410.
  • 49. Publications
    1.  Baeza-Yates and Jonassen: “Modeling Static Caching in Web Search Engines”. In Proc. ECIR 2012, pp. 436-446.
    2.  Jonassen and Bratsberg: “Improving the Performance of Pipelined Query Processing with Skipping”. In Proc. WISE 2012, pp. 1-15.
    3.  Jonassen and Bratsberg: “Intra-Query Concurrent Pipelined Processing for Distributed Full-Text Retrieval”. In Proc. ECIR 2012, pp. 413-425.
    4.  Jonassen and Bratsberg: “Efficient Compressed Inverted Index Skipping for Disjunctive Text-Queries”. In Proc. ECIR 2011, pp. 530-542.
    5.  Jonassen and Bratsberg: “A Combined Semi-Pipelined Query Processing Architecture for Distributed Full-Text Retrieval”. In Proc. WISE 2010, pp. 587-601.
    6.  Jonassen and Cambazoglu: “Improving Dynamic Index Pruning via Linear Programming”. In progress.
    7.  Jonassen, Cambazoglu and Silvestri: “Prefetching Query Results and its Impact on Search Engines”. In Proc. SIGIR 2012, pp. 631-640.
    Secondary (not included):
    1.  Cambazoglu et al.: “A Term-Based Inverted Index Partitioning Model for Efficient Query Processing”. Under submission.
    2.  Rocha-Junior et al.: “Efficient Processing of Top-k Spatial Keyword Queries”. In Proc. SSTD 2011, pp. 205-222.
    3.  Jonassen: “Scalable Search Platform: Improving Pipelined Query Processing for Distributed Full-Text Retrieval”. In Proc. WWW (Comp. Vol.) 2012, pp. 145-150.
    4.  Jonassen and Bratsberg: “Impact of the Query Model and System Settings on Performance of Distributed Inverted Indexes”. In Proc. NIK 2009, pp. 143-151.