1. List Intersection for Web Search: Algorithms, Cost Models, and Optimizations
Sunghwan Kim (POSTECH), Taesung Lee (IBM Research AI),
Seung-won Hwang (Yonsei University), Sameh Elnikety (Microsoft Research)
VLDB 2019
2. List Intersection in Web Search
Multi-word query in web search engine: list intersection of posting lists.
Corpus: document → word list. Posting list: word → document list.
ℐ(𝑥) = posting list of word 𝑥.

Doc ID  Text
105     … research, so to generate the optimal query plan for the given scenario, as commonly used in the database systems …
…       …
592     … My research interests are in database system and data-driven intelligence, …
…       …

ℐ(𝑥)          document IDs
database      … 105 … 592 842 …
system        … 105 … 592 751 …
research      … 105 … 592 642 …
data-driven   … 321 … 592 632 …
intelligence  … 256 … 592 925 …
3-5. List Intersection in Web Search
Multi-word query in web search engine: list intersection of posting lists.
ℐ(𝑥) = inverted list of word 𝑥.

Q. “database system research”
A. ℐ(“database”) ∩ ℐ(“system”) ∩ ℐ(“research”)

ℐ(𝑥)       document IDs
database   … 105 … 592 842 …
system     … 105 … 592 751 …
research   … 105 … 592 642 …
∩          … 105 … 592 …
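The intersection above can be computed by a scan-based merge over the sorted posting lists. A minimal sketch, using the toy posting lists from the slide (the function names are illustrative, not from the paper):

```python
from functools import reduce

def intersect2(a, b):
    """Scan-based (merge) intersection of two sorted ID lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1
        elif a[i] > b[j]:
            j += 1
        else:            # common document ID found
            out.append(a[i])
            i += 1
            j += 1
    return out

def intersect(*lists):
    """k-way intersection by folding the 2-way merge."""
    return reduce(intersect2, lists)

posting = {  # toy posting lists from the slide
    "database": [105, 592, 842],
    "system":   [105, 592, 751],
    "research": [105, 592, 642],
}
result = intersect(*(posting[w] for w in ("database", "system", "research")))
```

For the query “database system research” this yields the document IDs 105 and 592, matching the table above.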
22. Challenge for Optimization
Scenario #1. length ratio 1:1 (|A| = 1M, |B| = 1M)
$(Scan-based) < $(Search-based)
Scenario #2. length ratio 1:1000 (|A| = 1K, |B| = 1M)
$(Scan-based) > $(Search-based)
No method wins in every scenario → query optimization is required.
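The two scenarios can be made concrete with back-of-the-envelope comparison counts. These formulas are rough illustrations, not the paper's cost model: a merge scans both lists, while a search-based method does roughly one logarithmic probe into the longer list per element of the shorter list.

```python
import math

def scan_comparisons(n_a, n_b):
    """Rough comparison count of a scan-based (merge) intersection."""
    return n_a + n_b

def search_comparisons(n_a, n_b):
    """Rough comparison count of a search-based intersection: a
    binary-search-like probe into the longer list for each element of
    the shorter list (constant factors here are illustrative)."""
    short, long_ = min(n_a, n_b), max(n_a, n_b)
    return short * (1 + 2 * math.ceil(math.log2(long_ / short + 1)))

# Scenario #1 (1M : 1M): scanning wins.
# Scenario #2 (1K : 1M): searching wins.
```

Even under this crude model, the winner flips with the length ratio, which is exactly why a query optimizer is needed.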
24-31. Complexity of Cost Estimation
■ Cost of a comparison is not uniform.
■ Modern architectures pipeline the execution of instructions.
■ A branch can block the pipeline.
– The CPU predicts the result of the branch and fills the pipeline speculatively.
– Failure: 10-40 cycles of penalty.
■ Branch misprediction
[Figure: a 5-stage pipeline (Fetch, Decode, Execute, Memory, Write Back) over cycles 𝑡1-𝑡5, executing instructions 6-14 in parallel; when instruction 10 is a branch (JEQ 6), the next instruction (6 or 11?) is unknown, so the CPU predicts it, and on a wrong prediction all speculative pipeline work is lost.]
32. Complexity of Cost Estimation
■ Cost of a comparison is not uniform.
■ Modern architectures pipeline the execution of instructions.
■ A branch can block the pipeline.
– The CPU predicts the result of the branch.
– Failure (branch misprediction): 10-40 cycles of penalty.
■ Cache/TLB misses are expensive.
– From 12 to 200+ cycles of latency.

Cache miss penalties in Intel Skylake:
Access result       Penalty
L1 hit              4 cycles
L1 miss + L2 hit    12 cycles
L2 miss + L3 hit    42 cycles
L3 miss             42 cycles + RAM latency (200+ cycles)
33. Motivations
There is no single winner in all scenarios.
Optimization based on the number of comparisons is not always optimal.
The cost of a comparison is not always the same on modern architectures.
42. Cost-based Query Optimizer
Cost Optimizer
■ Suggests a cost-optimal execution plan for a given query.
Challenges
■ The cost of optimization should be negligible.
■ Thus, it requires a lightweight cost model with high accuracy.
43. Cost Model
Cost model
■ Estimates the cost of each algorithm for a given input.
■ Considers input properties:
– lengths and correlations.
[Figure: example input properties: lengths |𝐴| = 8M and |𝐵| = 16M, correlation |𝐴 ∩ 𝐵| = 4M.]
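The two input properties can be computed directly from the lists. A minimal sketch, where the containment-based correlation is one illustrative choice, not necessarily the measure used in the paper:

```python
def input_properties(a, b):
    """Lengths and an overlap-based correlation of two ID lists.
    Correlation is measured here as containment of the shorter list,
    so |A| = 8M, |B| = 16M, |A ∩ B| = 4M would give correlation 0.5.
    (Illustrative definition, not necessarily the paper's measure.)"""
    overlap = len(set(a) & set(b))
    return len(a), len(b), overlap / min(len(a), len(b))
```

In practice a search engine would estimate these properties from statistics rather than materializing the intersection.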
44. Cost Model
Cost model
■ Estimates the cost of each algorithm for a given input.
■ Considers input properties:
– lengths and correlations.
Challenges
■ Cost depends on hardware properties.
– e.g. cache efficiency, branch misprediction.
■ Analysis of an algorithm is complex.
[Figure: architectures (e.g. Intel Xeon) and algorithms feed the cost model, which combines a unit cost vector with event counts.]
46-49. Cost Model
Procedures
1. Identify expensive events.
– Loops, branches, or memory accesses.
2. Parametrize architecture properties.
– Unit cost: the cost of one execution of an event.
3. Model a function of event counts.
– e.g. # iterations, # mispredictions.
– Computed in real time as 𝑓(𝑞𝑢𝑒𝑟𝑦).
[Figure: the cost model pairs a per-architecture unit cost vector (step 2) with per-query event counts 𝑓(𝑞𝑢𝑒𝑟𝑦) (step 3) over the events identified in step 1.]
50-52. Step 1. Event Identification
Event identification
■ Total cost = sum of event costs.
■ Cost classification (2 levels)
– 1st level: base latency, branch misprediction, memory overhead.
– 2nd level: identified events expected to affect the entire cost.
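The decomposition "total cost = sum of event costs" amounts to a dot product between a per-architecture unit cost vector and a per-query event count vector. A sketch with hypothetical event names and cycle counts (the values are illustrative, not measured):

```python
def total_cost(unit_costs, event_counts):
    """Total cost as Σ unit_cost(e) · count(e) over identified events."""
    return sum(unit_costs[e] * n for e, n in event_counts.items())

# hypothetical unit costs (cycles per event) for one architecture
unit_costs = {"iteration": 4.0, "misprediction": 20.0, "cache_miss": 42.0}
# hypothetical event counts for one query
events = {"iteration": 1000, "misprediction": 50, "cache_miss": 10}
```

The unit costs depend only on the machine, while the event counts depend only on the query, which is what makes the model both portable and cheap to evaluate at query time.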
53. Step 2. Parametrize Unit Cost
Parametrize unit cost
■ Learn on the target machine by using a synthetic test set.
■ Use a gradient descent solver.
[Figure: the unit cost vector of each architecture (e.g. Intel Xeon) is learned from the test set.]
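Learning the unit cost vector from a synthetic test set can be sketched as a least-squares fit by gradient descent: run the test set, record each run's event counts and measured cycles, then solve counts · c ≈ cycles. The data below is synthetic and the solver details are an illustration; the paper's setup may differ.

```python
import numpy as np

def fit_unit_costs(counts, cycles, lr=0.02, steps=2000):
    """Gradient descent on the squared error ||counts @ c - cycles||^2,
    returning the fitted unit cost vector c."""
    c = np.zeros(counts.shape[1])
    for _ in range(steps):
        grad = counts.T @ (counts @ c - cycles) / len(cycles)
        c -= lr * grad
    return c

# synthetic test set: rows are runs, columns are event counts per run
counts = np.array([[10.0, 2.0], [3.0, 7.0], [8.0, 1.0]])
true_c = np.array([1.5, 20.0])   # "ground truth" unit costs (cycles)
cycles = counts @ true_c         # measured cost of each run
```

With enough varied runs the fit recovers one unit cost per event type for the machine at hand.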
54-56. Step 3. Event Count Estimation
■ Understanding the displacement (𝜹) distribution is key to estimating the event counts.
– Displacement: the distance the cursor moved.
[Figure: searching for pivot 25 in the list … 13 14 19 20 21 23 26 28 …; the displacement is 6 for both a 2-Merge search and a 2-Gallop search. For the 2-Gallop search: # comparisons = 5, # references = 6, # cache misses = 0, # MISP = 2.]
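The 2-Gallop search in the figure can be sketched as follows. For the slide's example (pivot 25 against … 13 14 19 20 21 23 26 28 …) it lands 6 positions ahead with 5 comparisons, matching the displacement and comparison counts shown; the event instrumentation here is an illustration, not the paper's exact accounting.

```python
def gallop_search(lst, start, pivot):
    """Galloping search: find the first index >= start with
    lst[idx] >= pivot, assuming lst[start] < pivot (the cursor has not
    yet passed the pivot). Doubles the step until it overshoots, then
    binary-searches the last bracket. Returns (index, comparison count)."""
    comparisons = 0
    lo, step = start, 1
    while lo + step < len(lst):          # gallop phase
        comparisons += 1
        if lst[lo + step] < pivot:
            lo, step = lo + step, step * 2
        else:
            break
    # binary phase over the bracket (lo, lo + step]
    b_lo, b_hi = lo + 1, min(lo + step + 1, len(lst))
    while b_lo < b_hi:
        mid = (b_lo + b_hi) // 2
        comparisons += 1
        if lst[mid] < pivot:
            b_lo = mid + 1
        else:
            b_hi = mid
    return b_lo, comparisons
```

Given the displacement 𝜹 of a search, counts like these can be written as functions of 𝜹, which is why the displacement distribution drives the event count estimates.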
57-58. Step 3. Event Count Estimation
■ Understanding the displacement (𝜹) distribution is key to estimating the event counts.
– e.g. memory reference count of 2-Gallop: 2|𝐿1| · Σ_{𝛿=0..|𝐿2|} 𝑃(𝐷 = 𝛿) · (1 + log2 𝛿)
■ Model the displacement distribution by using
– the negative hypergeometric (NHG) distribution and a Markov chain.
[Figure: displacement distribution (2-way, |𝐴| : |𝐵| = 1:1), alongside the 2-Gallop search example (pivot 25, displacement = 6, # comparisons = 5, # references = 6, # cache misses = 0, # MISP = 2).]
59. Experimental Settings
Machine
■ SandyBridge i7-3820 3.60GHz
■ IvyBridge i7-3770K 2.93GHz
Synthetic Dataset
■ Number of lists: 2 to 4
■ Length ratio (min:max): 1:1 to 1:1024
■ Correlation: 0 to 1
Set of Algorithms in Optimizer
■ 2-way
– 2-Merge, 2-Gallop, 2-SIMD, STL
– SIMDV1, SIMDV3, SIMD Gallop [1]
– SIMD Inoue [2]
■ k-way: k-Merge, k-Gallop
■ Build a cost model for each algorithm.

[1] D. Lemire et al. SIMD compression and the intersection of sorted integers. Software: Practice and Experience, 2016.
[2] H. Inoue et al. Faster set intersection with SIMD instructions by reducing branch mispredictions. VLDB 2014.
60. Accuracy of the Cost Model - 2-way and k-way
[Figures: accuracy of the 2-way algorithm cost models (left) and of the k-way algorithm cost models (right).]
61. Accuracy of the Cost Model - Adaptive to machines
[Figures: accuracy of the 2-way algorithm cost model on SandyBridge (left) and on IvyBridge (right).]
62. Optimizer Efficiency
Comparison of representative algorithms on the four-list synthetic dataset.
(Bold: best of all; italic: best among single algorithms)
63. Conclusion
■ Cost analysis and estimation should take into account the properties of the architecture.
■ Probabilistic analysis of an algorithm's operations yields more accurate cost estimates.
■ Based on these two observations, we propose a cost-based optimizer equipped with a lightweight and accurate cost model.
66. Top-down Analysis
■ Hardware profiling method.
■ Finds bottlenecks/hotspots.
■ Cost decomposition into 4 major cost factors.
■ Based on the unit utilization at each cycle.

Backend ↓ \ Frontend →   Stall            Fully Utilized
Stall                    Bad speculation  Backend bound
Fully Utilized           Frontend bound   Retiring (no overhead)
67. Step 1. Event Identification
■ Cycle accounting (e.g. top-down analysis)
– hard to identify the causes of cost, thus hard to use for cost estimation.
■ We resolve this mismatch by using our cause-based pivoting.
69. Identify Expensive Events - Example: 2-Merge
2-Merge
■ Base latency (α)
■ Branch misprediction (β)
– Significant at low length ratios.
■ Memory stalls (γ)
– Marginal (due to sequential scan).
[Figure: the 2-Merge loop annotated with base latency 𝛼 and the three branches 𝛽: >, <, =.]
70. Probabilistic Analysis of Algorithms - 2-way displacement analysis
𝛿 can be modeled by the negative hypergeometric (NHG) distribution.
NHG(k; N, K, r) = probability that the k-th “success” shows up after r “failures”,
drawing from a population of N with K successes and N − K failures.
e.g. NHG(4; 8, 4, 2) = probability that the LA Dodgers beat the Milwaukee Brewers 4-2 in the NLCS
(assumption: the win distribution is uniform).
[Figure: elements of A are “failures” and elements of B − A are “successes”; the displacements 𝛿 of successive searches for x, y, and z.]
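The NHG probability in the example can be computed directly. A sketch under the slide's uniform-win assumption; note that a best-of-seven series stops at the 4th win, which is exactly the "k-th success right after r failures" event:

```python
from math import comb

def nhg_pmf(k, N, K, r):
    """P[the k-th success appears immediately after exactly r failures]
    when drawing without replacement from N items containing K successes."""
    if k > K or r > N - K or k + r > N:
        return 0.0
    # first k+r-1 draws hold k-1 successes and r failures ...
    p_prefix = comb(K, k - 1) * comb(N - K, r) / comb(N, k + r - 1)
    # ... and the next draw is a success
    p_next = (K - (k - 1)) / (N - (k + r - 1))
    return p_prefix * p_next
```

For the slide's example, nhg_pmf(4, 8, 4, 2) evaluates to 1/7, and the probabilities over r = 0..4 sum to 1, as expected for a distribution over the number of failures before the 4th success.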
71. Probabilistic Analysis of Algorithms - k-way displacement analysis
Displacement 𝛿𝑖 is affected by each of the k searches:
𝛿𝑖 = Σ_{𝑗=1..𝑘} 𝛿𝑖𝑗
Decompose a displacement 𝛿𝑖 into several subdisplacements 𝛿𝑖𝑗.
Model each 𝛿𝑖𝑗 through a Markov chain:
if the search succeeds, 𝑣𝑗−1 = 𝑣𝑗 and 𝛿𝑖𝑗 = 0;
otherwise, 𝑣𝑗−1 ≠ 𝑣𝑗 and 𝛿𝑖𝑗 follows the NHG
(similar to the 2-way case).
Editor's Notes
Hi, I’m SungHwan from POSTECH Korea, and it is a great honor to present my work at VLDB.
Today, I’ll talk about cost-based optimization for list intersection in modern architecture.
Web search engines access the web corpus using posting lists and use intersection algorithms to process multi-keyword queries.
Each posting list contains the IDs of the documents that contain the keyword assigned to it.
For example, the posting list for the keyword ‘database’ contains the IDs of the documents whose text contains the word ‘database’.
So when a multi-keyword query comes in,
the search engine computes the query by
intersecting posting lists corresponding to the given query keywords.
The list intersection can be computed by a sequence of searches, like searching for every element of A in B.
For example, we search for the first value of A, 25, in the other list, and if the search fails,
we discard all of the preceding elements,
and we move on to the next element, 37.
If the value is found in the other list, we add it to the result,
and we repeat again and again until any cursor reaches the end of its list.
So the main algorithm design issue is how we search for an element in an array list.
There are two main categories of list intersection algorithms;
(clk)
the first is scan-based approaches that scan through the elements one by one,
and the other is search-based approaches that try to jump over a few elements to reduce the number of comparisons.
Regardless of which algorithm we use, the distance the cursor moves will not change for a specific search.
In this case, the distances are six in both searches.
We call this distance the displacement; it is the quantity that we want to analyze, which will be covered later.
It might sound like a search-based algorithm skipping some elements should be clearly better than a scan-based algorithm going through all items, but the problem is not that simple.
It means that there’s no single winner among intersection algorithms in all scenarios.
For example, in scenario #1 with one million items in the two lists, the scan-based intersection approach has a lower cost than a search-based approach in general.
On the other hand, in scenario #2 with only one thousand items in one list, and one million items in the other, a search-based approach is significantly faster than a scan-based algorithm.
So to propose an optimal plan for computing a list intersection, we should understand the cost of each algorithm for the given input.
However, the cost estimation is also challenging, because the cost of a comparison is not uniform, so we cannot estimate the cost of a list intersection by estimating the number of comparisons.
That is because modern architectures pipeline the execution of instructions.
For example, in a 5-stage pipelined architecture, an instruction is executed across 5 cycles in five stages,
and in each cycle, the architecture tries to fully utilize its units by parallelizing instruction execution.
So at time 1, not only is instruction 10 fetched, but the prior instructions 6 to 9 are already in the other pipeline stages,
and similarly the architecture tries to fill the pipeline in every cycle.
This works ideally if all the pipeline stages complete within a cycle and we have a fixed sequence of instructions to execute.
However, if there is a conditional branch, the problem is much more complicated.
For example, if instruction 10 is a branch operation, then what is the next instruction after 10?
It is unclear before the branch is resolved; however, like an evil CEO, the modern architecture does not let its workers idle,
which means that it predicts the result of the branch and applies the prediction to the pipeline.
So the pipeline is filled based on the predicted result.
However, a problem arises if the prediction turns out to be wrong:
in this case, we lose all of the pipelined work and pay a penalty of around 10 to 40 cycles.
We call this situation a branch misprediction and this penalty the branch misprediction penalty.
The other overhead that breaks the scalability of the architecture is memory overhead, mainly caused by cache or TLB misses.
For example, on the Skylake architecture, we can access memory in only 4 cycles in the best case; however, in the worst case, we spend several hundred cycles.
So we identify two challenges as follows: the first is that no algorithm wins in all scenarios.
The second is that the cost of a comparison is not always the same.
So the computation plan should be carefully selected by considering the properties of the given query and the server machine.
The role of the query optimizer is to suggest a plan for computing a given query.
For example, for the query “database systems”, the optimizer suggests a plan to compute the query with
the fastest algorithm for the given input, like 2-Merge,
and in some other scenario it can suggest using another algorithm.
Rather than working on a specific architecture with a fixed set of algorithms, we have two main objectives:
the first is to propose an optimization method that can adapt to different architectures, like state-of-the-art AMD or Intel chips;
the second is to propose a general approach to algorithm analysis for cost estimation that is applicable to future algorithms.
So the objective of the cost optimizer can be formally defined as providing a cost-optimal execution plan for the given query.
The main challenge is that the run-time cost of the optimizer should be negligible compared to the benefit from the optimization, so it requires a highly accurate cost model with little computational overhead.
We develop a lightweight cost model that takes lists as input and returns the expected cost by considering two properties: one is the lengths of lists and the other is the correlations between lists.
To build high accuracy cost model,
we need to model both hardware and algorithm properties.
(Help the audience understand that the unit cost is some kind of vector, and so on.)
So we will introduce the procedure of generating cost model.
The first one is to identify expensive events from the algorithm implementation,
such as loops, branches or memory references.
And then, parametrize architecture properties into the unit cost which represents the cost of an event execution.
Then we model the event count function, which accepts a query and returns the vector of event counts, such as the number of iterations of a loop or the number of mispredictions of a specific branch.
As the event counts rely on the input properties, the event count vector is computed at query time.
We call this framework the cost model.
So we formulate the model by decomposing the total cost into the sum of event costs.
In detail, we first divide the cost of an application into three important factors: base latency, and the two overheads of misprediction and memory access.
// Branch misprediction and memory overhead should be covered earlier.
Then, we further classify the cost into events according to which are expected to affect the entire cost significantly.
And we parametrize the unit cost of each event on the given machine by learning from the synthetic test set.
To reduce error, we adopt a gradient descent solver in the learning stage.
The next step is to compute the counts of the identified events by understanding the behavior of the algorithms.
As mentioned earlier, there are several algorithms, but even if the algorithms differ, the displacement, the distance of a cursor move, is the same.
We nominate this displacement as the key feature of event count analysis.
For example, for a given displacement of a single search, we can calculate the event counts of the search based on the displacement,
and we can further formulate a function of each count corresponding to the displacement.
Then we can formulate the estimated event counts by understanding the displacement distribution.
For example, we can compute the total memory reference count by using the distribution and the function of memory references corresponding to the displacement.
And we formulate the distribution by using the negative hypergeometric distribution and a Markov chain.
This graph shows an example of the displacement distribution for two equal-length lists.
So the average displacement is 1 in this case, but in 50% of the searches the cursor does not move forward to the next value.
See our paper for further details about the modeling.
What is the difference between the left and right graphs? If we are not going to explain it, it might be better not to show them at all.
We must not give the misimpression that the right one is continuous.
If we show them, highlight and explain only the difference we want to show; emphasize only the points being explained.
Gyeongjae: distinguish the graphs by red and blue? The two graphs have the same shape but different scales.. ??
Next, we demonstrate our work.
As our goal is to show the efficiency, effectiveness, and adaptiveness of our method, we test it in a wide range of scenarios with various state-of-the-art algorithms on two machines.
We first introduce the accuracy of the cost model.
As we can see, the estimated cost closely follows the actual cost over a wide range of length ratios.
This shows that we successfully modeled most of the hardware-related features.
Also, our approach can adapt to different machines very well.
The left is the result on the SandyBridge machine, and the right is the result on the IvyBridge machine.
We also demonstrate that our cost-based optimizer is clearly better than using any single algorithm in every setting.
So the key messages are that cost analysis should reflect the characteristics of the architecture, such as branch misprediction and memory references, and that
analyzing the behavior of the algorithm can provide much more accurate cost estimates.
Based on these observations, we propose a new cost-based optimizer equipped with a lightweight and accurate cost model.
Thank you very much for listening to my presentation, and please visit my poster session today for further discussion.
Thank you!
The analysis of the computation is very complex.
The state-of-the-art analysis method, named top-down analysis, is suitable for reverse engineering a computation, such as finding bottlenecks, but is not suitable for cost estimation.
Thus, we translate this method into our new event-based language.
New cost classifications are required to resolve this mismatch, and we introduce a new kind of cost factor by using cause-based pivoting.
So we first classify the cost into base latency, branch misprediction, and memory overhead.
In our work, for each algorithm, we first analyze the cost of the algorithm,
then find the cost parameters of the algorithm.
With this definition, we can get the distribution from the NHG model.
The NHG model provides the probability of the k-th “success” appearing after r “failures”.
So we can compute the probability that SK beats Doosan in the same way.
Next, the distribution in the k-way computation is much more complex, because we may need to accumulate results across the whole round-robin process.
To solve this problem, we divide the displacement into k subdisplacements and solve each subproblem by using the Markov chain.
The Markov chain provides the probabilities of a successful and an unsuccessful search; for an unsuccessful search, we can get the distribution by NHG, and otherwise the subdisplacement is 0.