Engineering Java 7's Dual Pivot Quicksort Using MaLiJAn

Sebastian Wild Markus E. Nebel Raphael Reitzig Ulrich Laube
[wild, nebel, r_reitzi, laube] @cs.uni-kl.de

Computer Science Department
University of Kaiserslautern

January 7, 2013
Meeting on Algorithm Engineering & Experiments 2013

Sebastian Wild Java 7’s Dual Pivot Quicksort 2012/09/11 1 / 23

Background

Since Java 7: new dual pivot Quicksort in JRE library
Basic algorithm by Vladimir Yaroslavskiy
Optimizations by Jon Bentley, Joshua Bloch and others
(see java.core-libs.devel mailing list)
Motivated by experience with classic Quicksort
Validated by running time benchmark

In this talk:
Can we exploit special properties of dual pivot Quicksort?
Can we get more insight than running time measurements?
. . . stay tuned


Java 7’s Dual Pivot Quicksort – Example

Yaroslavskiy’s Dual Pivot Quicksort
(used in Oracle’s Java 7 Arrays.sort(int[]))

Input: 3 5 1 8 4 7 2 9 6, with pivots p = 3 (first element) and q = 6 (last element).

Select two elements as pivots.
Only value relative to pivot counts: small (< p), medium (between p and q), large (> q).
Invariant: small elements | medium elements | not yet classified | large elements,
with two pointers moving right (→ →) and g moving left (←).

Step  Array              Action
1     3 5 1 8 4 7 2 9 6  k at 5: A[k] is medium, go on.
2     3 5 1 8 4 7 2 9 6  k at 1: A[k] is small, swap to left.
3     3 1 5 8 4 7 2 9 6  Small element swapped to left end.
4     3 1 5 8 4 7 2 9 6  k at 8: A[k] is large, find swap partner: g skips over large elements.
5     3 1 5 2 4 7 8 9 6  A[k] is large: swap A[k] and A[g].
6     3 1 5 2 4 7 8 9 6  A[k] is old A[g], small: swap to left.
7     3 1 2 5 4 7 8 9 6  Small element swapped to left end.
8     3 1 2 5 4 7 8 9 6  k at 4: A[k] is medium, go on.
9     3 1 2 5 4 7 8 9 6  k at 7: A[k] is large, find swap partner: g skips over large elements.
10    3 1 2 5 4 7 8 9 6  g and k have crossed! Swap pivots in place.
11    2 1 3 5 4 6 8 9 7  Partitioning done!
12    2 1 3 5 4 6 8 9 7  Recursively sort three sublists.
13    1 2 3 4 5 6 7 8 9  Done.
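
The walk-through above can be reproduced with a short sketch. This is a minimal version of the basic partitioning scheme only, assuming the first and last element serve as pivots; it is not the JRE7 library code, which adds pivot sampling and further optimizations. The class and method names are made up for illustration.

```java
import java.util.Arrays;

class DualPivotSketch {

    // Partitions a[left..right] around two pivots and returns
    // the final pivot positions {posP, posQ}.
    static int[] partition(int[] a, int left, int right) {
        if (a[left] > a[right]) swap(a, left, right); // ensure p <= q
        int p = a[left], q = a[right];
        int l = left + 1;   // next free slot for a small element
        int g = right - 1;  // next free slot for a large element
        int k = l;          // element currently inspected
        while (k <= g) {
            if (a[k] < p) {                     // small: swap to left
                swap(a, k, l);
                l++;
            } else if (a[k] > q) {              // large: find swap partner
                while (a[g] > q && k < g) g--;  // g skips over large elements
                swap(a, k, g);
                g--;
                if (a[k] < p) {                 // old A[g] is small: swap to left
                    swap(a, k, l);
                    l++;
                }
            }                                   // medium: go on
            k++;
        }
        l--;
        g++;
        swap(a, left, l);                       // swap pivots in place
        swap(a, right, g);
        return new int[] { l, g };
    }

    // Recursively sorts the three sublists left of, between and right of the pivots.
    static void sort(int[] a, int left, int right) {
        if (left >= right) return;
        int[] pos = partition(a, left, right);
        sort(a, left, pos[0] - 1);
        sort(a, pos[0] + 1, pos[1] - 1);
        sort(a, pos[1] + 1, right);
    }

    static void swap(int[] a, int i, int j) {
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }

    public static void main(String[] args) {
        int[] a = {3, 5, 1, 8, 4, 7, 2, 9, 6};
        partition(a, 0, a.length - 1);
        System.out.println(Arrays.toString(a)); // [2, 1, 3, 5, 4, 6, 8, 9, 7]
        sort(a, 0, a.length - 1);
        System.out.println(Arrays.toString(a)); // [1, 2, 3, 4, 5, 6, 7, 8, 9]
    }
}
```

On the example input, one partitioning step yields the same array as the slides before the recursive calls.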



Control Flow Graph of Partitioning Loop

[Figure: control flow graph of the partitioning loop with basic blocks 1 to 12; "bc" is the number of bytecode instructions in a block. Five cycles through the graph are highlighted one after another.]

Block  bc  Content
1      3   comparison of k and g (loop condition)
2      7   t := A[k]
7      2   g := g − 1
8      5   A[g] < p
9      14  A[k] := A[ℓ]; A[ℓ] := A[g]; ℓ := ℓ + 1
10     6   A[k] := A[g]
11     5   A[g] := t; g := g − 1
12     2   k := k + 1

Further conditions in the graph: comparison of t and q, k < g.

Cycle  A[k]    ∆(g − k)  Bytecode Instructions
1      small   1         24
2      medium  1         15
3      large   1         10
4      large   2         44
5      large   2         36
Asymmetry

[Figure: control flow graph of the partitioning loop.]

Algorithm is asymmetric:
[…] ones often
Cycles chosen by classes small, medium or large
Probability for classes depends on pivot values

Maybe we can “influence pivot values accordingly”?

Pivot Sampling

Well-known optimization for classic Quicksort: median-of-three
pivot closer to median of whole list

In JRE7 Quicksort implementation: natural extension for 2 pivots:

tertiles-of-five
pivots closer to tertiles of whole list

9 other possibilities to pick p and q out of 5 elements.
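
As a rough illustration of pivot sampling, the sketch below sorts a sample of five elements and picks two of them by rank. It assumes that a scheme written (i, j) means "take the i-th and j-th smallest sample element", so that tertiles-of-five is (2, 4) and JRE7(1,3) is (1, 3); this reading and all names in the code are assumptions, and the way the library actually chooses the five sample positions is not reproduced here.

```java
import java.util.Arrays;

class PivotSamplingSketch {

    // Returns {p, q}: the i-th and j-th smallest element (1-based, i < j) of the sample.
    static int[] choosePivots(int[] sample, int i, int j) {
        int[] s = sample.clone();
        Arrays.sort(s);
        return new int[] { s[i - 1], s[j - 1] };
    }

    // Number of ways to choose two ranks i < j out of k sample elements.
    static int schemes(int k) {
        return k * (k - 1) / 2;
    }

    public static void main(String[] args) {
        int[] sample = {7, 2, 9, 4, 5};
        System.out.println(Arrays.toString(choosePivots(sample, 2, 4))); // [4, 7]
        System.out.println(Arrays.toString(choosePivots(sample, 1, 3))); // [2, 5]
        System.out.println(schemes(5) - 1); // 9 alternatives to tertiles-of-five
    }
}
```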



Optimizing Pivot Sampling

Which are “good” pivot selection schemes?
Is the symmetric choice best possible?

Need objective function to optimize
Typical approaches to judge efficiency:
A Count number of basic operations.
(Here: number of executed Java Bytecode instructions.)
B Measure total running time.


Optimizing Pivot Sampling
Relative performance of pivot sampling compared to tertiles-of-ﬁve:
Pivot Selection Scheme    A [1]    B [2]

JRE7
+5.14% +0.80%

JRE7(1,3) −1.85% −0.44%

+3.34% −0.42%

- (stack overflow!) +10.6%

+2.48% +2.73%

+11.3% +3.31%

+12.7% +3.29%

+16.4% +2.48%

+39.0% +5.87%

[1] Average number of executed bytecodes on almost sorted lists of length 10^5.
[2] Average running time on random permutations of length 10^6.

Methods


Model and Method
What made JRE7(1,3) faster than JRE7?
. . . hard to tell from total time/bytecodes.
Need a more detailed model of the program.

Idea: Decompose along control flow graph!
View program as Markov chain over blocks
Termination via absorbing state
Transition i → j has probability p_{i→j}^{(n)} depending on input size n
Visiting block i incurs constant costs c(i)
Total cost is sum of block costs

[Figure: control flow graph with basic blocks 1 to 12.]

Expected costs of program = expected costs of run of Markov chain
Latter easy to compute
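
One standard way to compute the expected costs of such a chain: the expected number of visits v(i) to each block solves the linear system v = e_start + Pᵀ v over the non-absorbing blocks, and the expected cost is the sum of v(i) · c(i). The sketch below does this with plain Gauss-Jordan elimination on a made-up two-block loop; it is only an illustration of the model, not MaLiJAn's implementation, and all names and numbers are hypothetical.

```java
class MarkovCostSketch {

    // p[i][j]: probability of moving from block i to block j (non-absorbing blocks only);
    // the missing mass 1 - sum_j p[i][j] leads to the absorbing end state.
    static double expectedCost(double[][] p, double[] cost, int start) {
        int m = cost.length;
        // Augmented system (I - P^T) v = e_start.
        double[][] a = new double[m][m + 1];
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < m; j++) {
                a[i][j] = (i == j ? 1.0 : 0.0) - p[j][i];
            }
            a[i][m] = (i == start) ? 1.0 : 0.0;
        }
        // Gauss-Jordan elimination with partial pivoting.
        for (int col = 0; col < m; col++) {
            int piv = col;
            for (int r = col + 1; r < m; r++) {
                if (Math.abs(a[r][col]) > Math.abs(a[piv][col])) piv = r;
            }
            double[] tmp = a[col]; a[col] = a[piv]; a[piv] = tmp;
            for (int r = 0; r < m; r++) {
                if (r == col) continue;
                double f = a[r][col] / a[col][col];
                for (int c = col; c <= m; c++) a[r][c] -= f * a[col][c];
            }
        }
        // Expected visits v(i) = a[i][m] / a[i][i]; total cost = sum of v(i) * c(i).
        double total = 0.0;
        for (int i = 0; i < m; i++) total += (a[i][m] / a[i][i]) * cost[i];
        return total;
    }

    public static void main(String[] args) {
        // Block 0: loop condition (cost 3), continues with probability 0.9, else terminates.
        // Block 1: loop body (cost 7), always returns to block 0.
        double[][] p = { {0.0, 0.9}, {1.0, 0.0} };
        double[] cost = { 3.0, 7.0 };
        // Expected visits are 10 and 9, so the expected cost is about 10*3 + 9*7 = 93.
        System.out.println(expectedCost(p, cost, 0));
    }
}
```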

Maximum Likelihood Analysis
How to determine block costs and transition probabilities?

Transition Probabilities
1 Count transitions in executions on sample data
  Allows arbitrary input distributions!
2 Take relative frequency as estimate for p_{i→j}^{(n)}
3 Extrapolate p_{i→j}^{(n)} to a function p_{i→j}(n) in n

Block Costs
We consider two cost measures:
A bc(i) = number of Bytecode instructions in block i.
B t(i) = running time of block i

All steps are automated in our tool MaLiJAn [3]

[3] http://wwwagak.cs.uni-kl.de/malijan.html
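
The first two steps for transition probabilities can be sketched as follows: count how often each transition i → j occurs in a recorded sequence of visited blocks, then divide by the number of times block i was left. The trace in the example is hypothetical, and the code is an illustration of the idea rather than part of MaLiJAn.

```java
class TransitionEstimateSketch {

    // trace: sequence of visited block ids in 0..blocks-1.
    // Returns the relative frequency of each transition i -> j.
    static double[][] estimate(int[] trace, int blocks) {
        long[][] count = new long[blocks][blocks];
        long[] left = new long[blocks]; // how often block i was left
        for (int t = 0; t + 1 < trace.length; t++) {
            count[trace[t]][trace[t + 1]]++;
            left[trace[t]]++;
        }
        double[][] p = new double[blocks][blocks];
        for (int i = 0; i < blocks; i++) {
            for (int j = 0; j < blocks; j++) {
                if (left[i] > 0) p[i][j] = (double) count[i][j] / left[i];
            }
        }
        return p;
    }

    public static void main(String[] args) {
        int[] trace = {0, 1, 0, 1, 0, 2}; // hypothetical run
        double[][] p = estimate(trace, 3);
        System.out.println(p[0][1] + " " + p[0][2] + " " + p[1][0]);
    }
}
```

Repeating this for several input sizes n gives the data points from which a function in n can be extrapolated.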

Block Sampling
Running times t(i) in B are typically a few nanoseconds
→ direct measurement not possible.

Idea: Sampling Based Approach

[Figure: timeline of executed basic blocks on a nanosecond scale, with samples of the current block taken on a microsecond scale.]

In regular intervals, store current basic block (concurrently)
We observe only a small fraction of all blocks → repeat execution
Relative frequencies of observed samples approach
relative running time contribution of blocks.

Count in separate run how often block i gets executed in total
Together, this allows us to compute t(i)
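
A minimal sketch of the final computation, assuming the total running time of the sampled execution is known: the share of samples that hit block i estimates its share of the total time, and dividing by its execution count gives the time per visit. The parameter names and the numbers are hypothetical.

```java
class BlockSamplingSketch {

    // samples[i]:    how often block i was observed by the sampler
    // executions[i]: how often block i was executed in total (from a separate run)
    // totalNanos:    total running time of the sampled execution (assumed to be known)
    static double[] blockTimes(long[] samples, long[] executions, double totalNanos) {
        long all = 0;
        for (long s : samples) all += s;
        double[] t = new double[samples.length];
        for (int i = 0; i < samples.length; i++) {
            if (all > 0 && executions[i] > 0) {
                // time share of block i, spread over all its executions
                t[i] = totalNanos * samples[i] / all / executions[i];
            }
        }
        return t;
    }

    public static void main(String[] args) {
        long[] samples = {30, 70};
        long[] executions = {1000, 500};
        double[] t = blockTimes(samples, executions, 10_000.0);
        System.out.println(t[0] + " ns, " + t[1] + " ns"); // about 3 ns and 14 ns
    }
}
```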

A Decent Word of Caution

1 Determining current block adds a small systematic error.
2 Java Specialty: Just-in-time Compilation
Running time heavily influenced by HotSpot JIT compiler
JIT collects profiling information at beginning
First input determines which optimizations are found
. . . more details in the paper


Input Distributions

We consider 2 different input distributions:
1 Random Permutations
well-studied in literature
2 Almost Sorted Lists
Random model by Brodal et al. [4]:
A[i] chosen i.i.d. uniform in [i − d, i + d]
for constant d (here d = 100)

[4] G. Brodal, R. Fagerberg, G. Moruz: On the Adaptiveness of Quicksort,
J. Exp. Algorithmics 12 (2008), pp. 3.2:1–3.2:20
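
A generator for the almost sorted model might look like the sketch below. It assumes integer keys drawn uniformly from the inclusive range [i − d, i + d]; whether the experiments used integers and inclusive bounds is an assumption here, and the names are made up.

```java
import java.util.Random;

class AlmostSortedSketch {

    // A[i] is drawn independently and uniformly from {i - d, ..., i + d}.
    static int[] almostSorted(int n, int d, Random rnd) {
        int[] a = new int[n];
        for (int i = 0; i < n; i++) {
            a[i] = i - d + rnd.nextInt(2 * d + 1);
        }
        return a;
    }

    public static void main(String[] args) {
        int[] a = almostSorted(100_000, 100, new Random(42));
        System.out.println(a[0] + " " + a[50_000] + " " + a[99_999]);
    }
}
```

Every element ends up at most d positions' worth of value away from its index, so the list is sorted up to local disorder.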

Results


Asymptotic Expected Costs
Measure          Algorithm   Random Permutations    Almost Sorted Lists
Bytecodes (A)    JRE7        19.40 n ln n + 51 n    15.10 n ln n + 68 n
                 JRE7(1,3)   18.73 n ln n + 62 n    13.52 n ln n + 85 n

[Plots: bytecodes normalized by n ln n against n from 10^5 to 10^8 (logarithmic axis), for JRE7 and JRE7(1,3) and both input distributions.]
model fits data well!

asymptotically, JRE7(1,3) executes fewer Bytecodes!

Can we explain why?
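
A small worked example shows what "asymptotically" means for the fitted bytecode formulas 19.40 n ln n + 51 n (JRE7) and 18.73 n ln n + 62 n (JRE7(1,3)) on random permutations, and 15.10 n ln n + 68 n versus 13.52 n ln n + 85 n on almost sorted lists. Two costs of the form a · n ln n + b · n are equal when (a1 − a2) ln n = b2 − b1. The helper names below are made up for illustration.

```java
class CrossoverSketch {

    // cost(n) = a * n ln n + b * n
    static double cost(double a, double b, double n) {
        return a * n * Math.log(n) + b * n;
    }

    // Input size at which a1*n ln n + b1*n equals a2*n ln n + b2*n (requires a1 != a2).
    static double crossover(double a1, double b1, double a2, double b2) {
        return Math.exp((b2 - b1) / (a1 - a2));
    }

    public static void main(String[] args) {
        // Bytecodes, random permutations: JRE7 vs. JRE7(1,3)
        System.out.println(crossover(19.40, 51, 18.73, 62));
        // Bytecodes, almost sorted lists: JRE7 vs. JRE7(1,3)
        System.out.println(crossover(15.10, 68, 13.52, 85));
    }
}
```

With these coefficients the break-even point is on the order of 10^7 elements for random permutations and between 10^4 and 10^5 elements for almost sorted lists; beyond it, the smaller leading coefficient of JRE7(1,3) wins.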


Cycle Costs
[Bar chart: cost of Cycle 1 to Cycle 5 in units of cost(Cycle 5), measured in bytecodes (bc); each cycle is shown as a path in the control flow graph.]

In #Bytecodes:
Cycle 3 cheapest
Cycle 1 most expensive
of all cycles


Asymptotic Cycle Frequencies
[Bar chart: asymptotic frequency of Cycle 1 to Cycle 5 in units of n ln n + O(n), for JRE7 and JRE7(1,3) on random permutations and on almost sorted lists.]

JRE7(1,3) executes cheap Cycle 3 more often
and expensive Cycle 1 less often than JRE7.
Asymptotically, fewer executed Bytecodes!


Running Time Results

How about running time?

HotSpot JIT compiler has two modes
-Xcomp   JIT compiler without profiling information
warmup   profiling JIT with warmup on fixed input to trigger JIT compilation
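
A warmup measurement could be set up roughly as in the sketch below: sort a fixed input repeatedly so that the profiling JIT compiles the sorting code, then time the input of interest. The number of warmup runs, the input sizes and all names are assumptions for illustration, and the library's `Arrays.sort` stands in for the algorithm variants under study. For the other mode, the same program can be started with the HotSpot option `-Xcomp` and without warmup runs.

```java
import java.util.Arrays;
import java.util.Random;

class WarmupSketch {

    // Sorts a copy of the input and returns the elapsed time in nanoseconds.
    static long timeSort(int[] input) {
        int[] copy = input.clone();
        long start = System.nanoTime();
        Arrays.sort(copy);
        return System.nanoTime() - start;
    }

    // Runs the sort on a fixed warmup input first, then measures the actual input.
    static long measureWithWarmup(int[] warmupInput, int warmupRuns, int[] input) {
        for (int r = 0; r < warmupRuns; r++) {
            timeSort(warmupInput);
        }
        return timeSort(input);
    }

    public static void main(String[] args) {
        Random rnd = new Random(1);
        int[] warmupInput = rnd.ints(10_000).toArray();
        int[] input = rnd.ints(1_000_000).toArray();
        long nanos = measureWithWarmup(warmupInput, 1_000, input);
        System.out.println(nanos + " ns");
    }
}
```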

Do Block Sampling for both modes

Should we expect same block running times?
. . . stay tuned


Cycle Costs

[Bar chart: cost of Cycle 1 to Cycle 5 in units of cost(Cycle 5), for bc and for running times t_JRE7 and t_JRE7(1,3) with -Xcomp and t_JRE7 with warmup.]

measures agree qualitatively
but: smaller difference



Asymptotic Expected Costs

Measure           Algorithm   Random Permutations    Almost Sorted Lists
Bytecodes (A)     JRE7        19.40 n ln n + 51 n    15.10 n ln n + 68 n
                  JRE7(1,3)   18.73 n ln n + 62 n    13.52 n ln n + 85 n
time -Xcomp (B)   JRE7        20.10 n ln n + 26 n    11.95 n ln n + 54 n
                  JRE7(1,3)   19.95 n ln n + 32 n    11.09 n ln n + 64 n
time warmup (B)   JRE7        10.02 n ln n + 9 n     5.52 n ln n + 13 n
                  JRE7(1,3)   11.39 n ln n + 15 n    5.38 n ln n + 19 n

[Plots: running time normalized by n ln n against n from 10^5 to 10^8 (logarithmic axis), for both input distributions and both JIT modes.]

JIT without profiling: asymptotically, JRE7(1,3) faster!
JIT with profiling and warmup: asymptotically, JRE7(1,3) slower!

What changes with profiling enabled?


Cycle Costs

[Bar chart: cost of Cycle 1 to Cycle 5 in units of cost(Cycle 5), for bc and for running times t_JRE7 and t_JRE7(1,3), each with -Xcomp and with warmup.]

measures agree qualitatively
except for JRE7(1,3) with profiling JIT!

For JRE7(1,3), the code created by profiling JIT
for Cycle 3 is much slower than for JRE7!
That’s the place to focus future research on.


Conclusion

Summary
Java 7’s dual pivot Quicksort is highly asymmetric.
JRE7(1,3) executes fewer Bytecodes than JRE7.
Almost sorted inputs amplify impact of pivot sampling.
Oracle’s profiling JIT compiler creates different code for JRE7(1,3),
which potentially overcompensates gains.
Control flow graph decomposition supported by MaLiJAn makes
difference in code efficiency directly visible.

Open Problems
? What causes different costs for Cycle 3?
? Are the differences idiosyncrasies of Java / Oracle’s JRE?
? Performance of JRE7(1,3) on other inputs, especially with equal keys?


