Parallel-lazy Performance: Java 8 vs Scala vs GS Collections

Video and slides synchronized, MP3 and slide download available at http://bit.ly/1zlqvAN.

Sponsored by Goldman Sachs. Java 8 has Streams, Scala has parallel collections, and GS Collections has ParallelIterables. Since we use parallelism to achieve better performance, it's interesting to ask: how well do they perform? We'll look at how these three APIs work with a critical eye toward performance. We'll also look at common performance pitfalls. Filmed at qconnewyork.com.

Craig Motlin is the technical lead for GS Collections, a full-featured open-source collections library for Java, and the author of the framework's parallel, lazy API. He has worked at Goldman Sachs for 9 years, first on several application-development teams and more recently on the JVM Architecture team, where he focuses on framework development.

Parallel-lazy Performance: Java 8 vs Scala vs GS Collections

  1. 1. This presentation reflects information available to the Technology Division of Goldman Sachs only and not any other part of Goldman Sachs. It should not be relied upon or considered investment advice. Goldman, Sachs & Co. (“GS”) does not warrant or guarantee to anyone the accuracy, completeness or efficacy of this presentation, and recipients should not rely on it except at their own risk. This presentation may not be forwarded or disclosed except with this disclaimer intact. Parallel-lazy performance Java 8 vs Scala vs GS Collections Craig Motlin June 2014
  2. 2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations /java-streams-scala-parallel- collections
  3. 3. Presented at QCon New York www.qconnewyork.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  4. 4. Goals • Compare Java Streams, Scala parallel Collections, and GS Collections • Convince you to use GS Collections • Convince you to do your own performance testing • Identify when to avoid parallel APIs • Identify performance pitfalls to avoid
  5. 5. Goals • Compare Java Streams, Scala parallel Collections, and GS Collections • Convince you to use GS Collections • Convince you to do your own performance testing • Identify when to avoid parallel APIs • Identify performance pitfalls to avoid Lots of claims and opinions
  6. 6. Goals • Compare Java Streams, Scala parallel Collections, and GS Collections • Convince you to use GS Collections • Convince you to do your own performance testing • Identify when to avoid parallel APIs • Identify performance pitfalls to avoid Lots of evidence
  7. 7. Goals • Compare Java Streams, Scala parallel Collections, and GS Collections • Convince you to use GS Collections • Convince you to do your own performance testing • Identify when to avoid parallel APIs • Identify performance pitfalls to avoid
  8. 8. Intro • Solve the same problem in all three libraries – Java (1.8.0_05) – GS Collections (5.1.0) – Scala (2.11.0) • Count how many even numbers are in a list of numbers • Then accomplish the same thing in parallel – Data-level parallelism – Batch the data – Use all the cores
  9. 9. Performance Factors Tests that isolate individual performance factors • Count • Filter, Transform, Transform, Filter, convert to List • Aggregation – Market value stats aggregated by product or category
  10. 10. Count: Serial
      Java 8:         long evens = arrayList.stream().filter(each -> each % 2 == 0).count();
      GS Collections: int evens = fastList.count(each -> each % 2 == 0);
      Scala:          val evens = arrayBuffer.count(_ % 2 == 0)
  11. 11. Count: Serial Lazy
      Java 8:         long evens = arrayList.stream().filter(each -> each % 2 == 0).count();
      GS Collections: int evens = fastList.asLazy().count(each -> each % 2 == 0);
      Scala:          val evens = arrayBuffer.view.count(_ % 2 == 0)
  12. 12. Count: Parallel Lazy
      Java 8:         long evens = arrayList.parallelStream().filter(each -> each % 2 == 0).count();
      GS Collections: int evens = fastList.asParallel(executorService, BATCH_SIZE).count(each -> each % 2 == 0);
      Scala:          val evens = arrayBuffer.par.count(_ % 2 == 0)
  13. 13. Parallel Lazy [diagram: the 1M source elements are split into batches (1-10k, 10k-20k, ..., 990k-1M); each batch is filtered and counted in one pass, yielding 5k evens per batch; the per-batch counts are reduced to the final total of 500k]
  14. 14. Parallel Eager [diagram: the 1M source elements are split into the same batches; each batch is first filtered into an intermediate collection of evens (2, 4, 6, 8, ...), each intermediate collection is then counted (5k per batch), and the counts are reduced to 500k]
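To make the shape shared by both diagrams concrete, here is a minimal hand-rolled sketch of "split into batches, count each batch on its own thread, reduce the counts by simple addition", built on a plain ExecutorService. It is illustrative only and is not how any of the three libraries is implemented; the list contents, batch size, and pool size are assumptions.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class BatchedCountSketch {
        public static void main(String[] args) throws Exception {
            List<Integer> integers = new ArrayList<>();
            for (int i = 1; i <= 1_000_000; i++) { integers.add(i); }

            int batchSize = 10_000;
            ExecutorService pool = Executors.newFixedThreadPool(8);
            List<Future<Long>> batchCounts = new ArrayList<>();

            // Batch: each task filters and counts its own slice of the list.
            for (int start = 0; start < integers.size(); start += batchSize) {
                List<Integer> batch = integers.subList(start, Math.min(start + batchSize, integers.size()));
                batchCounts.add(pool.submit(() -> {
                    long count = 0;
                    for (int each : batch) {
                        if (each % 2 == 0) { count++; }
                    }
                    return count;
                }));
            }

            // Reduce: combining the intermediate results is simple addition.
            long evens = 0;
            for (Future<Long> count : batchCounts) { evens += count.get(); }
            System.out.println(evens); // 500000
            pool.shutdown();
        }
    }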
  15. 15. Goals • Compare Java Streams, Scala parallel Collections, and GS Collections • Convince you to use GS Collections • Convince you to do your own performance testing • Identify when to avoid parallel APIs • Identify performance pitfalls to avoid Time for some numbers!
  16. 16. Serial Count [bar chart: ops/s, higher is better; serial (eager) vs serial lazy for Java 8, GS Collections, and Scala]
  17. 17. Parallel Count [bar chart: ops/s, higher is better; serial lazy vs parallel lazy for Java 8, GS Collections, and Scala; annotated '8x'; measured on an 8-core Linux VM, Intel Xeon E5-2697 v2]
  18. 18. Java Microbenchmark Harness “JMH is a Java harness for building, running, and analysing nano/micro/milli/macro benchmarks written in Java and other languages targetting the JVM.” • 5 forked JVMs per test • 100 warmup iterations per JVM • 50 measurement iterations per JVM • 1 second of looping per iteration http://openjdk.java.net/projects/code-tools/jmh/
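The harness settings above map onto JMH annotations roughly as in the sketch below. It uses the current annotation names (@Benchmark, @Fork, @Warmup, @Measurement) rather than the pre-1.0 @GenerateMicroBenchmark shown on the surrounding slides, and the class, field, and benchmark names are placeholders, not the deck's actual test class.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.TimeUnit;
    import org.openjdk.jmh.annotations.*;

    @State(Scope.Thread)
    @BenchmarkMode(Mode.Throughput)          // results reported as ops/s
    @OutputTimeUnit(TimeUnit.SECONDS)
    @Fork(5)                                 // 5 forked JVMs per test
    @Warmup(iterations = 100, time = 1)      // 100 warmup iterations, ~1 second of looping each
    @Measurement(iterations = 50, time = 1)  // 50 measurement iterations, ~1 second each
    public class CountBenchmarkSketch {
        private List<Integer> integers;

        @Setup(Level.Trial)
        public void setUp() {
            this.integers = new ArrayList<>();
            for (int i = 0; i < 1_000_000; i++) { this.integers.add(i); }
        }

        @Benchmark
        public long serial_lazy_jdk() {
            // Returning the result keeps the JIT from eliminating the work.
            return this.integers.stream().filter(each -> each % 2 == 0).count();
        }
    }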
  19. 19. Java Microbenchmark Harness @GenerateMicroBenchmark public void parallel_lazy_jdk() { long evens = this.integersJDK .parallelStream() .filter(each -> each % 2 == 0) .count(); Assert.assertEquals(SIZE / 2, evens); }
  20. 20. Java Microbenchmark Harness • @Setup includes megamorphic warmup • More info on megamorphic in the appendix • This is something that JMH does not handle for you!
  21. 21. Java Microbenchmark Harness • Throughput: higher is better • Enough warmup iterations so that standard deviation is low
      Benchmark                        Mode    Samples  Mean     Mean error  Units
      CountTest.parallel_eager_gsc     thrpt   250      629.961  8.305       ops/s
      CountTest.parallel_lazy_gsc      thrpt   250      595.023  7.153       ops/s
      CountTest.parallel_lazy_jdk      thrpt   250      415.382  7.766       ops/s
      CountTest.parallel_lazy_scala    thrpt   250      331.938  2.141       ops/s
      CountTest.serial_eager_gsc       thrpt   250      115.197  0.328       ops/s
      CountTest.serial_eager_scala     thrpt   250       91.167  0.864       ops/s
      CountTest.serial_lazy_gsc        thrpt   250       73.625  3.619       ops/s
      CountTest.serial_lazy_jdk        thrpt   250       58.182  0.477       ops/s
      CountTest.serial_lazy_scala      thrpt   250       84.200  1.033       ops/s
      ...
  22. 22. Performance Factors Tests that isolate individual performance factors • Count • Filter, Transform, Transform, Filter, convert to List • Aggregation – Market value stats aggregated by product or category
  23. 23. Java Microbenchmark Harness • Performance tests are open sourced • Read them and run them on your hardware https://github.com/goldmansachs/gs-collections/
  24. 24. Performance Factors Factors that may affect performance • Underlying container implementation • Combine strategy • Fork-join vs batching (and batch size) • Push vs pull lazy evaluation • Collapse factor • Unknown unknowns
  25. 25. Performance Factors Factors that may affect performance • Underlying container implementation • Combine strategy • Fork-join vs batching (and batch size) • Push vs pull lazy evaluation • Collapse factor • Unknown unknowns Isolated by using array-backed lists. ArrayList, FastList, and ArrayBuffer Isolated because combination of intermediate results is simple addition. Let’s look at reasons for the differences in count()
  26. 26. Count: Java 8
  27. 27. Count: Java 8 implementation @GenerateMicroBenchmark public void serial_lazy_jdk() { long evens = this.integersJDK .stream() .filter(each -> each % 2 == 0) .count(); Assert.assertEquals(SIZE / 2, evens); }
  28. 28. Count: Java 8 implementation @GenerateMicroBenchmark public void serial_lazy_jdk() { long evens = this.integersJDK .stream() .filter(each -> each % 2 == 0) .count(); Assert.assertEquals(SIZE / 2, evens); } filter(Predicate) .count() Instead of count(Predicate)
  29. 29. Count: Java 8 implementation @GenerateMicroBenchmark public void serial_lazy_jdk() { long evens = this.integersJDK .stream() .filter(each -> each % 2 == 0) .count(); Assert.assertEquals(SIZE / 2, evens); } Is count() just incrementing a counter? filter(Predicate) .count() Instead of count(Predicate)
  30. 30. Count: Java 8 implementation public final long count() { return mapToLong(e -> 1L).sum(); } public final long sum() { return reduce(0, Long::sum); } /** @since 1.8 */ public static long sum(long a, long b) { return a + b; }
  31. 31. Count: Java 8 implementation
      this.integersJDK
          .stream()
          .filter(each -> each % 2 == 0)
          .mapToLong(e -> 1L)
          .reduce(0, Long::sum);
      this.integersGSC
          .asLazy()
          .count(each -> each % 2 == 0);
  32. 32. Count: Java 8 implementation
      this.integersJDK
          .stream()
          .filter(each -> each % 2 == 0)
          .mapToLong(e -> 1L)
          .reduce(0, Long::sum);
      this.integersGSC
          .asLazy()
          .count(each -> each % 2 == 0);
      Seems like extra work
  33. 33. Count: GS Collections
  34. 34. Count: GS Collections @GenerateMicroBenchmark public void serial_lazy_gsc() { int evens = this.integersGSC .asLazy() .count(each -> each % 2 == 0); Assert.assertEquals(SIZE / 2, evens); }
  35. 35. Count: GS Collections AbstractLazyIterable.java public int count(Predicate<? super T> predicate) { CountProcedure<T> procedure = new CountProcedure<T>(predicate); this.forEach(procedure); return procedure.getCount(); }
  36. 36. Count: GS Collections FastList.java public void forEach(Procedure<? super T> procedure) { for (int i = 0; i < this.size; i++) { procedure.value(this.items[i]); } }
  37. 37. Count: GS Collections public class CountProcedure<T> implements Procedure<T> { private final Predicate<? super T> predicate; private int count; ... public void value(T object) { if (this.predicate.accept(object)) { this.count++; } } public int getCount() { return this.count; } }
  38. 38. Count: GS Collections public class CountProcedure<T> implements Procedure<T> { private final Predicate<? super T> predicate; private int count; ... public void value(T object) { if (this.predicate.accept(object)) { this.count++; } } public int getCount() { return this.count; } } Predicate from the test: each -> each % 2 == 0
  39. 39. Count: Scala
  40. 40. Count: Scala implementation TraversableOnce.scala def count(p: A => Boolean): Int = { var cnt = 0 for (x <- this) if (p(x)) cnt += 1 cnt }
  41. 41. Count: Scala implementation TraversableOnce.scala def count(p: A => Boolean): Int = { var cnt = 0 for (x <- this) if (p(x)) cnt += 1 cnt } The for-comprehension becomes a call to foreach(); the lambda closes over cnt, executes the predicate, and increments cnt, just like CountProcedure
  42. 42. Count: Scala implementation
      public final java.lang.Object apply(java.lang.Object);
           0: aload_0
           1: aload_1
           2: invokestatic  #32  // Method scala/runtime/BoxesRunTime.unboxToInt:(Ljava/lang/Object;)I
           5: invokevirtual #34  // Method apply:(I)Z
           8: invokestatic  #38  // Method scala/runtime/BoxesRunTime.boxToBoolean:(Z)Ljava/lang/Boolean;
          11: areturn
      public boolean apply$mcZI$sp(int);
           0: iload_1
           1: iconst_2
           2: irem
           3: iconst_0
           4: if_icmpne     11
           7: iconst_1
           8: goto          12
          11: iconst_0
          12: ireturn
      public final boolean apply(int);
           0: aload_0
           1: iload_1
           2: invokevirtual #21  // Method apply$mcZI$sp:(I)Z
           5: ireturn
  43. 43. Count: Scala implementation [diagram: the per-element boxing round trip: Integer to int via Integer.intValue(), the lambda _ % 2 == 0 runs on the primitive int (bytecode irem), and the boolean result is boxed back via Boolean.valueOf(boolean)]
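In Java terms, the generated function class behaves roughly like the sketch below. This is a paraphrase of the bytecode on the previous slide, not Scala-generated source, and the class name is made up.

    // Paraphrase of the bytecode: each element takes a boxed round trip through the
    // generic apply(Object) bridge before reaching the primitive-specialized body.
    public final class EvenPredicateSketch {
        public Object apply(Object boxedElement) {
            int unboxed = ((Integer) boxedElement).intValue(); // BoxesRunTime.unboxToInt
            boolean even = this.apply(unboxed);                // invokevirtual apply:(I)Z
            return Boolean.valueOf(even);                      // BoxesRunTime.boxToBoolean
        }

        public boolean apply(int element) {
            return element % 2 == 0;                           // irem + compare
        }
    }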
  44. 44. Performance Factors Factors that may affect performance • Underlying container implementation • Combine strategy • Fork-join vs batching (and batch size) • Push vs pull lazy evaluation • Collapse factor • Unknown unknowns Scala’s auto-boxing Java’s pull lazy evaluation
  45. 45. Performance Factors Tests that isolate individual performance factors • Count • Filter, Transform, Transform, Filter, convert to List • Aggregation – Market value stats aggregated by product or category
  46. 46. Parallel / Lazy / JDK List<Integer> list = this.integersJDK .parallelStream() .filter(each -> each % 10_000 != 0) .map(String::valueOf) .map(Integer::valueOf) .filter(each -> (each + 1) % 10_000 != 0) .collect(Collectors.toList()); Verify.assertSize(999_800, list);
  47. 47. Parallel / Lazy / GSC MutableList<Integer> list = this.integersGSC .asParallel(this.executorService, BATCH_SIZE) .select(each -> each % 10_000 != 0) .collect(String::valueOf) .collect(Integer::valueOf) .select(each -> (each + 1) % 10_000 != 0) .toList(); Verify.assertSize(999_800, list);
  48. 48. Parallel / Lazy / Scala val list = this.integers .par .filter(each => each % 10000 != 0) .map(String.valueOf) .map(Integer.valueOf) .filter(each => (each + 1) % 10000 != 0) .toBuffer Assert.assertEquals(999800, list.size)
  49. 49. Stacked computation [bar chart: ops/s, higher is better; serial lazy vs parallel lazy for Java 8, GS Collections, and Scala; annotated '8x']
  50. 50. Parallel / Lazy / JDK List<Integer> list = this.integersJDK .parallelStream() .filter(each -> each % 10_000 != 0) .map(String::valueOf) .map(Integer::valueOf) .filter(each -> (each + 1) % 10_000 != 0) .collect(Collectors.toList()); Verify.assertSize(999_800, list);
  51. 51. Parallel / Lazy / JDK List<Integer> list = this.integersJDK .parallelStream() .filter(each -> each % 10_000 != 0) .map(String::valueOf) .map(Integer::valueOf) .filter(each -> (each + 1) % 10_000 != 0) .collect(Collectors.toList()); Verify.assertSize(999_800, list); ArrayList::new List::add (left, right) -> { left.addAll(right); return left; }
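The three callouts above are the supplier, accumulator, and combiner that Collectors.toList() is built from. Spelled out with Collector.of, a rough equivalent looks like the sketch below (illustrative, not the JDK's source).

    import java.util.ArrayList;
    import java.util.List;
    import java.util.stream.Collector;

    public class ToListSketch {
        static <T> Collector<T, List<T>, List<T>> toListSketch() {
            return Collector.<T, List<T>>of(
                    ArrayList::new,     // supplier: one ArrayList per leaf task
                    List::add,          // accumulator: add each element to that task's list
                    (left, right) -> {  // combiner: merge a pair of intermediate lists
                        left.addAll(right);
                        return left;
                    });
        }
    }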
  52. 52. Fork-Join Merge • Intermediate results are merged in a tree • Merging is O(n log n) work and garbage
  53. 53. Fork-Join Merge • Amount of work done by last thread is O(n)
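A minimal sketch of why the merge tree costs O(n log n): each combine step copies the right-hand list into the left via addAll, each level of the tree copies roughly half of all elements, and there are about log2(#batches) levels. This is illustrative only, not the Stream framework's actual code.

    import java.util.ArrayList;
    import java.util.List;

    public class MergeCostSketch {
        // Pairwise merge, like the combiner tree above. The final top-level merge
        // alone copies O(n) elements, which is the work left to the last thread.
        static <T> List<T> merge(List<List<T>> batches) {
            while (batches.size() > 1) {
                List<List<T>> next = new ArrayList<>();
                for (int i = 0; i < batches.size(); i += 2) {
                    List<T> left = batches.get(i);
                    if (i + 1 < batches.size()) {
                        left.addAll(batches.get(i + 1)); // copies the right side, plus garbage on resize
                    }
                    next.add(left);
                }
                batches = next;
            }
            return batches.get(0);
        }
    }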
  54. 54. Parallel / Lazy / GSC MutableList<Integer> list = this.integersGSC .asParallel(this.executorService, BATCH_SIZE) .select(each -> each % 10_000 != 0) .collect(String::valueOf) .collect(Integer::valueOf) .select(each -> (each + 1) % 10_000 != 0) .toList(); Verify.assertSize(999_800, list); ParallelIterable.toList() returns a CompositeFastList, a List with O(1) implementation of addAll()
  55. 55. Parallel / Lazy / GSC public final class CompositeFastList<E> { private final FastList<FastList<E>> lists = FastList.newList(); public boolean addAll(Collection<? extends E> collection) { FastList<E> collectionToAdd = collection instanceof FastList ? (FastList<E>) collection : new FastList<E>(collection); this.lists.add(collectionToAdd); return true; } ... }
  56. 56. CompositeFastList Merge • Merging is O(1) work per batch CFL
  57. 57. Performance Factors Factors that may affect performance • Underlying container implementation • Combine strategy • Fork-join vs batching (and batch size) • Push vs pull lazy evaluation • Collapse factor • Unknown unknowns Fork-join is general purpose but requires merge work Specialized data structures meant for combining
  58. 58. Thread Pools
  59. 59. Parallel: GSC .asParallel(this.executorService, BATCH_SIZE) • You must specify your own batch size – 10,000 is fine – size / (8 * #cores) is fine • You must specify your own thread pool – Can be shared, or not – Can be tailored for CPU-bound work: Executors.newFixedThreadPool( Runtime.getRuntime().availableProcessors()) – Or for I/O-bound work: Executors.newFixedThreadPool(maxDbConnections)
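A minimal sketch that puts the pieces from this slide together for a CPU-bound pool; the list contents and batch size are illustrative.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import com.gs.collections.impl.list.mutable.FastList;

    public class AsParallelSketch {
        public static void main(String[] args) {
            FastList<Integer> integers = FastList.newList();
            for (int i = 1; i <= 1_000_000; i++) { integers.add(i); }

            // CPU-bound work: one thread per core, as suggested above.
            ExecutorService pool = Executors.newFixedThreadPool(
                    Runtime.getRuntime().availableProcessors());
            int batchSize = 10_000;

            int evens = integers
                    .asParallel(pool, batchSize)
                    .count(each -> each % 2 == 0);

            System.out.println(evens); // 500000
            pool.shutdown();
        }
    }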
  60. 60. Parallel: Scala • One shared fork-join pool, configurable • Batch sizes are dynamic and respond to work stealing • Minimum batch size: 1 + size / (8 * #cores)
  61. 61. Parallel: Java 8 • One shared fork-join pool, not configurable (you cannot supply your own) • Batch sizes are dynamic and respond to work stealing • Minimum batch size: – max(1, size / (4 * (#cores - 1))) – The default pool has #cores - 1 worker threads, and the calling (main) thread helps as well – Parallelism can be changed with the system property java.util.concurrent.ForkJoinPool.common.parallelism
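A small sketch of the system property mentioned above. The property is read once, when the common ForkJoinPool is initialized, so it must be set before anything uses that pool; the value 16 is arbitrary.

    import java.util.stream.IntStream;

    public class CommonPoolParallelismSketch {
        public static void main(String[] args) {
            // Set before the first parallel stream (or anything else) touches the common pool.
            System.setProperty("java.util.concurrent.ForkJoinPool.common.parallelism", "16");

            long evens = IntStream.rangeClosed(1, 1_000_000)
                    .parallel()
                    .filter(each -> each % 2 == 0)
                    .count();
            System.out.println(evens); // 500000
        }
    }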
  62. 62. Aggregation
  63. 63. Aggregation Domain
  64. 64. Aggregate by Categories [bar chart: serial lazy vs parallel lazy for Java 8 and GS Collections; annotated '8x']
  65. 65. Aggregate by Accounts [bar chart: serial lazy vs parallel lazy for Java 8 and GS Collections; annotated '8x']
  66. 66. Aggregate by Category Streams Map<String, DoubleSummaryStatistics> categoryDoubleMap = this.jdkPositions.parallelStream().collect( Collectors.groupingBy( Position::getCategory, Collectors.summarizingDouble(Position::getMarketValue)));
  67. 67. Aggregate by Category GSC MapIterable<String, MarketValueStatistics> categoryDoubleMap = this.gscPositions.asParallel(this.executorService, BATCH_SIZE) .aggregateInPlaceBy( Position::getCategory, MarketValueStatistics::new, MarketValueStatistics::acceptThis);
  68. 68. Aggregate by Category GSC MapIterable<String, MarketValueStatistics> categoryDoubleMap = this.gscPositions.asParallel(this.executorService, BATCH_SIZE) .aggregateInPlaceBy( Position::getCategory, MarketValueStatistics::new, MarketValueStatistics::acceptThis); What if we group by Account instead?
  69. 69. Aggregate by Account GSC MapIterable<Account, MarketValueStatistics> accountDoubleMap = this.gscPositions.asParallel(this.executorService, BATCH_SIZE) .aggregateInPlaceBy( Position::getAccount, MarketValueStatistics::new, MarketValueStatistics::acceptThis); What if we group by Account instead?
  70. 70. Collapse factor
      MapIterable<String, MarketValueStatistics> categoryDoubleMap
      There are 26 categories, so the map has 26 keys
      MapIterable<Account, MarketValueStatistics> accountDoubleMap
      There are 100k accounts, so the map has 100k keys
  71. 71. Collapse factor • Aggregate: Java Streams – Uses fork/join – Each forked task creates a map – Each join step merges two maps – The joined map is roughly the same size – Merge is costly when there are many keys • Aggregate: GS Collections – Uses a single ConcurrentMap for the results – Each batched task writes into the map simultaneously with atomic operation ConcurrentHashMapUnsafe.updateValueWith() – Contention is costly when there are few keys
  72. 72. Collapse factor • Aggregate: Java Streams – Uses fork/join – Each forked task creates a map – Each join step merges two maps – The joined map is roughly the same size – Merge is costly when there are many keys • Aggregate: GS Collections – Uses a single ConcurrentMap for the results – Each batched task writes into the map simultaneously with atomic operation ConcurrentHashMapUnsafe.updateValueWith() – Contention is costly when there are few keys See Mohammad Rezaei’s presentation from QCon 2012 called “Fine Grained Coordinated Parallelism in a Real World Application.”
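A sketch of the "one shared concurrent map, atomic per-key update" shape described above, written against the JDK's ConcurrentHashMap rather than GS Collections' internal ConcurrentHashMapUnsafe. The Position accessors mirror the deck's aggregation domain; the class, method, and variable names are illustrative.

    import java.util.DoubleSummaryStatistics;
    import java.util.List;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    public class SingleMapAggregationSketch {
        // Minimal stand-in for the deck's Position domain class.
        interface Position {
            String getCategory();
            double getMarketValue();
        }

        static ConcurrentMap<String, DoubleSummaryStatistics> aggregateByCategory(List<Position> positions) {
            ConcurrentMap<String, DoubleSummaryStatistics> stats = new ConcurrentHashMap<>();
            // Every worker writes into the same map; compute() makes each per-key update
            // atomic, so there are no per-task maps to merge afterwards. The trade-off from
            // the slide: with few keys (26 categories), threads contend on the same entries.
            positions.parallelStream().forEach(position ->
                    stats.compute(position.getCategory(), (category, summary) -> {
                        DoubleSummaryStatistics s = (summary == null) ? new DoubleSummaryStatistics() : summary;
                        s.accept(position.getMarketValue());
                        return s;
                    }));
            return stats;
        }
    }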
  73. 73. Performance Factors Factors that may affect performance • Underlying container implementation • Combine strategy • Fork-join vs batching (and batch size) • Push vs pull lazy evaluation • Collapse factor • Unknown unknowns Test groupBy, aggregateBy
  74. 74. Goals • Compare Java Streams, Scala parallel Collections, and GS Collections • Convince you to use GS Collections • Convince you to do your own performance testing • Identify when to avoid parallel APIs • Identify performance pitfalls to avoid
  75. 75. Goals • Compare Java Streams, Scala parallel Collections, and GS Collections • Convince you to use GS Collections • Convince you to do your own performance testing • Identify when to avoid parallel APIs • Identify performance pitfalls to avoid
  76. 76. Goals • Compare Java Streams, Scala parallel Collections, and GS Collections • Convince you to use GS Collections • Convince you to do your own performance testing • Identify when to avoid parallel APIs • Identify performance pitfalls to avoid
  77. 77. Q&A
  78. 78. Q&A http://github.com/goldmansachs/gs-collections http://github.com/goldmansachs/gs-collections-kata @GoldmanSachs http://stackoverflow.com/questions/tagged/gs-collections craig.motlin@gs.com Info in appendix • Sets • Handcoded parallelism • Megamorphic warmup
  79. 79. Appendix
  80. 80. Hashtable Sets
  81. 81. Performance Factors Factors that may affect performance • Underlying container implementation • Combine strategy • Fork-join vs batching (and batch size) • Push vs pull lazy evaluation • Collapse factor • Unknown unknowns Isolated by using array-backed lists. ArrayList, FastList, and ArrayBuffer What if we use Java’s HashSet, Scala’s HashSet, and GS Collections’ UnifiedSet?
  82. 82. Parallel Count, Lists: FastList | ArrayList | ArrayBuffer [bar chart: ops/s, higher is better; serial lazy vs parallel lazy for Java 8, GS Collections, and Scala; annotated '8x'; measured on an 8-core Linux VM, Intel Xeon E5-2697 v2]
  83. 83. Parallel Count, Sets: UnifiedSet | HashSet (Java's) | HashSet (Scala's) [bar chart: ops/s, higher is better; serial lazy vs parallel lazy for Java 8, GS Collections, and Scala; annotated '8x']
  84. 84. Parallel / Lazy / GSC MutableSet<Integer> set = this.integersGSC .asParallel(this.executorService, BATCH_SIZE) .select(each -> each % 10_000 != 0) .collect(String::valueOf) .collect(Integer::valueOf) .select(each -> (each + 1) % 10_000 != 0) .toSet(); Verify.assertSize(999_800, set); ParallelIterable.toSet() uses a concurrent set. No combination step. Order is not preserved.
  85. 85. Hand coded parallelism
  86. 86. Hand coded Parallel / Lazy MutableList<Integer> list = this.integersGSC .asParallel(this.executorService, BATCH_SIZE) .select(integer -> integer % 10_000 != 0 && (Integer.valueOf(String.valueOf(integer)) + 1) % 10_000 != 0) .toList(); Verify.assertSize(999_800, list);
  87. 87. Stacked computation [bar chart: ops/s, higher is better; serial lazy vs parallel lazy vs parallel hand-coded for Java 8, GS Collections, and Scala; annotated '8x']
  88. 88. Method inlining
  89. 89. Count: SAM method calls • Let’s take a closer look at both implementations of count() • Let’s assume that @FunctionalInterface method calls are costly and count them as we go • We’ll revisit this assumption
  90. 90. Count: GS Collections
      java.lang.Thread.State: RUNNABLE
          at com.gs.collections.impl.block.procedure.CountProcedure.value(CountProcedure.java:47)
          at com.gs.collections.impl.list.mutable.FastList.forEach(FastList.java:623)
          at com.gs.collections.impl.utility.Iterate.forEach(Iterate.java:114)
          at com.gs.collections.impl.lazy.LazyIterableAdapter.forEach(LazyIterableAdapter.java:49)
          at com.gs.collections.impl.lazy.AbstractLazyIterable.count(AbstractLazyIterable.java:461)
          at com.gs.collections.impl.jmh.CountTest.serial_lazy_gsc(CountTest.java:302)
      • Execution of the lazy evaluation • Executed once per element • We'll look for @FunctionalInterface method calls here
  91. 91. Count: GS Collections Grand total of 2 @FunctionalInterface method calls
  92. 92. Count: Java 8
      java.lang.Thread.State: RUNNABLE
          at java.lang.Long.sum(Long.java:1587)
          at java.util.stream.LongPipeline$$Lambda$3/887750041.applyAsLong(Unknown Source:-1)
          at java.util.stream.ReduceOps$8ReducingSink.accept(ReduceOps.java:394)
          at java.util.stream.ReferencePipeline$5$1.accept(ReferencePipeline.java:227)
          at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
          at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1359)
          at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:512)
          at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:502)
          at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
          at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
          at java.util.stream.LongPipeline.reduce(LongPipeline.java:438)
          at java.util.stream.LongPipeline.sum(LongPipeline.java:396)
          at java.util.stream.ReferencePipeline.count(ReferencePipeline.java:526)
          at com.gs.collections.impl.jmh.CountTest.serial_lazy_jdk(CountTest.java:278)
      • Execution of the pipeline • Executed once per element • We'll look for @FunctionalInterface method calls here
  93. 93. Count: Java 8 Grand total of 6 @FunctionalInterface method calls
  94. 94. Count: Scala Scala implementation is similar to GS Collections Grand total of 2 @FunctionalInterface method calls
  95. 95. @FunctionalInterface method calls • Why do we care about @FunctionalInterface method calls? • The JDK inlines short method bodies like our Predicates • The exact nature of the inlining has a dramatic impact on performance
  96. 96. @FunctionalInterface method calls • JMH forks a new JVM for each test • During both stages of JIT compilation, this.predicate is our test Predicate • The JVM will perform monomorphic inlining public void value(T object) { if (this.predicate.accept(object)) { this.count++; } } Predicate from the test: each -> each % 2 == 0
  97. 97. @FunctionalInterface method calls The dispatch algorithm in pseudo code if (this.predicate instanceof lambda$serial_lazy_gsc$1) { if (object % 2 == 0) { this.count++; } } else { [recompile] if (this.predicate.accept(object)) { this.count++; } }
  98. 98. @FunctionalInterface method calls • The next recompilation will result in bimorphic inlining • The next recompilation will result in megamorphic method dispatch • Classic table lookup and jump • In other words, no inlining • Dramatic performance penalty for fast methods like count()
  99. 99. Megamorphic method dispatch How do we trigger megamorphic deoptimization? @Setup(Level.Trial) public void setUp_megamorphic() { long evens = this.integersJDK.stream().filter(each -> each % 2 == 0).count(); Assert.assertEquals(SIZE / 2, evens); long odds = this.integersJDK.stream().filter(each -> each % 2 == 1).count(); Assert.assertEquals(SIZE / 2, odds); long evens2 = this.integersJDK.stream().filter(each -> (each & 1) == 0).count(); Assert.assertEquals(SIZE / 2, evens2); } This is something that JMH does not handle for you!
  100. 100. Megamorphic Count [bar chart: ops/s, higher is better; serial lazy vs megamorphic serial lazy for Java 8, GS Collections, and Scala; annotated '8x']
  101. 101. Megamorphic method dispatch • Why force megamorphic deoptimization? • Some implementations will have extra virtual method calls (@FunctionalInterface method calls) • Microbenchmarks aren’t realistic, but which is more realistic (less unrealistic?) • You will trigger this deoptimization in normal production code, as soon as there is more than one call to this api anywhere in the executed code
  102. 102. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/java- streams-scala-parallel-collections
