
Shooting the Rapids


Improved version of JavaOne talk, delivered at Devoxx Belgium, November 2015



  1. Shooting the Rapids: Getting the Best from Java 8 Streams. Kirk Pepperdine (@kcpeppe), Maurice Naftalin (@mauricenaftalin). Devoxx Belgium, Nov. 2015
  2. About Kirk • specialises in performance tuning • speaks frequently about performance • author of a performance tuning workshop • co-founder, performance diagnostic tooling • Java Champion (since 2006)
  4. About Maurice • co-author and author of books on Java • Java Champion • JavaOne Rock Star
  8. Subjects Covered in this Talk • Background: lambdas and streams • Performance of our example • Effect of parallelizing • Splitting input data efficiently • When to go parallel • Parallel streams in the real world
  9. Benchmark Alert
  10. What is a Lambda? The anonymous-class form:

 Predicate<Matcher> matches = new Predicate<Matcher>() {
     @Override
     public boolean test(Matcher matcher) {
         return matcher.find();
     }
 };

 becomes, written as a lambda:

 Predicate<Matcher> matches = matcher -> matcher.find();

 A lambda is a function from arguments to result.
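The two forms above behave identically; a minimal runnable sketch (the class name `LambdaDemo` and the test string are ours, not from the talk):

```java
import java.util.function.Predicate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LambdaDemo {
    // Anonymous inner class form
    static final Predicate<Matcher> MATCHES_OLD = new Predicate<Matcher>() {
        @Override
        public boolean test(Matcher matcher) {
            return matcher.find();
        }
    };

    // Equivalent lambda: a function from arguments to result
    static final Predicate<Matcher> MATCHES = matcher -> matcher.find();

    public static void main(String[] args) {
        // Each predicate gets a fresh Matcher, since find() advances state
        Matcher m = Pattern.compile("time").matcher("Application time: 1.0");
        System.out.println(MATCHES.test(m)); // prints true
    }
}
```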
  17. Example: Processing GC Logfile ⋮ 2.869: Application time: 1.0001540 seconds 5.342: Application time: 0.0801231 seconds 8.382: Application time: 1.1013574 seconds ⋮ Desired output: DoubleSummaryStatistics {count=3, sum=2.181635, min=0.080123, average=0.727212, max=1.101357}
  19. Old School Code

 DoubleSummaryStatistics summary = new DoubleSummaryStatistics();
 Pattern stoppedTimePattern = Pattern.compile("Application time: (\\d+\\.\\d+)");
 String logRecord;
 while ((logRecord = logFileReader.readLine()) != null) {
     Matcher matcher = stoppedTimePattern.matcher(logRecord);
     if (matcher.find()) {
         double value = Double.parseDouble(matcher.group(1));
         summary.accept(value);
     }
 }
  20. Let's look at the features in this code: • Data Source • Map to Matcher • Filter • Map to Double • Collect Results (Reduce)
  26. Java 8 Streams • A sequence of values, "in motion" • source and intermediate operations set the stream up lazily • a terminal operation "pulls" values eagerly down the stream: collection.stream() .intermediateOp ⋮ .intermediateOp .terminalOp
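The laziness can be observed directly: intermediate operations do no work until a terminal operation pulls values, and a short-circuiting terminal operation pulls only as many as it needs. A small sketch (the class name and sample data are ours):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class LazyDemo {
    // Counts how many source elements actually flow through the pipeline.
    static int elementsSeen(List<String> lines) {
        AtomicInteger seen = new AtomicInteger();
        lines.stream()
             .peek(s -> seen.incrementAndGet())   // runs only when a value is pulled
             .filter(s -> s.startsWith("a"))
             .findFirst();                        // short-circuits at the first match
        return seen.get();
    }

    public static void main(String[] args) {
        // The first element already matches, so only one value is pulled.
        System.out.println(elementsSeen(List.of("apple", "banana", "avocado"))); // prints 1
    }
}
```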
  27. Stream Sources • New method Collection.stream() • Many other sources: • Arrays.stream(Object[]) • Stream.of(Object...) • Stream.iterate(Object, UnaryOperator) • Files.lines() • BufferedReader.lines() • Random.ints() • JarFile.stream() • …
  28. Imperative to Stream

 DoubleSummaryStatistics statistics =
     Files.lines(new File("gc.log").toPath())
          .map(stoppedTimePattern::matcher)
          .filter(Matcher::find)
          .map(matcher -> matcher.group(1))
          .mapToDouble(Double::parseDouble)
          .summaryStatistics();
  29. The same pipeline, annotated: Files.lines(...) is the stream source; the map and filter calls are intermediate operations; stoppedTimePattern::matcher and Matcher::find are method references; summaryStatistics() is the terminal operation.
  33. Visualising Sequential Streams: "values in motion" flow one at a time from the Source through the intermediate operations (Map, Filter) to the terminal operation (Reduction); elements rejected by the filter drop out of the stream.
  37. How Does It Perform? Old School: 13.3 secs; Sequential: 13.8 secs. Should be the same workload; the stream code is cleaner and easier to read. (24M-line file, MacBook Pro, Haswell i7, 4 cores, hyperthreaded, Java 9.0)
  38. Can We Do Better? • We might be able to, if the workload is parallelizable: split the stream into many segments, process each segment, combine the results • These requirements exactly match the Fork/Join workflow
  39. Visualizing Parallel Streams: the source is split into segments, each segment is processed independently, and the partial results are combined.
  43. Splitting Stream Sources • The stream source is a Spliterator • it can both iterate over data and, where possible, split it
  44. Splitting the Data
  53. Parallel Streams

 DoubleSummaryStatistics statistics =
     Files.lines(new File("gc.log").toPath())
          .parallel()
          .map(stoppedTimePattern::matcher)
          .filter(Matcher::find)
          .map(matcher -> matcher.group(1))
          .mapToDouble(Double::parseDouble)
          .summaryStatistics();
  54. About Fork/Join • Introduced in Java 7 • draws from a common pool of ForkJoinWorkerThreads • default pool size == HW cores - 1 • assumes the workload will be CPU bound • On its own, not an easy coding idiom • parallel streams provide an abstraction layer • the Spliterator defines how to split the stream • framework code submits sub-tasks to the common Fork/Join pool
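The "cores - 1" default can be checked at run time (the submitting thread itself also participates in the work, which is why one fewer worker is created). A sketch, assuming the default common-pool configuration has not been overridden via system property:

```java
import java.util.concurrent.ForkJoinPool;

public class PoolDemo {
    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        // Parallel streams run on the common pool; by default its
        // parallelism is max(1, cores - 1).
        int parallelism = ForkJoinPool.commonPool().getParallelism();
        System.out.println(cores + " cores, common-pool parallelism " + parallelism);
    }
}
```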
  55. How Does That Perform? Old School: 13.3 secs; Sequential: 13.8 secs; Parallel: 9.5 secs. 1.45x faster, but not 8x faster (????). (24M lines, 2.8GHz 8-core i7, 16GB, OS X, Java 9.0)
  56. In Fact!!!! • Different benchmarks yield a mixed bag of results: some were better, some were the same, some were worse!
  57. Open Questions • Under what conditions are things better, or worse? • When should we parallelize, and when is serial better? The answer depends on where the bottleneck is.
  59. Where is Our Bottleneck? • I/O operations: not a surprise, we're reading from a file • Java 9 uses FileChannelLinesSpliterator, about 2x better than Java 8's implementation

 76.0% 0 + 5941 sun.nio.ch.FileDispatcherImpl.pread0
  60. Poorly Splitting Sources • Some sources split worse than others: LinkedList vs ArrayList • Streaming I/O is problematic: more threads means more pressure on a contended resource, causing thrashing and other ill effects • Workload size doesn't cover the overheads
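One way to see the LinkedList vs ArrayList difference is through spliterator characteristics: SUBSIZED means every split knows its exact size up front, which lets the framework balance work cheaply; a LinkedList spliterator cannot promise that. A sketch (class and method names are ours):

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Spliterator;

public class SourceQuality {
    // True when every split of this source reports an exact size.
    static boolean splitsWell(List<?> list) {
        return list.spliterator().hasCharacteristics(Spliterator.SUBSIZED);
    }

    public static void main(String[] args) {
        System.out.println(splitsWell(new ArrayList<>(List.of(1, 2, 3))));
        System.out.println(splitsWell(new LinkedList<>(List.of(1, 2, 3))));
    }
}
```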
  61. Streaming I/O Bottleneck: with a single sequential input stream, parallel workers end up queuing behind the read.
  63. LineSpliterator: the log file is covered by a MappedByteBuffer; to split, the spliterator takes the midpoint of its coverage and scans to the nearest line terminator, handing the first half to the new spliterator and keeping the rest. Included in JDK 9 as FileChannelLinesSpliterator.
  69. In-memory Comparison • Read the GC log into an ArrayList prior to processing
  70. In-memory Comparison: Old School: 9.4 secs; Sequential: 9.9 secs; Parallel: 2.7 secs. 4.25x faster: better, but still not 8x faster. (24M lines, 2.8GHz 8-core i7, 16GB, OS X, JDK 9.0)
  71. Justifying the Overhead: the CPNQ performance model • C = number of submitters • P = number of CPUs • N = number of elements • Q = cost of the operation. The cost of the intermediate operations is N * Q; the overhead of setting up the F/J framework is ~100µs.
  72. Amortizing Setup Costs • N*Q needs to be large • Q can often only be estimated • N may only be known at run time • Rule of thumb: N > 10,000 • P is the number of processors: P == number of cores when CPU bound, P < number of cores otherwise
  73. Other Gotchas • Frequent hand-offs place pressure on thread schedulers; the effect is magnified when a hypervisor is involved • an estimated 80,000 cycles to hand off data between threads, and you can do a lot of processing in 80,000 cycles • Too many threads places pressure on thread schedulers and is responsible for other ill effects (TTSP) • too few threads may leave the hardware under-utilized
  74. Simulated Server Environment

 ExecutorService threadPool = Executors.newFixedThreadPool(10);
 threadPool.execute(() -> {
     try {
         long timer = System.currentTimeMillis();
         value = Files.lines(new File("gc.log").toPath()).parallel()
             .map(applicationStoppedTimePattern::matcher)
             .filter(Matcher::find)
             .map(matcher -> matcher.group(2))
             .mapToDouble(Double::parseDouble)
             .summaryStatistics().getSum();
     } catch (Exception ex) {}
 });
  75. Work Flow and Results • The first task to arrive consumes all the ForkJoinWorkerThreads • downstream tasks wait for a ForkJoinWorkerThread, then start intermixing with the initial task • The initial task collects dead time as it competes for threads • all other tasks collect dead time as they either compete or wait for a ForkJoinWorkerThread • The system is stressed beyond capacity
  77. Intermediate Operation Bottleneck • The bottleneck is in pattern matching • but the streaming infrastructure isn't far behind!

 68.6% 1384 + 0 java.util.regex.Pattern$Curly.match
 26.6% 521 + 15 java.util.stream.ReferencePipeline$3$1.accept
  79. Tragedy of the Commons • Garrett Hardin, ecologist (1968): imagine the grazing of animals on a common ground. Each flock owner gains if they add to their own flock, but every animal added to the total degrades the commons a small amount.
  81. Tragedy of the Commons • You have a finite amount of hardware; it might be in your best interest to grab it all, but if everyone behaves the same way…
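A common mitigation, widely relied on in practice though not guaranteed by the specification, is to run a parallel stream's terminal operation from inside a dedicated ForkJoinPool so that one job cannot monopolize the common pool. A sketch (the class and method names are ours):

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ForkJoinPool;
import java.util.stream.IntStream;

public class BoundedParallel {
    // Submitting the terminal operation from inside a private pool makes
    // the parallel stream run on that pool instead of the common pool.
    // NOTE: implementation behaviour, not a documented contract.
    static long sumWithParallelism(int parallelism, int n) {
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            return pool.submit(() ->
                IntStream.rangeClosed(1, n).parallel().asLongStream().sum()
            ).get();
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Only two worker threads are used, however many cores exist.
        System.out.println(sumWithParallelism(2, 1_000_000));
    }
}
```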
  82. Simulated Server Environment • Submit 10 tasks to Fork/Join (via the Executor thread pool) • the first result comes out in 32 seconds, compared to 9.5 seconds for an individually submitted task • the high system time reflects that the task is I/O bound
  84. In-Memory Variation • Preload the log file • Submit 10 tasks to Fork/Join (via the Executor thread pool) • the first result comes out in 23 seconds, compared to 4.5 seconds for an individually submitted task • the task is CPU bound
  87. Conclusions • Sequential stream performance is comparable to imperative code • Going parallel is worthwhile IF the task is suitable (expensive enough to amortize setup costs, no inter-task communication needed), the data source is suitable, and the environment is suitable • We need to monitor the JDK to understand bottlenecks; the Fork/Join pool is not well instrumented
  88. Resources http://gee.cs.oswego.edu/dl/html/StreamParallelGuidance.html
