Good and Wicked Fairies, and the Tragedy of the Commons: Understanding the Performance of Java 8 Streams

Tutorial talk, given jointly with Kirk Pepperdine, at Devoxx Belgium, November 2015


  1. 1. Good and Wicked Fairies The Tragedy of the Commons Understanding the Performance of Java 8 Streams Kirk Pepperdine @kcpeppe Maurice Naftalin @mauricenaftalin #java8fairies
  2. 2. About Kirk • Specialises in performance tuning • speaks frequently about performance • author of performance tuning workshop • Co-founder jClarity • performance diagnostic tooling • Java Champion (since 2006)
  3. 3. About Kirk • Specialises in performance tuning • speaks frequently about performance • author of performance tuning workshop • Co-founder jClarity • performance diagnostic tooling • Java Champion (since 2006)
  4. 4. Kirk’s quiz: what is this?
  5. 5. About Maurice Developer, designer, architect, teacher, learner, writer
  6. 6. About Maurice Co-author Developer, designer, architect, teacher, learner, writer
  7. 7. About Maurice Co-author Author Developer, designer, architect, teacher, learner, writer
  8. 8. About Maurice Co-author Author Developer, designer, architect, teacher, learner, writer
  9. 9. @mauricenaftalin The Lambda FAQ www.lambdafaq.org
  10. 10. Varian 620/i: First Computer I Used
  11. 11. Agenda #java8fairies
  12. 12. Agenda • Define the problem #java8fairies
  13. 13. Agenda • Define the problem • Implement a solution #java8fairies
  14. 14. Agenda • Define the problem • Implement a solution • Analyse performance #java8fairies
  15. 15. Agenda • Define the problem • Implement a solution • Analyse performance – find the bottleneck #java8fairies
  16. 16. Agenda • Define the problem • Implement a solution • Analyse performance – find the bottleneck #java8fairies
  17. 17. Agenda • Define the problem • Implement a solution • Analyse performance – find the bottleneck • Fork/Join parallelism in the real world #java8fairies
  18. 18. Agenda • Define the problem • Implement a solution • Analyse performance – find the bottleneck • Fork/Join parallelism in the real world #java8fairies
  19. 19. Agenda • Define the problem • Implement a solution • Analyse performance – find the bottleneck • Fork/Join parallelism in the real world #java8fairies
  20. 20. Case Study: grep -b grep -b: “The offset in bytes of a matched pattern is displayed in front of the matched line.”
  21. 21. Case Study: grep -b The Moving Finger writes; and, having writ, Moves on: nor all thy Piety nor Wit Shall bring it back to cancel half a Line Nor all thy Tears wash out a Word of it. rubai51.txt grep -b: “The offset in bytes of a matched pattern is displayed in front of the matched line.”
  22. 22. Case Study: grep -b The Moving Finger writes; and, having writ, Moves on: nor all thy Piety nor Wit Shall bring it back to cancel half a Line Nor all thy Tears wash out a Word of it. rubai51.txt grep -b: “The offset in bytes of a matched pattern is displayed in front of the matched line.” $ grep -b 'W.*t' rubai51.txt 44:Moves on: nor all thy Piety nor Wit 122:Nor all thy Tears wash out a Word of it.
  23. 23. Case Study: grep -b
  24. 24. Case Study: grep -b Obvious iterative implementation – process file line by line – maintain byte displacement in an accumulator variable
  25. 25. Case Study: grep -b Obvious iterative implementation – process file line by line – maintain byte displacement in an accumulator variable Objective here is to implement it in stream code – no accumulators allowed!
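  For comparison, a minimal sketch of that iterative version (not taken from the slides: the class name, argument handling and the assumption of a one-byte '\n' line terminator are illustrative):

      import java.io.BufferedReader;
      import java.io.IOException;
      import java.nio.charset.StandardCharsets;
      import java.nio.file.Files;
      import java.nio.file.Paths;
      import java.util.regex.Pattern;

      // Hypothetical iterative grep -b: print the byte offset of each matching line.
      public class IterativeGrepB {
          public static void main(String[] args) throws IOException {
              Pattern pattern = Pattern.compile(args[0]);
              long offset = 0;                                   // the accumulator variable
              try (BufferedReader reader = Files.newBufferedReader(
                      Paths.get(args[1]), StandardCharsets.UTF_8)) {
                  String line;
                  while ((line = reader.readLine()) != null) {
                      if (pattern.matcher(line).find()) {
                          System.out.println(offset + ":" + line);
                      }
                      // assumes every line ends with a single '\n'
                      offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
                  }
              }
          }
      }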
  26. 26. Streams – Why? • Intention: replace loops for aggregate operations
  27. 27. Streams – Why? • Intention: replace loops for aggregate operations
      Instead of writing this:
        List<Person> people = …
        Set<City> shortCities = new HashSet<>();
        for (Person p : people) {
            City c = p.getCity();
            if (c.getName().length() < 4) {
                shortCities.add(c);
            }
        }
  28. 28. Streams – Why? • Intention: replace loops for aggregate operations • more concise, more readable, composable operations, parallelizable
      Instead of writing this:
        List<Person> people = …
        Set<City> shortCities = new HashSet<>();
        for (Person p : people) {
            City c = p.getCity();
            if (c.getName().length() < 4) {
                shortCities.add(c);
            }
        }
      we’re going to write this:
        Set<City> shortCities = people.stream()
            .map(Person::getCity)
            .filter(c -> c.getName().length() < 4)
            .collect(toSet());
  29. 29. Streams – Why? • Intention: replace loops for aggregate operations • more concise, more readable, composable operations, parallelizable
      Instead of writing this:
        List<Person> people = …
        Set<City> shortCities = new HashSet<>();
        for (Person p : people) {
            City c = p.getCity();
            if (c.getName().length() < 4) {
                shortCities.add(c);
            }
        }
      we’re going to write this:
        Set<City> shortCities = people.stream()
            .map(Person::getCity)
            .filter(c -> c.getName().length() < 4)
            .collect(toSet());
  30. 30. Streams – Why? • Intention: replace loops for aggregate operations • more concise, more readable, composable operations, parallelizable
      Instead of writing this:
        List<Person> people = …
        Set<City> shortCities = new HashSet<>();
        for (Person p : people) {
            City c = p.getCity();
            if (c.getName().length() < 4) {
                shortCities.add(c);
            }
        }
      we’re going to write this:
        Set<City> shortCities = people.parallelStream()
            .map(Person::getCity)
            .filter(c -> c.getName().length() < 4)
            .collect(toSet());
  31.–34. Visualising Sequential Streams [diagram, animated over four slides: values x0…x3 flow one at a time from the Source through the intermediate operations (Map, Filter – some values are rejected by the filter) into the terminal operation (Reduction)] “Values in Motion”
  35.–45. Simple Collector – toSet() [diagram, animated over eleven slides: people.stream().collect(Collectors.toSet()) feeds a Stream<Person> into Collectors.toSet(), a Collector<Person,?,Set<Person>>, which accumulates the elements bill, jon and amy one at a time into the resulting Set<Person> { bill, jon, amy }]
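  Under the hood a collector is just a supplier, an accumulator and a combiner packaged together; a rough equivalent of toSet(), built with Collector.of (a sketch, not the JDK’s implementation; Person and people are the assumed types from the earlier slides):

      import java.util.HashSet;
      import java.util.Set;
      import java.util.stream.Collector;

      Collector<Person, Set<Person>, Set<Person>> toSetLike =
          Collector.of(
              HashSet::new,                                           // supplier: new empty container
              Set::add,                                               // accumulator: add one element
              (left, right) -> { left.addAll(right); return left; },  // combiner: merge two containers
              Collector.Characteristics.UNORDERED);

      Set<Person> result = people.stream().collect(toSetLike);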
  46.–47. Classical Reduction [diagram: the values 1…7 are combined pairwise with + and the identity 0, forming a tree whose branches can be evaluated in parallel] intStream.reduce(0, (a, b) -> a + b)
  48.–49. Mutable Reduction [diagram: a Supplier ()->[] creates empty containers, the accumulator a adds the elements e0…e7 to them, and the combiner c merges the partial containers]
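  The same supplier/accumulator/combiner triple appears directly in the three-argument form of collect; a small illustrative example (the data here is made up):

      import java.util.ArrayList;
      import java.util.List;
      import java.util.stream.IntStream;

      // Mutable reduction: the supplier creates the container, the accumulator adds
      // one element to it, and the combiner merges two partial containers (only
      // needed when the stream runs in parallel).
      List<Integer> squares = IntStream.rangeClosed(0, 7)
          .map(i -> i * i)
          .boxed()
          .parallel()
          .collect(ArrayList::new,        // supplier
                   ArrayList::add,        // accumulator
                   ArrayList::addAll);    // combiner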
  50. 50. Agenda • Define the problem • Implement a solution • Analyse performance – find the bottleneck • Fork/Join parallelism in the real world #java8fairies
  51.–58. grep -b: Collector combiner and solution [diagram, animated over eight slides, using the four lines of rubai51.txt: each chunk of the file is reduced to a list of (offset, line) pairs, with offsets relative to the start of that chunk, plus the chunk’s total byte length; the combiner concatenates two such lists, adding the left-hand chunk’s length (here 80) to every offset in the right-hand list, producing the file-relative offsets 0, 44, 80 and 122]
  59. 59. grep -b: Collector accumulator [diagram: the Supplier creates an empty partial result; the accumulator appends each incoming line, recording as its offset the running total of the byte lengths of the lines already seen in that partial result – “The moving … writ,” at offset 0 (length 44), then “Moves on: … Wit” at offset 44 (length 36)]
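  A compact sketch of that collector (the slides show only the data flow; the class names, the wiring and the assumption of a one-byte '\n' terminator are illustrative): each partial result carries its matched lines, with offsets relative to the start of its own chunk, plus the total byte length of the chunk, so the combiner can shift the right-hand offsets.

      import java.nio.charset.StandardCharsets;
      import java.util.ArrayList;
      import java.util.List;
      import java.util.regex.Pattern;
      import java.util.stream.Collector;

      class LineWithOffset {
          final long offset;                 // byte offset of the line
          final String line;
          LineWithOffset(long offset, String line) { this.offset = offset; this.line = line; }
      }

      // Partial result for one chunk of the file.
      class GrepBResult {
          final List<LineWithOffset> matches = new ArrayList<>();
          long length = 0;                   // total bytes consumed by this chunk so far

          // accumulator: record the line's offset within this chunk, then advance
          void add(String line, Pattern pattern) {
              if (pattern.matcher(line).find()) {
                  matches.add(new LineWithOffset(length, line));
              }
              length += line.getBytes(StandardCharsets.UTF_8).length + 1;   // assumes '\n'
          }

          // combiner: shift the right-hand offsets by this chunk's length
          GrepBResult combine(GrepBResult right) {
              for (LineWithOffset m : right.matches) {
                  matches.add(new LineWithOffset(m.offset + length, m.line));
              }
              length += right.length;
              return this;
          }
      }

      // Wiring it up; pattern is the grep pattern, captured by the accumulator lambda:
      Pattern pattern = Pattern.compile("W.*t");
      Collector<String, GrepBResult, List<LineWithOffset>> grepB =
          Collector.of(GrepBResult::new,
                       (result, line) -> result.add(line, pattern),
                       GrepBResult::combine,
                       result -> result.matches);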
  60. 60. Agenda • Define the problem • Implement a solution • Analyse performance – find the bottleneck • Fork/Join parallelism in the real world #java8fairies
  61. 61. Why Shouldn’t We Optimize Code? Because we don’t have a problem?
  62. 62. Why Shouldn’t We Optimize Code? Because we don’t have a problem - No performance target!
  63. 63. Why Shouldn’t We Optimize Code?
  64. 64. Because we don’t have a problem - No performance target! Else there is a problem, but not in our process Why Shouldn’t We Optimize Code?
  65. 65. Because we don’t have a problem - No performance target! Else there is a problem, but not in our process Why Shouldn’t We Optimize Code? Demo…
  66. 66. Because we don’t have a problem - No performance target! Else there is a problem, but not in our process - The OS is struggling! Why Shouldn’t We Optimize Code?
  67. 67. Because we don’t have a problem - No performance target! Else there is a problem, but not in our process - The OS is struggling! Else there’s a problem in our process, but not in the code Why Shouldn’t We Optimize Code?
  68. 68. Because we don’t have a problem - No performance target! Else there is a problem, but not in our process - The OS is struggling! Else there’s a problem in our process, but not in the code Why Shouldn’t We Optimize Code? Demo…
  69. 69. Because we don’t have a problem - No performance target! Else there is a problem, but not in our process - The OS is struggling! Else there’s a problem in our process, but not in the code - GC is using all the cycles! Why Shouldn’t We Optimize Code?
  70. 70. Because we don’t have a problem - No performance target! Else there is a problem, but not in our process - The OS is struggling! Else there’s a problem in our process, but not in the code - GC is using all the cycles! Now we can consider the code Else there’s a problem in the code… somewhere - now we can go and profile it
  71. 71. Because we don’t have a problem - No performance target! Else there is a problem, but not in our process - The OS is struggling! Else there’s a problem in our process, but not in the code - GC is using all the cycles! Now we can consider the code Else there’s a problem in the code… somewhere - now we can go and profile it Demo…
  72. 72. So, streaming IO is slow
  73. 73. So, streaming IO is slow If only there was a way of getting all that data into memory…
  74. 74. So, streaming IO is slow If only there was a way of getting all that data into memory… Demo…
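  One way to get it there is to memory-map the file (a sketch using plain NIO; the demo’s actual code is not reproduced on the slides):

      import java.io.IOException;
      import java.nio.MappedByteBuffer;
      import java.nio.channels.FileChannel;
      import java.nio.file.Path;
      import java.nio.file.StandardOpenOption;

      // Map the whole file into memory instead of streaming it from disk.
      static MappedByteBuffer mapFile(Path path) throws IOException {
          try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
              // The mapping remains valid after the channel is closed.
              return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
          }
      }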
  75. 75. Agenda • Define the problem • Implement a solution • Analyse performance – find the bottleneck • Fork/Join parallelism in the real world #java8fairies
  76. 76. Agenda • Define the problem • Implement a solution • Analyse performance – find the bottleneck OR – have a bright idea! • Fork/Join parallelism in the real world #java8fairies
  77. 77. Intel Xeon E5 2600 10-core
  78. 78. Parallelism – Why? The Free Lunch Is Over http://www.gotw.ca/publications/concurrency-ddj.htm
  79. 79. The Transistor, Blessed at Birth
  80. 80. What’s Happened?
  81. 81. What’s Happened? Physical limitations of the technology:
  82. 82. What’s Happened? Physical limitations of the technology: • signal leakage
  83. 83. What’s Happened? Physical limitations of the technology: • signal leakage • heat dissipation
  84. 84. What’s Happened? Physical limitations of the technology: • signal leakage • heat dissipation • speed of light!
  85. 85. What’s Happened? Physical limitations of the technology: • signal leakage • heat dissipation • speed of light! – 30cm = 1 light-nanosecond
  86. 86. What’s Happened? Physical limitations of the technology: • signal leakage • heat dissipation • speed of light! – 30cm = 1 light-nanosecond We’re not going to get faster cores,
  87. 87. What’s Happened? Physical limitations of the technology: • signal leakage • heat dissipation • speed of light! – 30cm = 1 light-nanosecond We’re not going to get faster cores, we’re going to get more cores!
  88.–91. Visualizing Parallel Streams [diagram, animated over four slides: the source is split into chunks, each chunk flows through its own copy of the intermediate operations on a separate thread, and the partial results are combined at the end]
  92. 92. A Parallel Solution for grep -b • Parallel streams need splittable sources • Streaming I/O makes you subject to Amdahl’s Law:
  93. 93. A Parallel Solution for grep -b • Parallel streams need splittable sources • Streaming I/O makes you subject to Amdahl’s Law:
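  For reference, Amdahl’s Law: if a fraction p of the work can be run in parallel on n cores, the best possible speedup is 1 / ((1 - p) + p/n). With streaming I/O the serial read makes up much of the total time, so p is small and extra cores buy very little.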
  94. 94. Blessing – and Curse – on the Transistor
  95.–103. Stream Sources for Parallel Processing Implemented by a Spliterator [diagram, animated over nine slides: trySplit() repeatedly divides the source into smaller pieces, producing chunks that can be handed to separate threads]
  104.–109. LineSpliterator [diagram, animated over six slides: the file, held in a MappedByteBuffer, contains the four lines of rubai51.txt separated by \n characters; trySplit() takes the midpoint (mid) of the current spliterator’s coverage, moves it forward to the next \n, and splits there, so both the new spliterator’s coverage and the remaining coverage always contain whole lines] Demo…
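  A sketch of what such a LineSpliterator might look like (the talk’s actual implementation is not on the slides; names and details here are illustrative, and single-byte characters with '\n' terminators are assumed):

      import java.nio.ByteBuffer;
      import java.nio.charset.StandardCharsets;
      import java.util.Spliterator;
      import java.util.function.Consumer;

      // Covers buffer positions [lo, hi); always splits on a '\n' so that each
      // piece contains only whole lines.
      class LineSpliterator implements Spliterator<String> {
          private final ByteBuffer buffer;
          private int lo;                     // start of the next unread line
          private final int hi;               // one past the last covered byte

          LineSpliterator(ByteBuffer buffer, int lo, int hi) {
              this.buffer = buffer;
              this.lo = lo;
              this.hi = hi;
          }

          @Override
          public boolean tryAdvance(Consumer<? super String> action) {
              if (lo >= hi) return false;
              int end = lo;
              while (end < hi && buffer.get(end) != '\n') end++;
              byte[] bytes = new byte[end - lo];
              for (int i = lo; i < end; i++) bytes[i - lo] = buffer.get(i);
              action.accept(new String(bytes, StandardCharsets.UTF_8));
              lo = end + 1;                   // step past the '\n'
              return true;
          }

          @Override
          public Spliterator<String> trySplit() {
              int mid = (lo + hi) >>> 1;
              while (mid < hi && buffer.get(mid) != '\n') mid++;   // next line boundary
              if (mid >= hi - 1) return null;                      // too small to split
              Spliterator<String> prefix = new LineSpliterator(buffer, lo, mid + 1);
              lo = mid + 1;                   // this spliterator keeps the tail
              return prefix;
          }

          @Override
          public long estimateSize() { return hi - lo; }           // in bytes, a fair estimate

          @Override
          public int characteristics() { return ORDERED | NONNULL | IMMUTABLE; }
      }

      // Usage, together with the memory-mapping sketch shown earlier:
      //   Stream<String> lines = StreamSupport.stream(
      //       new LineSpliterator(buffer, 0, buffer.limit()), true);   // true = parallel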
  110. 110. Parallelizing grep -b • Splitting action of LineSpliterator is O(log n) • Collector no longer needs to compute index • Result (relatively independent of data size): - sequential stream ~2x as fast as iterative solution - parallel stream >2.5x as fast as sequential stream - on 4 hardware threads
  111. 111. Parallelizing Streams Parallel-unfriendly intermediate operations: • stateful ones – need to store some or all of the stream data in memory, e.g. sorted() • those requiring ordering, e.g. limit()
  112. 112. Collectors Cost Extra! Depends on the performance of accumulator and combiner functions • toList(), toSet(), toCollection() – performance normally dominated by accumulator • but allow for the overhead of managing multithreaded access to non-threadsafe containers for the combine operation • toMap(), toConcurrentMap() – map merging is slow. Resizing maps, especially concurrent maps, is very expensive. Whenever possible, presize all data structures, maps in particular.
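  For example, toMap has an overload that takes a map supplier, which is where presizing goes (people and Person are the assumed types from the earlier slides, getName is illustrative; the arithmetic just allows for HashMap’s default 0.75 load factor):

      import java.util.HashMap;
      import java.util.Map;
      import static java.util.stream.Collectors.toMap;

      int expected = people.size();
      Map<String, Person> byName = people.stream()
          .collect(toMap(Person::getName,
                         p -> p,
                         (first, second) -> first,                              // merge function
                         () -> new HashMap<>((int) (expected / 0.75f) + 1)));   // presized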
  113. 113. Agenda • Define the problem • Implement a solution • Analyse performance – find the bottleneck • Fork/Join parallelism in the real world #java8fairies
  114. 114. Simulated Server Environment
      threadPool.execute(() -> {
          try {
              double value = logEntries.parallelStream()
                  .map(applicationStoppedTimePattern::matcher)
                  .filter(Matcher::find)
                  .map(matcher -> matcher.group(2))
                  .mapToDouble(Double::parseDouble)
                  .summaryStatistics().getSum();
          } catch (Exception ex) {}
      });
  115. 115. How Does It Perform? • Total run time: 261.7 seconds • Max: 39.2 secs, Min: 9.2 secs, Median: 22.0 secs
  116. 116. Tragedy of the Commons Garrett Hardin, ecologist (1968): Imagine the grazing of animals on a common ground. Each flock owner gains if they add to their own flock. But every animal added to the total degrades the commons a small amount.
  117. 117. Tragedy of the Commons
  118. 118. Tragedy of the Commons You have a finite amount of hardware – it might be in your best interest to grab it all – but if everyone behaves the same way…
  119. 119. Tragedy of the Commons You have a finite amount of hardware – it might be in your best interest to grab it all – but if everyone behaves the same way… Be a good neighbor
  120. 120. Tragedy of the Commons You have a finite amount of hardware – it might be in your best interest to grab it all – but if everyone behaves the same way… With many parallelStream() operations running concurrently, performance is limited by the size of the common thread pool and the number of cores you have Be a good neighbor
  121. 121. Configuring Common Pool Size of the common ForkJoinPool is Runtime.getRuntime().availableProcessors() - 1. It can be configured with:
      -Djava.util.concurrent.ForkJoinPool.common.parallelism=N
      -Djava.util.concurrent.ForkJoinPool.common.threadFactory
      -Djava.util.concurrent.ForkJoinPool.common.exceptionHandler
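  The parallelism flag is just a system property, so it can also be set programmatically, provided that happens before anything touches the common pool (a small sketch):

      // Must run before the common pool is first used, e.g. at the top of main():
      System.setProperty("java.util.concurrent.ForkJoinPool.common.parallelism", "4");

      // Check what the common pool actually got:
      System.out.println(java.util.concurrent.ForkJoinPool.commonPool().getParallelism());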
  122. 122. Fork-Join Support for Fork-Join added in Java 7 • difficult coding idiom to master Used internally by parallel streams • uses a spliterator to segment the stream • each stream is processed by a ForkJoinWorkerThread How fork-join works and performs is important to latency
  123. 123. ForkJoinPool invoke ForkJoinPool.invoke(ForkJoinTask) uses the submitting thread as a worker • If 100 threads all call invoke(), we would have 100+ ForkJoin worker threads exhausting the limiting resource, e.g. CPUs, IO, etc.
  124. 124. ForkJoinPool submit/get ForkJoinPool.submit(Callable).get() suspends the submitting thread • If 100 threads all call submit(), the work queue can become very long, thus adding latency
  125. 125. Fork-Join Performance Fork Join comes with significant overhead • each chunk of work must be large enough to amortize the overhead
  126. 126. C/P/N/Q Performance Model C - number of submitters P - number of CPUs N - number of elements Q - cost of the operation
  127. 127. When to go Parallel The workload of the intermediate operations must be great enough to outweigh the overheads (~100µs): – initializing the fork/join framework – splitting – concurrent collection Often quoted as N x Q > 10,000, where N = size of data set, Q = processing cost per element
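  As a rough worked example of that trade-off: 10,000 elements at about 1µs of processing each is ~10ms of useful work, which easily covers the ~100µs of parallel overhead; 10,000 elements at 10ns each is only ~100µs of work, so the overhead swallows any gain.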
  128. 128. Kernel Times CPU will not be the limiting factor when • CPU is not saturated • kernel times exceed 10% of user time More threads will decrease performance • predicted by Little’s Law
  129. 129. Common Thread Pool Fork-Join by default uses a common thread pool • default number of worker threads == number of logical cores - 1 • Always contains at least one thread Performance is tied to whichever you run out of first • availability of the constraining resource • number of ForkJoinWorkerThreads/hardware threads
  130. 130. Everyone is working together to get it done! All Hands on Deck!!!
  131. 131. Little’s Law Fork-Join is a work queue • work queue behavior is typically modeled using Little’s Law Number of tasks in a system equals the arrival rate times the amount of time it takes to clear an item A task is submitted every 500ms, i.e. 2 per second; if each takes 2.8 seconds to clear, number of tasks in the system = 2/sec * 2.8 seconds = 5.6 tasks
  132. 132. Components of Latency Latency is time from stimulus to result • internally, latency consists of active and dead time Reducing dead time assumes you • can find it and are able to fill in with useful work
  133. 133. From Previous Example if there is available hardware capacity then make the pool bigger else add capacity or tune to reduce strength of the dependency
  134. 134. ForkJoinPool Observability • In an application where you have many parallel stream operations all running concurrently, performance will be affected by the size of the common thread pool • too small can starve threads of needed resources • too big can cause threads to thrash on contended resources ForkJoinPool comes with no visibility • no metrics to help us tune • instrument ForkJoinTask.invoke() • gather measures that can be fed into Little’s Law
  135. 135. Instrumenting ForkJoinPool • Collect • service times • time submitted to time returned • inter-arrival times
      public final V invoke() {
          ForkJoinPool.common.getMonitor().submitTask(this);
          int s;
          if ((s = doInvoke() & DONE_MASK) != NORMAL)
              reportException(s);
          ForkJoinPool.common.getMonitor().retireTask(this);
          return getRawResult();
      }
  136. 136. Performance Submit log parsing to our own ForkJoinPool:
      new ForkJoinPool(16).submit(() -> ……… ).get()
      new ForkJoinPool(8).submit(() -> ……… ).get()
      new ForkJoinPool(4).submit(() -> ……… ).get()
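  Spelled out, with the pipeline from the Simulated Server Environment slide dropped into the task (pool size and error handling here are illustrative), the pattern relies on the widely used, though undocumented, behaviour that a parallel stream started from inside a ForkJoinPool task runs in that pool rather than in the common pool:

      import java.util.concurrent.ForkJoinPool;
      import java.util.regex.Matcher;

      ForkJoinPool privatePool = new ForkJoinPool(8);          // size is illustrative
      try {
          double value = privatePool.submit(() ->
                  logEntries.parallelStream()
                      .map(applicationStoppedTimePattern::matcher)
                      .filter(Matcher::find)
                      .map(matcher -> matcher.group(2))
                      .mapToDouble(Double::parseDouble)
                      .summaryStatistics().getSum()
          ).get();                                             // blocks the submitting thread
          System.out.println(value);
      } catch (Exception e) {
          throw new RuntimeException(e);
      } finally {
          privatePool.shutdown();
      }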
  137. 137. [charts: latency distributions for pools of 16 worker threads, 8 worker threads and 4 worker threads]
  138. 138. [chart: total run times on a scale of 0 to 300,000, comparing Stream, Parallel, Flood Stream and Flood Parallel]
  139. 139. Conclusions Performance mostly doesn’t matter But if you must… • sequential streams normally beat iterative solutions • parallel streams can utilize all cores, providing – the data is efficiently splittable – the intermediate operations are sufficiently expensive and are CPU-bound – there isn’t contention for the processors #java8fairies
  140. 140. Resources #java8fairies
  141. 141. Resources http://gee.cs.oswego.edu/dl/html/StreamParallelGuidance.html #java8fairies
  142. 142. Resources http://gee.cs.oswego.edu/dl/html/StreamParallelGuidance.html http://shipilev.net/talks/devoxx-Nov2013-benchmarking.pdf #java8fairies
  143. 143. Resources http://gee.cs.oswego.edu/dl/html/StreamParallelGuidance.html http://shipilev.net/talks/devoxx-Nov2013-benchmarking.pdf http://openjdk.java.net/projects/code-tools/jmh/ #java8fairies
  144. 144. Resources http://gee.cs.oswego.edu/dl/html/StreamParallelGuidance.html http://shipilev.net/talks/devoxx-Nov2013-benchmarking.pdf http://openjdk.java.net/projects/code-tools/jmh/ #java8fairies
