Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop


Tech-talk at Bay Area Apache Spark Meetup.
Apache Spark 2.0 will ship with the second-generation Tungsten engine. Building upon ideas from modern compilers and MPP databases, and applying them to data processing queries, we have started an ongoing effort to dramatically improve Spark's performance and bring execution closer to bare metal. In this talk, we'll take a deep dive into Apache Spark 2.0's execution engine and discuss a number of architectural changes around whole-stage code generation and vectorization that have been instrumental in improving CPU efficiency and gaining performance.


1. Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Sameer Agarwal | Spark Meetup | SAP | June 30th, 2016
2. About Me
• Software Engineer at Databricks (Spark Core/SQL)
• PhD in Databases (UC Berkeley)
• Research on BlinkDB (Approximate Queries in Spark)
3-6. Hardware Trends (built up across slides 3-6)

               2010              2016
  Storage      50+ MB/s (HDD)    500+ MB/s (SSD)    10X
  Network      1 Gbps            10 Gbps            10X
  CPU          ~3 GHz            ~3 GHz             ☹
7. On the Flip Side
Spark I/O has been optimized:
• Reduce I/O by pruning input data that is not needed
• New shuffle and network implementations (2014 sort record)
Data formats have improved:
• E.g., Parquet is a "dense" columnar format
CPU is increasingly the bottleneck; the trend is expected to continue.
8. Goals of Project Tungsten
Substantially improve the memory and CPU efficiency of Spark's backend execution and push performance closer to the limits of modern hardware.
Note the focus on "execution", not "optimizer": it is very easy to pick a broadcast join that is 1000X faster than a Cartesian join, but hard to optimize a broadcast join to be an order of magnitude faster.
[Diagram: Tungsten Execution sits beneath SQL, Python, R, Streaming, DataFrames, and Advanced Analytics]
9. Phase 1 (Foundation): Memory Management, Code Generation, Cache-aware Algorithms
Phase 2 (Order-of-magnitude Faster): Whole-stage Codegen, Vectorization
10. Phase 1: Laying the Foundation
11. Summary
Perform manual memory management instead of relying on Java objects:
• Reduce memory footprint
• Eliminate garbage collection overheads
• Use java.unsafe and off-heap memory (sketched below)
Code generation for expression evaluation:
• Reduce virtual function calls and interpretation overhead
Cache-conscious sorting:
• Reduce bad memory access patterns
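To make the java.unsafe bullet concrete, here is a minimal, self-contained sketch of off-heap storage via sun.misc.Unsafe. This is my illustration of the idea, not Spark's actual memory manager, and it assumes a JDK (e.g. 8) that still exposes sun.misc.Unsafe:

  // A minimal sketch of the off-heap idea behind Tungsten Phase 1: fixed-width
  // values live in memory the garbage collector never sees, addressed manually.
  import sun.misc.Unsafe

  object OffHeapSketch {
    // Unsafe has no public constructor; read its singleton field reflectively.
    private val unsafe: Unsafe = {
      val f = classOf[Unsafe].getDeclaredField("theUnsafe")
      f.setAccessible(true)
      f.get(null).asInstanceOf[Unsafe]
    }

    def main(args: Array[String]): Unit = {
      val numLongs = 1000
      val addr = unsafe.allocateMemory(numLongs * 8L)  // raw, GC-invisible bytes
      try {
        var i = 0
        while (i < numLongs) {
          unsafe.putLong(addr + i * 8L, i.toLong)      // write slot i
          i += 1
        }
        println(unsafe.getLong(addr + 42 * 8L))        // reads back 42
      } finally {
        unsafe.freeMemory(addr)  // manual memory management: we must free it
      }
    }
  }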
12. Phase 2: Order-of-magnitude Faster
13. Going Back to the Fundamentals
It is difficult to get order-of-magnitude performance speedups with profiling techniques:
• For a 10x improvement, you would need to find hotspots that add up to 90% of the runtime and make them instantaneous
• For 100x, 99%
Instead, look bottom-up: how fast should it run?
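The 90% / 99% figures are Amdahl's law; a one-line derivation (my addition, not on the slide): if a fraction p of total runtime is made instantaneous, the overall speedup is

  \[ S = \frac{1}{1 - p} \]

so S = 10 requires p = 0.90 and S = 100 requires p = 0.99.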
14. Scan → Filter → Project → Aggregate

  select count(*) from store_sales where ss_item_sk = 1000
15. Volcano Iterator Model
Standard for 30 years: almost all databases do it. Each operator is an "iterator" that consumes records from its input operator:

  class Filter(
      child: Operator,
      predicate: Row => Boolean) extends Operator {
    def next(): Row = {
      var current = child.next()
      // Skip input rows until one satisfies the predicate;
      // a null from the child means end of input.
      while (current != null && !predicate(current)) {
        current = child.next()
      }
      current
    }
  }
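To show where the per-row virtual calls come from, here is a self-contained sketch of a full Volcano chain for the slide's query. The Operator trait and single-column Row are assumptions mirroring the slide's snippet, not Spark's classes:

  // Volcano model for: select count(*) from store_sales where ss_item_sk = 1000
  object VolcanoSketch {
    type Row = Array[Long]                 // assume a single-column row

    trait Operator { def next(): Row }     // every operator is an iterator

    class Scan(data: Array[Long]) extends Operator {
      private var i = 0
      def next(): Row =
        if (i < data.length) { i += 1; Array(data(i - 1)) } else null
    }

    class Filter(child: Operator, predicate: Row => Boolean) extends Operator {
      def next(): Row = {
        var current = child.next()
        while (current != null && !predicate(current)) {
          current = child.next()
        }
        current
      }
    }

    def main(args: Array[String]): Unit = {
      val storeSales = Array(1000L, 17L, 1000L, 99L)
      val plan: Operator = new Filter(new Scan(storeSales), row => row(0) == 1000L)
      // The "Aggregate" driver: one virtual next() call per row here, plus the
      // calls Filter and Scan make internally -- at least 3 per input row.
      var count = 0L
      var row = plan.next()
      while (row != null) { count += 1; row = plan.next() }
      println(count)  // prints 2
    }
  }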
16. Downsides of the Volcano Model
1. Too many virtual function calls
  o At least 3 calls for each row in Aggregate
2. Extensive memory access
  o Each "row" is a small segment in memory (or in L1/L2/L3 cache)
3. Can't take advantage of modern CPU features
  o SIMD, pipelining, prefetching, branch prediction, ILP, instruction cache, …
17. What if we hire a college freshman to implement this query in Java in 10 mins?

  select count(*) from store_sales where ss_item_sk = 1000

  long count = 0;
  for (long ss_item_sk : store_sales) {
    if (ss_item_sk == 1000) {
      count += 1;
    }
  }
18. Volcano model (30+ years of database research) vs. college freshman (hand-written code in 10 mins)
19. High throughput:
• Volcano: 13.95 million rows/sec
• College freshman: 125 million rows/sec
Note: end-to-end, single thread, single column, and data originated in Parquet on disk.
20. How does a student beat 30 years of research?
Volcano:
1. Many virtual function calls
2. Data in memory (or cache)
3. No loop unrolling, SIMD, pipelining
Hand-written code:
1. No virtual function calls
2. Data in CPU registers
3. Compiler loop unrolling, SIMD, pipelining
Take advantage of all the information that is known after query compilation.
21. Whole-stage Codegen
Fusing operators together so the generated code looks like hand-optimized code:
- Identify chains of operators ("stages")
- Compile each stage into a single function
- Functionality of a general-purpose execution engine; performance as if the system were hand-built just to run your query
22. Whole-stage Codegen: Planner
23. Whole-stage Codegen: Spark as a "Compiler"
The Scan → Filter → Project → Aggregate pipeline is fused into one loop:

  long count = 0;
  for (long ss_item_sk : store_sales) {
    if (ss_item_sk == 1000) {
      count += 1;
    }
  }
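As a hedged aside (not on the slide): in Spark 2.0 you can see which operators were fused into a single stage and inspect the generated Java. To the best of my knowledge both hooks below exist in 2.0, and the temp view name is a stand-in for a real store_sales table:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.execution.debug._  // adds debugCodegen() to Datasets

  val spark = SparkSession.builder().master("local[*]").appName("wsc-demo").getOrCreate()
  spark.range(0, 1000000).createOrReplaceTempView("store_sales_demo")  // stand-in table

  val df = spark.sql("select count(*) from store_sales_demo where id = 1000")
  df.explain()       // operators prefixed with '*' run inside whole-stage codegen
  df.debugCodegen()  // dumps the generated Java source for each fused stage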
24. But there are things we can't fuse
Complicated I/O:
• CSV, Parquet, ORC, …
• Sending across the network
External integrations:
• Python, R, scikit-learn, TensorFlow, etc.
• Reading cached data
25. Columnar In-memory Format
In-memory row format (one record per row):
  1  john   4.1
  2  mike   3.5
  3  sally  6.4
In-memory column format (one array per column):
  1     2     3
  john  mike  sally
  4.1   3.5   6.4
26. Why columnar?
1. More efficient: denser storage, regular data access, easier to index into. Enables vectorized processing (see the sketch below).
2. More compatible: most high-performance external systems are already columnar (numpy, TensorFlow, Parquet); zero serialization/copy to work with them.
3. Easier to extend: process encoded data, integrate with a columnar cache, etc.
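To make "vectorized processing" concrete, here is a minimal sketch using the slide's three-row table. This is my illustration, not Spark's ColumnarBatch API: the point is that the columnar loop runs over one dense primitive array, which the JIT can unroll and compile to SIMD instructions.

  object ColumnarSketch {
    // Row format: each record is a heap object holding all three fields.
    final case class RowRec(id: Int, name: String, score: Double)
    val rows = Array(RowRec(1, "john", 4.1), RowRec(2, "mike", 3.5), RowRec(3, "sally", 6.4))

    // Column format: one dense array per column.
    val ids    = Array(1, 2, 3)
    val names  = Array("john", "mike", "sally")
    val scores = Array(4.1, 3.5, 6.4)

    def main(args: Array[String]): Unit = {
      // Row-at-a-time: pointer chasing through objects for a single field.
      var sumRows = 0.0
      for (r <- rows) sumRows += r.score

      // Column-at-a-time: a tight loop over a primitive double array.
      var sumCols = 0.0
      var i = 0
      while (i < scores.length) { sumCols += scores(i); i += 1 }

      println(s"row sum = $sumRows, column sum = $sumCols")  // identical sums
    }
  }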
27. High throughput:
• Parquet: 11 million rows/sec
• Parquet vectorized: 90 million rows/sec
Note: end-to-end, single thread, single column, and data originated in Parquet on disk.
28. Putting It All Together
29. Phase 1 (Spark 1.4 - 1.6): Memory Management, Code Generation, Cache-aware Algorithms
Phase 2 (Spark 2.0+): Whole-stage Code Generation, Columnar in-Memory Support
30. Operator Benchmarks: Cost/Row (ns)
31. TPC-DS (Scale Factor 1500, 100 cores)
[Chart: query time in seconds (0-1800) for TPC-DS queries 1 through 99, comparing Spark 2.0 against Spark 1.6]
32. Status
• Being released as part of Spark 2.0
• Both whole-stage codegen and the vectorized Parquet reader are on by default (config sketch below)
• Back to profiling techniques
• Improve quality of generated code, optimize Parquet reader further
• Try it out and let us know!
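For completeness (my addition, not on the slide), the "on by default" behavior corresponds to two SQL configuration flags. To the best of my knowledge these keys exist in Spark 2.0, but verify them against your version's documentation:

  // Both default to true in Spark 2.0; set explicitly here only to make the
  // "on by default" bullet concrete. Keys assumed from Spark 2.0.
  spark.conf.set("spark.sql.codegen.wholeStage", "true")
  spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")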
33. Further Reading: http://tinyurl.com/project-tungsten
34. 2016 Apache Spark Survey
35. Spark Summit EU: Brussels, October 25-27
The CFP closes at 11:59pm on July 1st. For more information and to submit: https://spark-summit.org/eu-2016/
36. Questions?
