
Accelerating Foreign-Key Joins using Asymmetric Memory Channels


This presentation was given by Holger Pirk at the Second International Workshop on Accelerating Data Management Systems Using Modern Processor and Storage Architectures (ADMS 2011), held in conjunction with VLDB 2011 in Seattle, WA, on September 2, 2011.

Publication: http://bit.ly/za1DDF

Abstract:
Indexed Foreign-Key Joins expose a very asymmetric access pattern: the Foreign-Key Index is sequentially scanned whilst the Primary-Key table is target of many quasi-random lookups which is the dominant cost factor. To reduce the costs of the random lookups the fact-table can be (re-)partitioned at runtime to increase access locality on the dimension table, and thus limit the random memory access to inside the CPU's cache. However, this is very hard to optimize and the performance impact on recent architectures is limited because the partitioning costs consume most of the achievable join improvement.

GPGPUs on the other hand have an architecture that is well suited for this operation: a relatively slow connection to the large system memory and a very fast connection to the smaller internal device memory. We show how to accelerate Foreign-Key Joins by executing the random table lookups on the GPU's VRAM while sequentially streaming the Foreign-Key-Index through the PCI-E Bus. We also experimentally study the memory access costs on GPU and CPU to provide estimations of the benefit of this technique.


Transcript of "Accelerating Foreign-Key Joins using Asymmetric Memory Channels"

  1. Accelerating Foreign-Key Joins using Asymmetric Memory Channels
     Holger Pirk (holger@cwi.nl), Stefan Manegold (manegold@cwi.nl), Martin Kersten (mk@cwi.nl)
  2. Why?
     • Trivia: Joins are important
     • But: Many Joins are (Indexed) Foreign Key Joins
     • Important example: Data Warehouse Star Joins
       • Large Fact Table (many Gigabytes) with Foreign Keys on
       • Small Dimension Tables (hundreds of Megabytes)
       • Joins are Performed for Projection or Condition Evaluation
  3. A History of In Memory Join Processing (and incidentally of this talk)
     • Foreign-Key Joins on CPUs
       • Naïve (Indexed Hash) Foreign-Key Joins
       • Clustered Foreign-Key Joins
     • Foreign-Key Joins on GPGPUs
       • Clustered Foreign-Key Joins
       • Naïve (Asymmetric) Foreign-Key Joins
  4. Premises for this talk
     • Focus is on Indexed Foreign Key Joins
     • We assume an unclustered Foreign Key Index Column
     • i.e., Pointers to Matching Tuples of the Target Table
  5. Foreign-Key Joins on CPUs
  6. The Problem: Bandwidth Bound CPUs
     [Diagram: Core 1, Core 2, ..., Core 4; Memory Controller; Memory modules; links labeled 8.5 GB/s]
     • Bottleneck for almost all relational (Memory Resident) DBMS operations
     • Valuable Target for Optimization
  7. Naïve Foreign-Key Joins on CPUs
  8. Naïve Foreign-Key Joins on CPUs
     • Foreign-Key Index is scanned sequentially
     • Target Table is randomly accessed
  9. Naïve Foreign-Key Joins on CPUs
     • Optimization Targets:
       • Foreign-Key Index is scanned sequentially
       • Index Cache lines accessed once, i.e., not much to optimize
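The access pattern described on slides 8 and 9 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the Foreign-Key Index is an array of positions into the Target Table, and the names `naive_fk_join`, `fact_fk_index` and `dimension_names` are hypothetical.

```python
from typing import List, Sequence


def naive_fk_join(fk_index: Sequence[int], target_column: Sequence[str]) -> List[str]:
    """Project one target column through an unclustered foreign-key index.

    fk_index[i] is assumed to hold the position of the target tuple
    that matches fact row i.
    """
    result = []
    for position in fk_index:                   # sequential scan of the index
        result.append(target_column[position])  # random access into the target table
    return result


# Hypothetical data: a 4-row dimension table and an 8-row fact table.
dimension_names = ["red", "green", "blue", "black"]
fact_fk_index = [3, 0, 2, 0, 1, 3, 1, 2]
joined = naive_fk_join(fact_fk_index, dimension_names)
```

The loop reads `fk_index` front to back, while the lookups into `target_column` jump around; that second access is the one the following slides try to make cheaper.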
  10. Naïve Foreign-Key Joins on CPUs
     • Large Target Tables cause heavy Cache Thrashing
     • Assume a (theoretical) 1-Slot Cache with 2-Value Cache Lines
     • How many Cache Misses do you get?
       • 80% of them are Capacity Misses
     • Target Table Lookups are rewarding optimization Targets
     [Figure: example lookup sequence; annotation: 6]
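The thought experiment on slide 10 (a 1-slot cache with 2-value cache lines) can be replayed with a small simulator. This is a sketch under stated assumptions: the lookup sequence below is made up for illustration and is not the sequence from the slide's figure, and `count_cache_misses` is a hypothetical helper that models LRU replacement.

```python
def count_cache_misses(positions, values_per_line=2, slots=1):
    """Count misses of a tiny LRU cache for a sequence of target-table positions."""
    cached_lines = []  # least recently used line first, most recent last
    misses = 0
    for position in positions:
        line = position // values_per_line  # cache line that holds this value
        if line in cached_lines:
            cached_lines.remove(line)       # hit: refresh its LRU position below
        else:
            misses += 1
            if len(cached_lines) == slots:
                cached_lines.pop(0)         # evict the least recently used line
        cached_lines.append(line)
    return misses


# Hypothetical lookups into a 4-value target table (2 cache lines).
unclustered = [0, 2, 1, 3, 0, 2, 1, 3]
clustered = sorted(unclustered)

unclustered_misses = count_cache_misses(unclustered)
clustered_misses = count_cache_misses(clustered)
```

With this made-up sequence every unclustered lookup evicts the line the next lookup needs, while the sorted sequence only misses once per cache line. The exact ratio depends on the sequence chosen.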
  11. Clustered Foreign-Key Joins on CPUs
  12. Clustered Foreign-Key Joins on CPUs
     • Require a Value-partitioned Foreign-Key Index
     • Foreign-Key Index Partitions are scanned
     • Target Table Random Access Locality is improved
  13. Clustered Foreign-Key Joins on CPUs
     • Require a Value-partitioned Foreign-Key Index
     • Foreign-Key Index Partitions are scanned
     • Target Table Random Access Locality is improved
     [Figure: partitioned index, shown as bracketed groups]
  14. Clustered Foreign-Key Joins on CPUs
     • What is the number of Target Table Cache Misses?
     • So Cache behavior is greatly improved
     • Also: (disjoint) Partitions can be processed in parallel
       • Parallelism is generally good for GPUs
     • But: The Index has to be Value (e.g., Radix)-partitioned first
     [Figure: partitioned index; annotations: 1, 1]
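The clustering step from slides 12 to 14 can be sketched as a single-pass partitioning on the most significant bits of each index value, followed by a join that processes one partition at a time. This is an illustrative simplification, not the radix join measured in the talk: the single pass, the choice of the top bits, and the names `radix_partition` and `clustered_fk_join` are assumptions.

```python
def radix_partition(fk_index, radix_bits, key_bits):
    """Split (row_id, position) pairs into 2**radix_bits partitions.

    Partitioning on the top bits of the position means each partition
    only touches one contiguous range of the target table.
    """
    shift = key_bits - radix_bits
    partitions = [[] for _ in range(1 << radix_bits)]
    for row_id, position in enumerate(fk_index):
        partitions[position >> shift].append((row_id, position))
    return partitions


def clustered_fk_join(fk_index, target_column, radix_bits, key_bits):
    """Join partition by partition; results keep the original fact row order."""
    result = [None] * len(fk_index)
    for partition in radix_partition(fk_index, radix_bits, key_bits):
        for row_id, position in partition:  # lookups stay within one target range
            result[row_id] = target_column[position]
    return result


# Hypothetical data: 8 target values (3-bit positions), split into 2 partitions.
target = list("abcdefgh")
index = [5, 0, 7, 2, 4, 1, 6, 3]
parts = radix_partition(index, radix_bits=1, key_bits=3)
clustered = clustered_fk_join(index, target, radix_bits=1, key_bits=3)
```

The partitions are disjoint, so in principle each one could be handed to a separate worker. The extra pass over the index in `radix_partition` is the clustering cost that the measurements on the following slides weigh against the improved locality.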
  15. Clustered Foreign-Key Joins on CPUs
     [Chart: Processing Time in seconds (0 to 6.0) vs. Number of Partitions in Clustering Step (1 to 10000), Target Table Size: 64M. Series: Unpartitioned Join (CPU), Partition (CPU), Join (CPU), Radix Join (CPU). Annotation: about 30 % (consistent with [Blanas11])]
  16. Clustered Foreign-Key Joins on CPUs
     [Chart: Partition time in Seconds (0 to 25.00), CPU vs. GPU, for 256 Clusters and 65536 Clusters. Values shown: 20.512, 11.040, 6.492860, 1.695271]
  17. Foreign-Key Joins on GPUs
  18. GPU Architecture Overview
     [Diagram: Processing Unit 1 ... Processing Unit 16, each with an Instruction Dispatcher, Cores and Local Memory; shared Global Memory; annotation: Expensive Atomic Operations]
  19. The Problem: Bandwidth Bound GPUs
     [Diagram: Core 1, Core 2, ..., Core 4; I/O Hub; Memory Controller; QPI (25.6 GB/s); GPU attached via PCI-E x16 (8 GB/s); Graphics Memory (177.4 GB/s); Memory (8.5 GB/s)]
     • PCI-E Bottleneck is a big Problem for GPUs
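A back-of-the-envelope calculation makes the asymmetry on slide 19 concrete. The bandwidth figures are the peak values printed on the slide, read here as 8 GB/s for PCI-E x16 and 177.4 GB/s for graphics memory; the 4 GB index size is a hypothetical value, and random lookups would in practice achieve less than peak bandwidth.

```python
# Peak bandwidths as printed on the slide (GB/s).
PCIE_GB_PER_S = 8.0          # PCI-E x16 link between host and GPU
GPU_MEMORY_GB_PER_S = 177.4  # GPU to graphics memory
CPU_MEMORY_GB_PER_S = 8.5    # CPU memory link


def transfer_seconds(gigabytes, gb_per_s):
    """Lower bound on the time to move `gigabytes` over a link at peak bandwidth."""
    return gigabytes / gb_per_s


index_gb = 4.0  # hypothetical size of the foreign-key index
pcie_time = transfer_seconds(index_gb, PCIE_GB_PER_S)
bandwidth_ratio = GPU_MEMORY_GB_PER_S / PCIE_GB_PER_S

print(f"Streaming {index_gb} GB over PCI-E takes at least {pcie_time:.2f} s")
print(f"Graphics memory is about {bandwidth_ratio:.1f}x faster than PCI-E")
```

The ratio of roughly 22 between the two channels is the asymmetry that the later slides exploit: the sequential stream goes over the slow link, the random accesses stay on the fast one.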
  20. Clustered Foreign-Key Joins on GPUs
  21. Clustered Foreign-Key Joins on GPUs
     • Port of the Radix-Join from CPU to GPU [He08]
     • Improved Cache Performance is Important on GPUs too
     • Allows efficient parallel processing of Partitions
  22. Clustered Foreign-Key Joins on GPUs
     [Chart: Processing Time in seconds (0 to 6.0) vs. Number of Partitions in Clustering Step (1 to 10000). Series: Join (GPU), Partition (CPU), Radix Join (CPU/GPU). Annotations: about 11 %, about 70 %]
  23. Naïve Foreign-Key Joins on GPUs
  24. Naïve Foreign-Key Joins on GPUs
     • Primary Bottleneck is PCI-E, not GPU Memory
     • Idea: Accelerate FK Joins by exploiting the Strength of GPUs:
       • Random Access to target table
     • Limit the PCI-E traffic to a minimum:
       • Stream only the Index
  25. Asymmetric Foreign-Key Joins on GPUs
     • Primary Bottleneck is PCI-E, not GPU Memory
     • Idea: Accelerate FK Joins by exploiting the Strength of GPUs:
       • Random Access to target table
     • Limit the PCI-E traffic to a minimum:
       • Stream only the Index
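The idea on slides 24 and 25 can be sketched as a chunked pipeline: the Target Table stays resident on the device, and only the Foreign-Key Index is streamed to it. The sketch below is plain Python running on the CPU, so it only models the data flow. The chunk loop stands in for PCI-E transfers, the inner lookup stands in for a GPU kernel, and the names `stream_chunks`, `asymmetric_fk_join` and `device_target` are hypothetical.

```python
def stream_chunks(fk_index, chunk_size):
    """Yield the index in fixed-size chunks, as if copied over the slow link."""
    for start in range(0, len(fk_index), chunk_size):
        yield fk_index[start:start + chunk_size]


def asymmetric_fk_join(fk_index, device_target, chunk_size):
    """Stream the index sequentially; do the random lookups where the target lives."""
    result = []
    for chunk in stream_chunks(fk_index, chunk_size):          # sequential, slow channel
        result.extend(device_target[pos] for pos in chunk)     # random, fast channel
    return result


# Hypothetical data: the target table is assumed to fit in device memory.
device_target = list("abcdefgh")
index = [5, 0, 7, 2, 4, 1, 6, 3]
chunks = list(stream_chunks(index, chunk_size=3))
joined = asymmetric_fk_join(index, device_target, chunk_size=3)
```

No partitioning pass appears anywhere in this pipeline: the index crosses the slow channel exactly once and in its original order, which is the design point the conclusion slide summarizes.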
  26. Asymmetric Foreign-Key Joins on GPUs
     [Chart: Processing Time in seconds (0 to 6.0) vs. Number of Partitions in Clustering Step (1 to 10000). Series: Join (GPU), Partition (CPU), Radix Join (CPU/GPU)]
  27. Asymmetric Foreign-Key Joins on GPUs
     [Chart: Processing Time in seconds (0 to 6.0) vs. Number of Partitions in Clustering Step (1 to 10000). Series: Unpartitioned Join (GPU), Join (GPU), Partition (CPU), Radix Join (CPU/GPU). Annotation: about factor 3.5]
  28. Asymmetric Foreign-Key Joins on GPUs
     • Message: Clustering improves performance on GPUs
     • But: Cluster Costs don’t pay off in Cache Friendliness
     • But what about parallel processing of Partitions?
  29. But what about Parallelization
     [Chart: Processing time in Seconds (0 to 1.00) vs. Number of Utilized Cores per Processor (Work Group Size) (1 to 1000). Series: Total Costs, Data transfer Costs. Annotation: about 41 % (not enough)]
  30. Foreign-Key Joins on GPUs vs CPUs
     [Chart: Processing Time in Seconds (0 to 1.100) vs. Dimension Table Size (1K to 100000K). Series: Intel i7 860, ATI Radeon HD 5850, PCI-E Data Transfer Costs]
  31. Foreign-Key Joins on GPUs vs CPUs
     [Chart: Processing Time in Seconds (0 to 1.100) vs. Dimension Table Size (1K to 100000K). Series: Intel i7 860, ATI Radeon HD 5850 (processing only), PCI-E Data Transfer Costs]
  32. Conclusion
     • Unclustered FK-Joins Cause Heavy Cache Thrashing
     • Thrashing can be offloaded to fast GPU memory
     • PCI-E Bandwidth can be saturated without Clustering
     • Naïve FK-Joins on GPUs outperform Clustered Joins on CPUs
  33. Thank you for Listening
  34. Questions? Feedback?
