This talk will present SQream's journey to building an analytics data warehouse powered by GPUs. SQream DB is an SQL data warehouse designed for larger-than-main-memory datasets (up to petabytes). It is an on-disk database that combines novel ideas and algorithms to rapidly analyze trillions of rows with the help of high-throughput GPUs. We will explore some of SQream's ideas and approaches to developing its analytics database, from simple prototypes and tech demos to a fully functional data warehouse product containing the most important features for enterprise deployment. We will also describe the challenges of working with exotic hardware like GPUs, and the choices that had to be made to combine CPU and GPU capabilities to achieve industry-leading performance, complete with real-world use case comparisons.
As part of this discussion, we will also share some of the real issues that were discovered, and the engineering decisions that led to the creation of SQream DB’s high-speed columnar storage engine, designed specifically to take advantage of streaming architectures like GPUs.
4. FAST TO GET LOTS OF DATA IN
• Use GPU for loading
• 900 GB/s Memory Bandwidth
• Compress all the data
• Collect metadata
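One common use of metadata collected at load time (consistent with "SQream DB reads less data from disk" later in the deck, though the exact scheme is not spelled out here) is chunk-level min/max values that let the engine skip chunks that cannot match a query predicate. A minimal CPU sketch with illustrative names, not SQream DB's actual API:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical chunk metadata: min/max collected once at load time.
struct ChunkMeta {
    long long min_val;
    long long max_val;
};

// Collect min/max for each fixed-size chunk of a column.
std::vector<ChunkMeta> collect_meta(const std::vector<long long>& col,
                                    std::size_t chunk_rows) {
    std::vector<ChunkMeta> metas;
    for (std::size_t i = 0; i < col.size(); i += chunk_rows) {
        std::size_t end = std::min(i + chunk_rows, col.size());
        auto [lo, hi] = std::minmax_element(col.begin() + i, col.begin() + end);
        metas.push_back({*lo, *hi});
    }
    return metas;
}

// For a predicate "value > threshold", return indexes of chunks that
// might contain matches; the rest never need to be read from disk.
std::vector<std::size_t> chunks_to_scan(const std::vector<ChunkMeta>& metas,
                                        long long threshold) {
    std::vector<std::size_t> keep;
    for (std::size_t i = 0; i < metas.size(); ++i)
        if (metas[i].max_val > threshold) keep.push_back(i);
    return keep;
}
```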
5. FAST TO GET LOTS OF DATA OUT
• Access with easy-to-use SQL
• Support standards like ODBC and JDBC
• 900 GB/s Memory Bandwidth for SQL operations
• Access raw data directly, without cubes, indexes
• SQream DB reads less data from disk, with compression
7. ARE GPUS INTERESTING FOR RUNNING SQL?
• Can they run SQL?
• Can they run SQL faster?
– If a qualified yes, in what situations?
• Are there other issues to consider?
8. CAN GPUS RUN SQL?
Example SQL → physical operator → implementation:
• select a+b, c * 5 from t → select (a.k.a. project/extend/rename) → thrust::transform
• select a, count(*), sum(b), avg(b) from t group by a → stream aggregate → thrust::reduce_by_key
• select a, b from t where a > 0.5 → filter → thrust::remove_if
• select distinct a from t → stream distinct → thrust::unique
• select a, b, c, d from t order by a, b → sort → thrust::sort
• select * from t union all select * from u → union all → (none; the streams are simply concatenated)
• select * from t inner join u using (a) → sort merge join (SMJ) → simple implementation: thrust::upper_bound, thrust::lower_bound, unnest, gather
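To make the mapping concrete, here is a CPU sketch of three rows of the table above, using the std:: counterparts of the Thrust primitives (thrust::transform, thrust::remove_if, and thrust::unique mirror std::transform, std::remove_if, and std::unique, just running over device memory). This is an illustration of the mapping, not SQream DB's actual code:

```cpp
#include <algorithm>
#include <vector>

// "select a+b from t": projection as an element-wise transform.
std::vector<double> project_a_plus_b(const std::vector<double>& a,
                                     const std::vector<double>& b) {
    std::vector<double> out(a.size());
    std::transform(a.begin(), a.end(), b.begin(), out.begin(),
                   [](double x, double y) { return x + y; });
    return out;
}

// "select a from t where a > 0.5": filter by compacting away rows that
// fail the predicate (remove_if removes elements matching the functor).
std::vector<double> filter_gt(std::vector<double> a, double threshold) {
    a.erase(std::remove_if(a.begin(), a.end(),
                           [=](double x) { return x <= threshold; }),
            a.end());
    return a;
}

// "select distinct a from t" on a sorted column: stream distinct via unique.
std::vector<int> stream_distinct(std::vector<int> sorted_a) {
    sorted_a.erase(std::unique(sorted_a.begin(), sorted_a.end()),
                   sorted_a.end());
    return sorted_a;
}
```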
9. MARKETING HURDLES
• The PCIe bottleneck means it will never work
• Columnar databases can't do joins
• GPUs can't accelerate SQL operations
• No-one will put a GPU in a server
• GPUs are not actually faster than CPUs
• A startup cannot make a production ready SQL DBMS
10. OTHER ISSUES
• Can you make a convincing demo?
• Can you turn it into a real product?
• Can you put GPUs in a data centre?
• Are GPUs a safe bet in the medium/long term?
12. EARLY RESEARCH
• MonetDB/X100 talk: youtu.be/yrLd-3lnZ58
• Relational Joins on Graphics Processors: www.cse.ust.hk/catalac/papers/gpujoin_sigmod08.pdf
• Relational Query Co-Processing on Graphics Processors: dl.acm.org/citation.cfm?id=1620588
• Several Daniel Abadi papers: www.cs.umd.edu/~abadi/
13. THE EARLY SQREAM DB PROTOTYPES
• Original brief: OpenCL + Erlang + Haskell streaming IoT = World Domination!
• Generate Thrust code at query time
• SQL server plugin
• A real (but simple) DBMS with storage
14. OUR FIRST DBMS
• Run on data on disk
• Create and drop table
• Insert, insert select (and truncate)
• A wide range of queries:
e.g. select lists, joins, where, aggregates, order by, distinct
• Lots of external algorithms
15. WHY NOT POSTGRES?
Some downsides to Postgres:
• Not columnar, in either the engine or the storage
• No threads; not distributed
• A big, complex system
Some non-benefits:
• Parsing, syntax, and similar: Haskell makes this easy
• The storage and execution engine: very row-based
Some things we miss:
• Wide range of features, data types, and operations
• Extensibility
• Cost-based optimiser
• Protocol/client compatibility
16. STEPS TOWARDS TODAY'S PRODUCT
Haskell compiler:
• Parse SQL
• Desugar to relational algebra
• Optimize
• Desugar to statement plan
Server:
• Network server
• Runtime: tree interpreter, building blocks, I/O, task runner
• Metadata database
• Columnar storage
17. SQREAM DB ARCHITECTURE
Statement compiler:
• SQL parser
• Desugar & optimize to relational algebra
• Desugar & optimize to low-level stages
Execution engine:
• Statement tree interpreter
• Task runners for I/O, CPU, and GPU, each with its own building blocks
• Task queue & thread manager
• Profiling support
• Memory managers: small, chunk, and spool
• Connection & session manager
• Concurrency & admission control
• Linux FS cache prodder
Storage layer:
• Metadata database, with low-level transactions (server or in-process)
• Bulk data layer: a sequence of extents
• Storage reorganizer
18. SOME ARCHITECTURE DETAILS
• Haskell has the intelligence
• C++/CUDA does the heavy lifting
• Message passing, worker pools
• Bulk data memory centric
• Storage is append-only with background reorganization
19. STORAGE AND TRANSACTIONS
• Metadata database with relatively conventional transactions
• Append-only storage layer with background reorganization
Transactions:
• Serializable, with any kind of statement
• Multiple queries can run concurrently with anything
• Multiple inserts to the same table can run at the same time
• Cannot run multiple statements in a single transaction
• Other operations, such as delete, truncate, and DDL, use coarse-grained exclusive locking
20. USING GPUS EFFECTIVELY
• Good kernels
• Optimise around GPU memory
• Use large chunks, rechunk where necessary
• Avoid PCI transfers where possible
• Profiling
• Partitioning
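"Use large chunks, rechunk where necessary" can be sketched as coalescing many small chunks (for example, from trickle inserts) into chunks of at least some target size, so that each PCI transfer and kernel launch works on a large batch. Names and the target-size policy here are illustrative assumptions, not SQream DB's actual implementation:

```cpp
#include <cstddef>
#include <vector>

// Coalesce small chunks into chunks of at least `target_rows` rows.
// Illustrative sketch: a real engine would rechunk column files on disk
// via the storage reorganizer rather than vectors in memory.
std::vector<std::vector<int>> rechunk(const std::vector<std::vector<int>>& chunks,
                                      std::size_t target_rows) {
    std::vector<std::vector<int>> out;
    std::vector<int> current;
    for (const auto& c : chunks) {
        current.insert(current.end(), c.begin(), c.end());
        if (current.size() >= target_rows) {
            out.push_back(current);   // emit one large chunk
            current.clear();
        }
    }
    if (!current.empty()) out.push_back(current);  // trailing partial chunk
    return out;
}
```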
22. HASH JOINS
• Can hashing run fast on the GPU?
• Answer from NVIDIA experts:
– in principle probably yes
– in practice, difficult to compete with sort-based algorithms
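The sort-based alternative referred to here is the sort merge join from the operator table earlier in the deck. A CPU sketch of the "simple implementation" idea, using std::lower_bound/std::upper_bound as stand-ins for the vectorized thrust::lower_bound/thrust::upper_bound (the unnest/gather steps are collapsed into an explicit pair-emitting loop; this is an illustration, not SQream DB's actual code):

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Inner join of two SORTED key columns: for each key in t, binary-search
// its matching range in u and emit one (t row, u row) index pair per match.
std::vector<std::pair<int, int>>
sort_merge_join(const std::vector<int>& t_keys,   // sorted
                const std::vector<int>& u_keys) { // sorted
    std::vector<std::pair<int, int>> out;
    for (std::size_t i = 0; i < t_keys.size(); ++i) {
        auto lo = std::lower_bound(u_keys.begin(), u_keys.end(), t_keys[i]);
        auto hi = std::upper_bound(u_keys.begin(), u_keys.end(), t_keys[i]);
        for (auto it = lo; it != hi; ++it)
            out.emplace_back(static_cast<int>(i),
                             static_cast<int>(it - u_keys.begin()));
    }
    return out;
}
```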
23. COMPRESSION
• GPU compression for typical columnar data
– e.g. Dictionary, RLE, Delta, Pfor + Combos
– Helps speed up IO and PCI transfer times
– in house code
• CPU compression for general data
– Helps speed up IO, but not PCI transfer times
– We use things like Snappy and LZ4
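Of the columnar schemes listed above, run-length encoding is the simplest to illustrate: repeated values collapse to (value, run length) pairs, so low-cardinality columns shrink dramatically before they hit disk or the PCI bus. A minimal CPU sketch (SQream's in-house GPU codecs are not public; this only shows the encoding itself):

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Encode a column as (value, run length) pairs.
std::vector<std::pair<int64_t, uint32_t>>
rle_encode(const std::vector<int64_t>& col) {
    std::vector<std::pair<int64_t, uint32_t>> runs;
    for (int64_t v : col) {
        if (!runs.empty() && runs.back().first == v)
            ++runs.back().second;      // extend the current run
        else
            runs.push_back({v, 1});    // start a new run
    }
    return runs;
}

// Expand the runs back into the original column.
std::vector<int64_t>
rle_decode(const std::vector<std::pair<int64_t, uint32_t>>& runs) {
    std::vector<int64_t> col;
    for (const auto& [v, n] : runs)
        col.insert(col.end(), n, v);
    return col;
}
```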
24. SOME FINAL THOUGHTS
• SQL analytics and GPUs are a natural fit
• GPUs can be very effective for big data/external algorithms
• Lots of exciting work is being done in non-SQL analytics (not just on GPUs)
• Haskell is a big positive
• Building a commercial SQL DBMS is very difficult
• Building a SQL DBMS is a really satisfying thing to do
25. HIGH THROUGHPUT, CONVERGED
• SQream DB is designed for high-throughput devices
• IBM Power Systems is the only architecture with NVLink CPU-to-GPU connectivity, unlocking the potential of high-throughput accelerated computing
• The IBM AC922, with POWER9 and NVLink, can transfer data at up to 300 GB/s, almost 9.5x faster than the roughly 32 GB/s of the PCIe 3.0 x16 links found in x86-based architectures, reducing classic I/O bottlenecks
[Diagram: two IBM Power9 CPUs, each connected to 2x NVIDIA Tesla V100 GPUs]
26. HIGH THROUGHPUT ARCHITECTURE
IT’S NOT JUST CORES
[Diagram: two Power9 CPUs joined by the IBM SMP bus, each with local RAM and two Tesla V100 GPUs]
• 170 GB/s memory bandwidth per CPU
• NVLink: 300 GB/s bidirectional between CPU and GPU
• 900 GB/s VRAM bandwidth per Tesla V100 GPU
27. UP TO 3.7X FASTER QUERIES
SQream DB performance, IBM Power9 vs Intel Xeon (Skylake). Query time in seconds, lower is better:
• TPC-H Query 8: Dell PowerEdge R740 52.83, IBM Power9 AC922 14.06
• TPC-H Query 6: Dell PowerEdge R740 10.35, IBM Power9 AC922 2.8
• TPC-H Query 19: Dell PowerEdge R740 84.5, IBM Power9 AC922 30.29
• TPC-H Query 17: Dell PowerEdge R740 78.57, IBM Power9 AC922 29.01
IBM Power9 AC922:
2x POWER9 16C @ 3.8GHz | 256 GB DDR4 2666 MHz | SSD storage | 4x NVIDIA Tesla V100 (SXM2 NVLINK - 16GB)
Dell PowerEdge R740:
2x Intel Xeon Silver 4112 CPU @ 2.60GHz | 256GB DDR4 2666MHz | SSD storage | 4x NVIDIA Tesla V100 (PCIe - 16GB)
• In our testing, SQream DB on Power9 is between 150% and 370% faster than comparable x86 architectures, especially on large data sets. For example, on the TPC-H (SF 10,000) dataset, Query 8 ran in about a quarter of the time on the IBM Power9, compared to the x86 competitor.
28. UNDERSTAND 40 MILLION CUSTOMERS
TELECOM
• SQream DB: HP DL380g9 with NVIDIA Tesla GPU, 96 GB RAM + 6 TB storage, $200K
• Alternative: 80 nodes (5 full racks), 7600 CPU cores, $10,000,000
[Chart: ingest time, reporting time, and ownership cost comparison; chart values 20M, 10M, 300M, 120M]