B17 Eliminating the database bottleneck

Published in: Technology

  1. Eliminating the Database Bottleneck
     What makes Vectorwise so fast
     Mark Van de Wiel, Director Product Management, Vectorwise
     Thursday, November 01, 2012
     Confidential © 2012 Actian Corporation
  2. Agenda
     • Why traditional RDBMSs are slow for analytics
     • Why Vectorwise is fast
     • The I/O challenge
     • Efficient updates
  3. 100x (+) Performance Difference – 2003: Custom C versus Relational Database
     [Bar chart: TPC-H 1 GB query 1, runtime in seconds, for MySQL, DBMS X, a C program, and Vectorwise. Data labels shown in the chart: 28.1, 26.2, 0.2, 0.6.]
  4. Traditional Relational Database for Analytics: Inefficiencies
     • Inefficient storage
     • Inefficient processing
  5. Inefficient Storage for Analytics
     • Row-based storage model
     • Predominant in 2003, still very common today
     • Works well for OLTP
     • Example rows:
         101  Joe     27  Black
         103  Edward  21  Scissorhand
  6. Inefficient Storage – Row-based
     • Pages on disk, example:
         101  27  Joe     Black
         103  21  Edward  Scissorhand
     • [Diagram labels: var-width attribute pointers; pointers to tuples]
  7. Issues with Row-based Storage
     • Always read all attributes
         • Poor bandwidth
         • Poor use of memory buffer
     • Complex row structure and navigation
         • E.g. compressing out null fields
         • E.g. row chaining
  8. Efficient Storage for Analytics
     • Columnar storage: store attributes separately
     • Retrieve only attributes required by the query
     • Used by “traditional” column stores, e.g. Sybase IQ, Vertica
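The difference between the two layouts can be sketched in a few lines of Python. This is a minimal illustration, not Vectorwise's storage format: the column names, the third row, and the helper names `scan_row_store` and `scan_column_store` are assumptions made for the example. It counts how many attribute values each layout has to touch to answer "first names of everyone older than 25".

```python
# Hypothetical table: the first two rows come from the slides,
# the third row and the column names are assumptions.
rows = [
    (101, "Joe", 27, "Black"),
    (103, "Edward", 21, "Scissorhand"),
    (104, "Ann", 35, "Lee"),
]


def scan_row_store(rows):
    """Row layout: every attribute of every row is read, used or not."""
    values_read = 0
    result = []
    for emp_id, first, age, last in rows:
        values_read += 4
        if age > 25:
            result.append(first)
    return result, values_read


# Column layout: one array per attribute.
columns = {
    "id": [r[0] for r in rows],
    "first": [r[1] for r in rows],
    "age": [r[2] for r in rows],
    "last": [r[3] for r in rows],
}


def scan_column_store(columns):
    """Column layout: only the two attributes the query needs are read."""
    ages = columns["age"]
    firsts = columns["first"]
    values_read = len(ages) + len(firsts)
    result = [name for name, age in zip(firsts, ages) if age > 25]
    return result, values_read


row_result, row_values = scan_row_store(rows)
col_result, col_values = scan_column_store(columns)
```

Both scans return the same names, but the column scan touches half as many values here; the gap widens as the table gets more columns that the query does not use.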
  9. Inefficient Processing: How a traditional database runs a query
     • Query:
         SELECT name, salary*.19 AS tax
         FROM employee
         WHERE age > 25
  10. Inefficient Processing: How a traditional database runs a query
     • Tuple-at-a-time iterator interface:
         - open()
         - next(): tuple
         - close()
     • next() is called:
         - for each operator
         - for each tuple
     • Complex code repeated over and over
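A tuple-at-a-time plan for the example query might look like the following Python sketch. The class names `Scan`, `Select`, and `Project`, the `run` helper, and the salary figures are assumptions for illustration; only the open()/next()/close() interface and the query itself come from the slides.

```python
class Scan:
    """Leaf operator: returns one stored tuple per next() call."""

    def __init__(self, table):
        self.table = table
        self.pos = 0

    def open(self):
        self.pos = 0

    def next(self):
        if self.pos >= len(self.table):
            return None  # end of input
        row = self.table[self.pos]
        self.pos += 1
        return row

    def close(self):
        pass


class Select:
    """Filter operator: pulls from its child until a tuple passes."""

    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def open(self):
        self.child.open()

    def next(self):
        while True:
            row = self.child.next()
            if row is None or self.predicate(row):
                return row

    def close(self):
        self.child.close()


class Project:
    """Projection operator: computes the output tuple for each input tuple."""

    def __init__(self, child, fn):
        self.child = child
        self.fn = fn

    def open(self):
        self.child.open()

    def next(self):
        row = self.child.next()
        return None if row is None else self.fn(row)

    def close(self):
        self.child.close()


def run(plan):
    """Drive the plan: one next() call per operator per tuple."""
    plan.open()
    out = []
    while True:
        row = plan.next()
        if row is None:
            break
        out.append(row)
    plan.close()
    return out


# Hypothetical data; the salaries are made up for the example.
employee = [
    {"name": "Joe", "age": 27, "salary": 1000.0},
    {"name": "Edward", "age": 21, "salary": 2000.0},
]

plan = Project(
    Select(Scan(employee), lambda r: r["age"] > 25),
    lambda r: (r["name"], r["salary"] * 0.19),
)
result = run(plan)
```

Every tuple travels through three `next()` calls plus a predicate call and a projection call, which is the per-tuple interpretation overhead the slide refers to.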
  11. Inefficient Processing: How a traditional database runs a query
     • Data-specific computational functionality
     • Called once for every operation on every tuple
     • Worse for complex tuple representations
  12. Inefficient Processing (Part 1 of 2)
     • Lots of repeated, unnecessary code
         • Operator logic
         • Function calls
         • Attribute access
     • Most instructions interpreting a query
     • Very few instructions processing actual data!
     • Many instructions per tuple
  13. CPU Features – Inefficient Processing Part 2
     • In the last 20 years…
         • Chip cache because RAM access is too slow and congested
         • Branch-sensitive CPU pipelines
         • Superscalar features
         • SIMD instructions (SSE and AVX)
     • Great for multimedia processing, scientific computing…
     • … but NOT for traditional relational databases
         • Complex code: function calls, branches
         • Poor use of CPU cache (both data and instructions)
         • Processing one value at a time
  14. Inefficient Processing: Traditional RDBMS
     • Many instructions per tuple
     • Many cycles per instruction
     • Result: very many cycles per tuple
  15. Vectorwise – Vector-based Processing
     • Query:
         SELECT name, salary*.19 AS tax
         FROM employee
         WHERE age > 25
  16. Vectorwise – Vector-based Processing
     • Vector contains data of multiple tuples (1024)
     • All operations consume and produce entire vectors
     • Effect: much less operator.next() and primitive calls
     • AND: pipelined query evaluation
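The same query can be sketched in a vector-at-a-time style. This is a rough Python illustration of the idea, not Vectorwise code: the primitive names `select_gt`, `map_mul`, and `gather`, the use of a selection vector of indices, and the generated test data are all assumptions. Only the vector size of 1024 and the query come from the slides.

```python
VECTOR_SIZE = 1024  # tuples per vector, as stated on the slide


def scan_vectors(columns, size=VECTOR_SIZE):
    """Yield one batch per call: a dict of column slices of up to `size` values."""
    n = len(next(iter(columns.values())))
    for start in range(0, n, size):
        yield {name: col[start:start + size] for name, col in columns.items()}


def select_gt(values, threshold):
    """Primitive: return the positions in the vector where value > threshold."""
    return [i for i, v in enumerate(values) if v > threshold]


def map_mul(values, factor, sel):
    """Primitive: multiply the selected values by a constant."""
    return [values[i] * factor for i in sel]


def gather(values, sel):
    """Primitive: copy out the selected values."""
    return [values[i] for i in sel]


def run_query(columns):
    """SELECT name, salary*.19 AS tax FROM employee WHERE age > 25."""
    names, taxes, batches = [], [], 0
    for vec in scan_vectors(columns):
        batches += 1
        sel = select_gt(vec["age"], 25)
        names.extend(gather(vec["name"], sel))
        taxes.extend(map_mul(vec["salary"], 0.19, sel))
    return names, taxes, batches


# Hypothetical data: 3000 employees with ages cycling through 20..29.
columns = {
    "name": [f"emp{i}" for i in range(3000)],
    "age": [20 + (i % 10) for i in range(3000)],
    "salary": [1000.0] * 3000,
}

names, taxes, batches = run_query(columns)
```

For 3000 tuples the operators are invoked three times each instead of 3000 times, and each primitive is a tight loop over plain arrays. In a compiled language such loops are what the compiler can pipeline and turn into SIMD code; the Python version only shows the control flow.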
  17. Why is Vectorwise so Fast?
     • Reduced interpretation overhead
         • 100+ times fewer function calls
     • Good CPU cache use
         • High locality in primitives
         • Cache-conscious algorithms
     • No tuple navigation
         • Primitives only see arrays
     • Vectorization allows algorithmic optimization
     • CPU and compiler-friendly function bodies
         • Multiple work units, loop-pipelining, SIMD…
     • BONUS: PARALLEL QUERY
  18. Some Numbers
     • Traditional RDBMS: <200 MB/s per core
     • Vectorwise (lab environment): >1.5 GB/s per core
  19. Addressing the I/O Challenge
     • Columnar storage
     • Smart column buffer (memory)
     • Data compression
         • On disk: less I/O
         • In memory: best use of column buffer
         • Ultra-efficient decompression algorithms to get sufficient throughput
     • Large contiguous data blocks for optimum disk I/O
     • In-memory min-max indexes per block (i.e. per column)
         • Eliminate data blocks based on implicit/explicit filter criteria
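Block elimination with min-max indexes can be illustrated with a small Python sketch. The block size, the dictionary layout, and the function names `build_blocks` and `scan_greater_than` are assumptions chosen for readability; the slides do not describe the actual index format.

```python
BLOCK_SIZE = 4  # tiny on purpose; real blocks are large and contiguous


def build_blocks(values, block_size=BLOCK_SIZE):
    """Split one column into blocks and record each block's min and max."""
    blocks = []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        blocks.append({"min": min(chunk), "max": max(chunk), "data": chunk})
    return blocks


def scan_greater_than(blocks, threshold):
    """Return values > threshold, skipping blocks that cannot contain any."""
    hits = []
    blocks_read = 0
    for block in blocks:
        if block["max"] <= threshold:
            continue  # the whole block fails the filter; no need to read it
        blocks_read += 1
        hits.extend(v for v in block["data"] if v > threshold)
    return hits, blocks_read


# Hypothetical age column, stored in three blocks.
ages = [21, 22, 23, 24, 25, 26, 27, 28, 40, 41, 42, 43]
blocks = build_blocks(ages)
hits, blocks_read = scan_greater_than(blocks, 25)
```

Here the first block is skipped because its maximum is below the filter value. The technique pays off most when the data is roughly ordered on the filtered column, so that many blocks have narrow, non-overlapping ranges.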
  20. Efficient Updates in a Column Store: Positional Delta Trees (PDTs)
     • In-memory representation of small data changes
     • Efficiently merged with on-disk data
     • Periodically propagated to disk
     • Provide snapshot read consistency
     • ACID compliant
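The general idea of keeping small changes in memory and merging them into a scan by position can be sketched as follows. This is a deliberately simplified stand-in, not the PDT data structure: plain dictionaries keyed by position in the on-disk column replace the tree, there is no transaction handling, and the names `DeltaStore`, `merged_scan`, and `propagate` are hypothetical.

```python
class DeltaStore:
    """In-memory changes, addressed by position in the stable (on-disk) column."""

    def __init__(self):
        self.modified = {}    # stable position -> replacement value
        self.deleted = set()  # stable positions that are logically removed
        self.inserted = {}    # stable position -> values inserted before it

    def modify(self, pos, value):
        self.modified[pos] = value

    def delete(self, pos):
        self.deleted.add(pos)

    def insert_before(self, pos, value):
        self.inserted.setdefault(pos, []).append(value)


def merged_scan(stable, deltas):
    """Yield the current column image by applying deltas while scanning."""
    for pos, value in enumerate(stable):
        yield from deltas.inserted.get(pos, [])
        if pos in deltas.deleted:
            continue
        yield deltas.modified.get(pos, value)
    # Values appended after the last stable position.
    yield from deltas.inserted.get(len(stable), [])


def propagate(stable, deltas):
    """Write the merged image out as the new stable column and reset the deltas."""
    return list(merged_scan(stable, deltas)), DeltaStore()


# Hypothetical column and a handful of changes.
stable = [10, 20, 30]
deltas = DeltaStore()
deltas.modify(1, 25)
deltas.delete(0)
deltas.insert_before(2, 27)
deltas.insert_before(3, 40)

current = list(merged_scan(stable, deltas))
new_stable, new_deltas = propagate(stable, deltas)
```

The stable column is never modified in place: readers see changes through the merge, and the on-disk image is only rewritten when the deltas are propagated.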
  21. Agenda
     • Why traditional RDBMSs are slow for analytics
     • Why Vectorwise is fast
     • The I/O challenge
     • Efficient updates
