
M|18 Understanding the Architecture of MariaDB ColumnStore

  1. 1. MariaDB ColumnStore Understanding the Architecture Andrew Hutchings (LinuxJedi) Lead Software Engineer, MariaDB ColumnStore
  2. 2. Who Am I? ● Andrew Hutchings, aka “LinuxJedi” ● Lead Software Engineer for MariaDB’s ColumnStore ● Previously worked for: ○ NGINX - Senior Developer Advocate / Technical Product Manager ○ HP - Principal Software Engineer (HP Cloud / ATG) ○ SkySQL - Senior Sustaining Engineer ○ Rackspace - Senior Software Engineer ○ Sun/Oracle - MySQL Senior Support Engineer ● Co-author of MySQL 5.1 Plugin Development ● IRC/Twitter: LinuxJedi ● Email: linuxjedi@mariadb.com
  3. 3. Overview ● History of MariaDB ColumnStore ● Technical Use Case ● Components of MariaDB ColumnStore ● Disk Storage ● Writing Data ● Querying Data ● Optimizing for MariaDB ColumnStore ● Closing Notes ● Questions
  4. 4. History of MariaDB ColumnStore ● March 2010 - Calpont launches InfiniDB ● September 2014 - Calpont (now itself called InfiniDB) closes down ○ MariaDB (then SkySQL) supports InfiniDB customers ● April 2016 - MariaDB announces development of MariaDB ColumnStore ● August 2016 - I joined MariaDB and jumped straight into ColumnStore ● December 2016 - MariaDB ColumnStore 1.0 GA ○ InfiniDB + MariaDB 10.1 + Many fixes and improvements ● November 2017 - MariaDB ColumnStore 1.1 GA ○ MariaDB 10.2 + APIs + Even more improvements
  5. 5. Technical Use Case
  6. 6. Technical Use Case ● MariaDB ColumnStore: ○ Very large data sets (many columns, many millions of rows) ○ Complex joins and aggregates ○ Rapid bulk data insertion (the larger the batch the better) ● Traditional OLTP engines: ○ Smaller data sets ○ Basic queries ○ Lots of DML queries ○ Complex data types
  7. 7. Data Types ● INT types - usable range is 2 values smaller, at the unsigned maximum or the signed minimum ● CHAR† - max 255 bytes ● VARCHAR† - max 8000 bytes ● DECIMAL - max 18 digits ● DOUBLE/FLOAT ● DATETIME - no sub-seconds (coming in 1.2) ● DATE ● BLOB/TEXT† † Empty string is the same as NULL
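  As a rough illustration of how these types might appear in a table definition (a sketch only; the table and column names are invented, and ENGINE=ColumnStore is the engine name used in 1.1):

    CREATE TABLE orders_cs (
      id BIGINT,                   -- no indexes, constraints, or PRIMARY KEY clauses
      item CHAR(8),                -- at 8 bytes or less this stays a fixed-length column
      comment_text VARCHAR(1000),  -- longer values go through the dictionary; '' is treated as NULL
      price DECIMAL(18,2),         -- DECIMAL is limited to 18 digits
      ship_date DATE,
      created DATETIME             -- no sub-second precision until 1.2
    ) ENGINE=ColumnStore;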
  8. 8. Other DDL Differences ● No indexes ○ Columns are somewhat self-indexing ● Auto increment is handled differently (a table comment) ● No constraints ● PARTITION syntax not supported ○ Columns are partitioned automatically
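  A hedged sketch of the table-comment form of auto increment mentioned above (names are invented; the comment syntax follows the ColumnStore documentation, so treat it as an assumption to verify against your version):

    CREATE TABLE events_cs (
      id BIGINT,
      payload VARCHAR(255)
    ) ENGINE=ColumnStore COMMENT='autoincrement=id';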
  9. 9. Row-oriented vs. Column-oriented Format
  Row-oriented (one record per row):
    ID | Fname    | Lname | State | Zip   | Phone          | Age | Sex
    1  | Bugs     | Bunny | NY    | 11217 | (718) 938-3235 | 34  | M
    2  | Yosemite | Sam   | CA    | 95389 | (209) 375-6572 | 52  | M
    3  | Daffy    | Duck  | NY    | 10013 | (212) 227-1810 | 35  | M
    4  | Elmer    | Fudd  | ME    | 04578 | (207) 882-7323 | 43  | M
    5  | Witch    | Hazel | MA    | 01970 | (978) 744-0991 | 57  | F
  Column-oriented (one file per column):
    ID:    1, 2, 3, 4, 5
    Fname: Bugs, Yosemite, Daffy, Elmer, Witch
    Lname: Bunny, Sam, Duck, Fudd, Hazel
    State: NY, CA, NY, ME, MA
    Zip:   11217, 95389, 10013, 04578, 01970
    Phone: (718) 938-3235, (209) 375-6572, (212) 227-1810, (207) 882-7323, (978) 744-0991
    Age:   34, 52, 35, 43, 57
    Sex:   M, M, M, M, F
  Example query:
    SELECT Fname FROM People WHERE State = 'NY'
  10. 10. Components of MariaDB ColumnStore
  11. 11. ColumnStore Modules ● User Module (UM) ○ MariaDB Server / Storage Engine Plugin ○ ExeMgr ○ DMLProc, DDLProc ● Performance Module (PM) ○ PrimProc ○ WriteEngine ○ ProcMgr / ProcMon ○ DBRM
  12. 12. Query Processing (diagram: SQL enters the User Module (UM); column primitives are sent down to the Performance Modules (PM), which read shared-nothing distributed data storage and send intermediate results back up to the UM)
  13. 13. Hardware Requirements ● Lots of RAM ○ minimum 32GB for UM, 16GB for PM ○ minimum 4GB for trying single server out on a VM ● Optimised for HDD spindles, will still work with SSD ○ We are looking into SSD optimisation soon ● More cores typically better ○ 8 core minimum recommendation ● For AWS, m4.4xlarge is the recommended minimum instance type
  14. 14. Disk Storage
  15. 15. Column Types ● Fixed-length fields: ○ 1-byte field with 8192 values per 8KB block ○ 2-byte field with 4096 values per 8KB block ○ 4-byte field with 2048 values per 8KB block ○ 8-byte field with 1024 values per 8KB block ● Dictionary structure made up of 2 files/extents with: ○ An 8-byte fixed-length token (pointer) ○ A variable-length value stored at the location identified by the pointer
  16. 16. Extent Map ● Object ID: the ID for the column (or dictionary) ● Object Type: column or dictionary ● LBID Start / End: start / end logical block pointer ● Minimum Value: lowest value in the extent ● Maximum Value: highest value in the extent ● Width: column width ● DBRoot: DBRoot (disk partition) number ● Partition ID / Segment ID / Block Offset: the extent number ● High Water Mark: atomic last block pointer
  17. 17. Disk Storage (diagram: logical layer of 8KB blocks grouped into extents of 8MB to 64MB holding 8 million rows each; physical layer of segment files, each mapping to an extent, stored as compression chunks)
  18. 18. Writing Data
  19. 19. Inserting Data ● Multiple methods ○ Single INSERTs ○ INSERT...SELECT ○ LOAD DATA INFILE ○ cpimport ○ Bulk Write API ● Designed for large bulk inserts ● Inserts are appended at the end of extents (or new extents created) ○ This means reads are not affected ○ A High Water Mark pointing to the last block is moved at the end of the insert
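  For example, a bulk load from SQL might look like the following (file path and table name are invented; with autocommit enabled this is redirected to the fast append path rather than row-by-row DML):

    LOAD DATA INFILE '/tmp/orders.csv'
      INTO TABLE orders_cs
      FIELDS TERMINATED BY ','
      LINES TERMINATED BY '\n';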
  20. 20. cpimport ● Uses CSV files or piped CSV data ● Fastest way to get data into ColumnStore ● Does minimal data conversion and pipes it straight into the PMs ○ Works by appending new blocks to the table and moving an atomic block pointer (HWM) ○ No UNDO log needed (atomic pointer not moved on rollback) ○ Therefore can cause a gap of 0-64KB in a column ● Can load multiple tables simultaneously ● Can load into multiple PMs for the same table simultaneously ● Can load into specific PMs for physical partitioning by PM
  21. 21. Bulk Write API ● A simple C++ API to inject data into the PMs ○ Bindings in Python and Java available ● Works in a similar way to cpimport ○ Append new blocks and an atomic block pointer (HWM) ● LGPL licensed
  22. 22. DML Writes ● Regular INSERT / UPDATE / DELETE ○ Also INSERT...SELECT and LOAD DATA INFILE when autocommit is off ● Slow compared to other engines ○ INSERT is very slow compared to cpimport ● Requires the use of a version buffer for an undo log ○ But INSERT appends to data blocks so no wasted storage ● Data sent to DMLProc to process
  23. 23. A Note About DELETE ● Needs to touch every column and the undo log ○ So very slow ● Also leaves a gap in the column that won’t be filled ● Marking rows as deleted with a flag column via an UPDATE query is faster ● Dropping entire partitions is instantaneous ○ Partitions can be disabled first
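  A sketch of the flag-column alternative to DELETE mentioned above (is_deleted is an invented column, not anything ColumnStore provides):

    -- Instead of: DELETE FROM orders_cs WHERE ship_date < '2016-01-01';
    UPDATE orders_cs SET is_deleted = 1 WHERE ship_date < '2016-01-01';
    -- Queries then filter the flag out:
    SELECT item, SUM(quantity) FROM orders_cs WHERE is_deleted = 0 GROUP BY item;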
  24. 24. INSERT...SELECT / LOAD DATA INFILE ● Injects the binary row data from MariaDB into cpimport ● Good for backwards compatibility with tools and remote loading ● cpimport then injects this data into the column extent files ○ In 1.2 it will use the write API instead ● If autocommit is turned off this will behave like regular DML instead (slow)
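  A hedged illustration of the autocommit behaviour described above (table names are invented):

    -- autocommit on (default): routed through the cpimport-style append path, fast
    INSERT INTO orders_cs SELECT * FROM orders_staging;

    -- autocommit off: goes through DMLProc and the version buffer, much slower
    SET autocommit = 0;
    INSERT INTO orders_cs SELECT * FROM orders_staging;
    COMMIT;
    SET autocommit = 1;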
  25. 25. Querying Data
  26. 26. Physical Execution Layout (diagram: MariaDB Client → MariaDB Server → round robin across ExeMgr processes → PrimProc processes on the PMs)
  27. 27. Extent Elimination
  Storage architecture reduces I/O:
    • Only touch column files that are in filter, projection, group by, and join conditions
    • Eliminate disk block touches to partitions outside filter and join conditions
  Example Orders table (Id, OrderId, Line, Item, Quantity, Price, Supplier, ShipDate, ShipMode), split into horizontal partitions (extents) of 8 million rows each:
    Id    | OrderId | Line | Item    | Quantity | Price | Supplier | ShipDate   | ShipMode
    1     | 1       | 1    | Laptop  | 5        | 1000  | Dell     | 2016-01-12 | G
    2     | 1       | 2    | Monitor | 5        | 200   | LG       | 2016-01-13 | G
    3     | 2       | 1    | Mouse   | 1        | 20    | Logitech | 2016-02-05 | M
    4     | 3       | 1    | Laptop  | 3        | 1600  | Apple    | 2016-01-31 | P
    ...   rows up to 8M     -> Extent 1: ShipDate 2016-01-12 to 2016-03-05
    8M+1  rows up to 16M    -> Extent 2: ShipDate 2016-03-05 to 2016-09-23 (ELIMINATED PARTITION)
    16M+1 rows up to 24M    -> Extent 3: ShipDate 2016-09-24 to 2017-01-06 (ELIMINATED PARTITION)
  For this query only Extent 1 can contain matching rows, so Extents 2 and 3 are never read:
    SELECT Item, sum(Quantity) FROM Orders WHERE ShipDate between '2016-01-01' and '2016-01-31' GROUP BY Item
  28. 28. Query Analysis
  MariaDB [tpch1]> select calsettrace(1);
  ...
  MariaDB [tpch1]> select c_count, count(*) as custdist
      -> from ( select c_custkey, count(o_orderkey) c_count
      -> from v_customer left outer join v_orders on c_custkey = o_custkey
      -> and o_comment not like '%special%requests%'
      -> group by c_custkey ) c_orders
      -> group by c_count
      -> order by custdist desc, c_count desc;
  ...
  42 rows in set, 1 warning (9.07 sec)
  MariaDB [tpch1]> select calgetstats()\G
  *************************** 1. row ***************************
  calgetstats(): Query Stats: MaxMemPct-4; NumTempFiles-0; TempFileSpace-0B; ApproxPhyI/O-0; CacheI/O-12503; BlocksTouched-12503; PartitionBlocksEliminated-812; MsgBytesIn-102MB; MsgBytesOut-3KB; Mode-Distributed
  1 row in set (0.00 sec)
  29. 29. Query Analysis
  MariaDB [tpch1]> select calgettrace()\G
  *************************** 1. row ***************************
  calgettrace():
  Desc Mode Table               TableOID ReferencedColumns                 PIO LIO   PBE Elapsed Rows
  BPS  PM   customer            7254     (c_custkey)                       0   75    0   0.032   150000
  TNS  UM   -                   -        -                                 -   -     -   0.045   150000
  BPS  PM   customer            7254     (c_custkey)                       0   0     75  0.000   0
  TNS  UM   -                   -        -                                 -   -     -   0.000   0
  TUS  UM   -                   -        -                                 -   -     -   0.303   150000
  BPS  PM   orders              7268     (o_comment,o_custkey,o_orderkey)  0   12428 0   2.293   1500000
  TNS  UM   -                   -        -                                 -   -     -   2.967   1500000
  BPS  PM   orders              7268     (o_comment,o_custkey,o_orderkey)  0   0     737 0.000   0
  TNS  UM   -                   -        -                                 -   -     -   0.000   0
  TUS  UM   -                   -        -                                 -   -     -   3.796   1500000
  HJS  UM   v_customer-v_orders -        -                                 -   -     -   -----   -
  TAS  UM   -                   -        -                                 -   -     -   1.658   150000
  TNS  UM   -                   -        -                                 -   -     -   0.044   150000
  TAS  UM   -                   -        -                                 -   -     -   0.050   42
  1 row in set (0.01 sec)
  30. 30. Cross Engine Joins ● Allows non-ColumnStore tables to join with ColumnStore ● The whole query is processed by ColumnStore ● Cross Engine makes new MariaDB connections to retrieve data from non-ColumnStore tables (diagram: the original query flows from MariaDB Server to ExeMgr, which issues a separate non-ColumnStore query back to MariaDB Server)
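  A sketch of a cross engine join (table names are invented; customers_innodb is a regular non-ColumnStore table, and the cross engine connection credentials must be configured separately, which is not shown here):

    SELECT o.item, c.region, SUM(o.quantity) AS qty
      FROM orders_cs o                                   -- ColumnStore fact table
      JOIN customers_innodb c ON c.id = o.customer_id    -- non-ColumnStore dimension table
     GROUP BY o.item, c.region;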
  31. 31. Optimizing for MariaDB ColumnStore
  32. 32. Data Modeling ● Star-schema optimizations are generally a good idea ● Conservative data typing is very important ○ Especially around fixed-length vs. dictionary boundary (8 bytes) ○ IP Address vs. IP Number ● Break down compound fields into individual fields: ○ Trivializes searching for sub-fields ○ Can avoid dictionary overhead ○ Cost to re-assemble is generally small
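  As a sketch of the IP Address vs. IP Number point (column and table names are invented; INET_ATON and INET_NTOA are standard MariaDB functions):

    -- Store the address as a number: it fits in a fixed-length column and avoids the dictionary
    INSERT INTO access_log_cs (ip_num, hit_date)
    VALUES (INET_ATON('192.0.2.10'), '2018-02-26');

    -- Convert back to dotted form only when presenting results
    SELECT INET_NTOA(ip_num) AS ip, COUNT(*) AS hits
      FROM access_log_cs
     GROUP BY ip_num;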
  33. 33. Data Insertion ● Order data as best you can before inserting ○ Helps extent elimination when min/max range for an extent is small ● Insert in large batches using cpimport or bulk write API
  34. 34. Improving Your Queries ● Avoid filtering on a >= 8-byte VARCHAR/CHAR column where possible ○ Two extents need to be read per column and there is no extent elimination ● Use extent map elimination where possible ● Don’t use a function to filter ○ Extent elimination won’t happen ● Only reference required columns, avoid “SELECT *” ● Use the smallest possible data type for your data ● Avoid large ORDER BY ● Read https://mariadb.com/kb/en/mariadb/columnstore-performance-tuning/
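  A hedged before/after for the function-filter point (table and column names are invented):

    -- Extent elimination can use the per-extent min/max for ship_date:
    SELECT item, SUM(quantity) FROM orders_cs
     WHERE ship_date BETWEEN '2016-01-01' AND '2016-01-31'
     GROUP BY item;

    -- Wrapping the column in a function defeats extent elimination, so every extent is scanned:
    SELECT item, SUM(quantity) FROM orders_cs
     WHERE YEAR(ship_date) = 2016 AND MONTH(ship_date) = 1
     GROUP BY item;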
  35. 35. Tuning ● Generally self-tuning ○ Uses as much RAM as possible automatically ○ Uses all CPU cores ● More RAM in PMs = more LRU data cache ● More RAM in UMs = ability to process aggregates / joins on bigger data sets ○ Disk joins are possible
  36. 36. Closing Notes
  37. 37. MariaDB ColumnStore 1.2 (later in 2018) ● MariaDB 10.3 base ● TIME datatype ● Microsecond support ● Improvements to LOAD DATA INFILE and INSERT...SELECT ● Phase 1 of MariaDB ColumnStore Storage Engine Convergence project ● Many other cool things
  38. 38. Thank you! linuxjedi@mariadb.com Twitter: @linuxjedi
