What is the key component for success?
In other words, what you do with your MySQL Server – in terms of physical design, schema design, and performance design – will be the biggest factor in whether a BI system hits the mark…
* Philip Russom, “Next Generation Data Warehouse Platforms”, TDWI, 2009. *
What technologies you should be looking at
* Philip Russom, “Next Generation Data Warehouse Platforms”, TDWI, 2009. *
Row or column-based engine?
Medium-very large data | Small-medium data
Very dynamic; query patterns change | Know exactly what to index; won’t change
Need very fast loads; little DML | Will be doing lots of single inserts/deletes
Only need subset of columns for query | Will need most columns in a table for query
Yes, Column-based tables! | Yes, Row-based tables!
Column vs. row orientation
A column-oriented architecture looks the same on the surface, but stores data differently than legacy/row-based databases…
Example: InfiniDB vs. “Leading” row DB
InfiniDB takes up 22% less space
InfiniDB loaded data 22% faster
InfiniDB total query times were 65% less
InfiniDB average query times were 59% less
Notice not only are the queries faster, but also more predictable
* Tests run on standalone machine: 16 CPU, 16GB RAM, CentOS 5.4 with 2TB of raw data
Really are row-oriented architectures that store data in ‘column families’, which are expected to be accessed together (remember logical vertical partitioning?)
Individual columns cannot be accessed independently
Will be faster with individual insert and delete operations
Will normally be faster with single row requests
Will lag in typical analytic / data warehouse use cases
Note that striping only works for some engines (e.g. MyISAM, Archive) and only on certain operating systems (e.g. the option is ignored on Windows). You can use ALTER TABLE ... REORGANIZE PARTITION to move existing partitions to new devices.
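As a rough sketch of the striping note above: the statements below place each partition of a MyISAM table on its own device and then rebuild one partition on a different device. The table name sales_history, its columns, and every directory path are hypothetical, and the sketch assumes a Unix-like host and a MySQL release that still supports partitioned MyISAM tables.

```sql
-- Sketch only: table name, columns and directory paths are hypothetical.
-- Assumes a Unix-like host; the directory options are ignored on Windows.
CREATE TABLE sales_history (
    sale_id   INT NOT NULL,
    sale_date DATE NOT NULL,
    amount    DECIMAL(10,2) NOT NULL
)
ENGINE = MyISAM
PARTITION BY RANGE (YEAR(sale_date)) (
    PARTITION p1995 VALUES LESS THAN (1996)
        DATA DIRECTORY = '/disk1/data'
        INDEX DIRECTORY = '/disk1/idx',
    PARTITION p1996 VALUES LESS THAN (1997)
        DATA DIRECTORY = '/disk2/data'
        INDEX DIRECTORY = '/disk2/idx'
);

-- Rebuild p1995 with the same range, pointing it at a different device.
ALTER TABLE sales_history REORGANIZE PARTITION p1995 INTO (
    PARTITION p1995 VALUES LESS THAN (1996)
        DATA DIRECTORY = '/disk3/data'
        INDEX DIRECTORY = '/disk3/idx'
);
```

REORGANIZE PARTITION rewrites the partition's rows, so on a large partition it is worth testing how long the move takes before scheduling it.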
Most data warehouses have pruning or obsolete data operations that remove unwanted data. Using partitioning allows you to remove obsolete data much more quickly and efficiently:
mysql> alter table t1 drop partition p1;
Query OK, 0 rows affected (0.03 sec)
Records: 0 Duplicates: 0 Warnings: 0
VS.
-> c3 > date '1995-01-01' and c3 < date '1995-12-31';
Query OK, 805114 rows affected (47.41 sec)
DROP PARTITION is a DDL operation, which runs much faster than a DML DELETE.
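For context, a minimal sketch of how a table such as t1 might be set up so that a whole year can be pruned with one DDL statement. The names t1, c3 and p1 come from the example above; the other columns, the extra partitions and the year boundaries are assumptions made for illustration.

```sql
-- Sketch only: c1, c2, p2, p3 and the year boundaries are assumed.
CREATE TABLE t1 (
    c1 INT NOT NULL,
    c2 VARCHAR(30),
    c3 DATE NOT NULL
)
PARTITION BY RANGE (YEAR(c3)) (
    PARTITION p1 VALUES LESS THAN (1996),   -- rows up to and including 1995
    PARTITION p2 VALUES LESS THAN (1997),   -- rows for 1996
    PARTITION p3 VALUES LESS THAN (1998)    -- rows for 1997
);

-- Pruning as DDL: discards the partition's storage in one step.
ALTER TABLE t1 DROP PARTITION p1;

-- Pruning as DML (illustrative alternative): every matching row is
-- located and deleted individually.
-- DELETE FROM t1 WHERE c3 >= '1995-01-01' AND c3 <= '1995-12-31';

-- Make room for the next period at the top of the range.
ALTER TABLE t1 ADD PARTITION (PARTITION p4 VALUES LESS THAN (1999));
```

The design choice is to align partition boundaries with the retention unit (here, one year), so that obsolete data always corresponds to whole partitions.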
If query patterns are known and predictable, and data is relatively static, then indexing isn’t that difficult
If the situation is a very ad-hoc environment, indexing becomes more difficult. Must analyze SQL traffic and index the best you can
Over-indexing a table that is frequently loaded / refreshed / updated can severely impact load and DML performance. Test dropping and re-creating indexes vs. doing in-place loads and DML. Realize, though, that any queries that run while the indexes are dropped will be slower
Index maintenance (rebuilds, etc.) can cause issues in MySQL (locking, etc.)
Remember some storage engines don’t support normal indexes (Archive, CSV)
Remember that a benefit of (most) column databases is that they do not need or use indexes
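The drop-and-re-create approach mentioned above might look like the following sketch. The table fact_sales, the index idx_sale_date and the file path are hypothetical; the idea is to time this sequence against a plain load with the index left in place and keep whichever is faster for your data.

```sql
-- Sketch only: table, index and file names are hypothetical.
CREATE TABLE fact_sales (
    sale_id   INT NOT NULL,
    sale_date DATE NOT NULL,
    amount    DECIMAL(10,2) NOT NULL,
    INDEX idx_sale_date (sale_date)
);

-- 1. Drop the secondary index so the load does not maintain it row by row.
ALTER TABLE fact_sales DROP INDEX idx_sale_date;

-- 2. Bulk load from a flat file that sits on the database host.
LOAD DATA INFILE '/tmp/fact_sales.csv'
INTO TABLE fact_sales
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(sale_id, sale_date, amount);

-- 3. Rebuild the index once, after all rows are in place.
ALTER TABLE fact_sales ADD INDEX idx_sale_date (sale_date);
```

Queries that filter on sale_date between steps 1 and 3 fall back to full scans, so this pattern suits load windows in which the table is not being queried.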
Move the data as close to the database as possible; avoid having applications on remote machines do data manipulations and send data across the wire a row at a time – perhaps the worst way to load data
It is often good to create staging tables, then use a procedural language to do data modifications and/or create flat files for high speed loaders
Loading data in time-based order helps some column databases like InfiniDB; logical range partitioning is then possible
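A minimal sketch of the staging-table pattern described above. All table names, columns and the file path are hypothetical, and a set-based INSERT ... SELECT stands in here for the procedural clean-up step; a stored procedure could do the same work where row-by-row logic is needed.

```sql
-- Sketch only: all names and paths are hypothetical.
-- Staging table: loose types, no indexes, accepts the raw file as delivered.
CREATE TABLE stage_orders (
    order_id   INT,
    order_date VARCHAR(10),
    amount     VARCHAR(20)
);

-- Target table with the final types.
CREATE TABLE fact_orders (
    order_id   INT NOT NULL,
    order_date DATE NOT NULL,
    amount     DECIMAL(10,2) NOT NULL
);

-- Load the flat file on the database host instead of sending rows
-- across the wire one at a time.
LOAD DATA INFILE '/tmp/orders.csv'
INTO TABLE stage_orders
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(order_id, order_date, amount);

-- Clean and convert inside the database, inserting in date order.
INSERT INTO fact_orders (order_id, order_date, amount)
SELECT order_id,
       STR_TO_DATE(order_date, '%Y-%m-%d'),
       CAST(amount AS DECIMAL(10,2))
FROM stage_orders
WHERE order_id IS NOT NULL
ORDER BY STR_TO_DATE(order_date, '%Y-%m-%d');
```

The ORDER BY reflects the time-based load order suggested above; whether it helps depends on the storage engine, so it is worth measuring rather than assuming.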