Database Design most common pitfalls


It's easy to find antipatterns in production databases. Our schemas should be simple but extensible, and allow fast SQL queries. In this webinar I discuss what the most common antipatterns are, and how to correct them.

Published in: Software

  1. 1. Database Design most common pitfalls
  2. 2. € whoami ● Federico Razzoli ● Freelance consultant ● Working with databases since 2000 ● I worked as a consultant for Percona and Ibuildings (mainly MySQL and MariaDB) ● I worked as a DBA for fast-growing companies like Catawiki, HumanState, TransferWise
  3. 3. Agenda We will talk about… ● The most common design bad practices ● Information that is not easy to represent ● Relational model: why? ● Keys and indexes ● Data types ● Abusing NULL ● Hierarchies (trees) ● Lists ● Inheritance & polymorphism ● Heterogeneous rows ● Misc
  4. 4. Criteria
  5. 5. Criteria ● Queries should be fast ● Data structures should be reasonably simple ● Design must be reasonably extendable
  6. 6. Why Relational?
  7. 7. Specific Use Cases ● Some databases are designed for specific use cases ● In those cases, they may work much better than generic technologies ● Using them when not necessary may lead to running too many technologies ● A technology should only be introduced if our company has: ○ Skills ○ Knowledge necessary for troubleshooting ○ Backups ○ High Availability ○ ...
  8. 8. Relational is flexible With the relational model we: ● Are sure that data is written correctly (transactions) ● Can make sure that data is valid (schema, integrity constraints) ● Design tables with access patterns in mind ● To run a query we initially didn’t consider, most of the time we can just add an index
  9. 9. Flexibility example CREATE TABLE user ( id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY, name VARCHAR(100) NOT NULL, surname VARCHAR(100) NOT NULL, email VARCHAR(100) NOT NULL UNIQUE ); SELECT * FROM user WHERE id = 24; SELECT name, surname FROM user WHERE email = ''; CREATE INDEX idx_surname_name ON user (surname, name); SELECT name, surname FROM user WHERE surname LIKE 'B%' ORDER BY surname, name;
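
The claim that a new query can often be served by just adding an index can be checked with a query plan. The sketch below is an assumption-laden stand-in: it uses Python's built-in `sqlite3` module instead of MySQL, simplified column types, and a range predicate in place of `LIKE 'B%'` (SQLite only optimises `LIKE` under extra conditions). The helper `plan` is hypothetical.

```python
import sqlite3

# SQLite in memory stands in for MySQL here (assumption for the sake of a runnable sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user (
        id      INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        surname TEXT NOT NULL,
        email   TEXT NOT NULL UNIQUE
    );
    CREATE INDEX idx_surname_name ON user (surname, name);
""")


def plan(sql):
    """Return the query plan as one string (the last column holds the detail text)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


# Surnames starting with 'B', written as a range so the index can be used.
by_surname = plan(
    "SELECT name, surname FROM user "
    "WHERE surname >= 'B' AND surname < 'C' "
    "ORDER BY surname, name"
)
print(by_surname)
```

The printed plan should name `idx_surname_name`; the exact wording of the plan differs between SQLite versions and between database engines.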
  10. 10. When Relational is not a good fit ● Heterogeneous data (product catalogue) ● Searchable text ● Graphs ● … However, for simple use cases relational databases include non-relational features, like: ● JSON type and functions ● Arrays (PostgreSQL) ● Fulltext indexes ● ...
  11. 11. Keys and Indexes
  12. 12. Primary Key ● Column or set of columns that identifies each row (unique, not null) ● Usually you want to create an artificial column for this: ○ id ○ or uuid
  13. 13. Poor Primary Keys ● No primary key! ○ In MySQL this causes many performance problems ○ CDC applications need a way to identify each row ● Wrong columns ○ email ■ An email can change over time ■ An email address can be assigned to another person ■ The primary key is PII! ○ name (eg: city name, product name…) ■ Quite long, especially if it must be UTF-8 ■ Certain names can change over time ○ timestamp ■ Multiple rows could be created at the same timestamp! ■ Long ○ ...
  14. 14. UNIQUE ● An index whose values are distinct, or NULL ● Could theoretically be a primary key, but it’s not
  15. 15. Poor UNIQUE keys ● Columns whose values will always be distinct, no matter if there is an index or not ○ Enforcing unicity implies extra reads, possibly on disk ● Columns that could have duplicates, but they’re unlikely ○ timestamp ○ (last_name, first_name)
  16. 16. Foreign Keys ● References to another table (user.city_id -> ● In most cases they are bad for performance ● They create problems for operations (ALTER TABLE) ● In MySQL they are not compatible with some other features ○ They don’t activate triggers ○ Table partitioning ○ Tables not using InnoDB ○ Many bugs
  17. 17. Indexing Bad Practices ● Indexing all columns: it won’t work ● Multi-columns indexes in random order ● Indexing columns with few distinct values (eg, boolean) ○ Unless you know what you’re doing ● Indexes contained in other indexes: idx1 (email), idx2 (email, last_name) idx (email, id) UNIQUE unq1 (email), INDEX idx1 (email) ● Non-descriptive index names (like the ones above) Looking at an index name (EXPLAIN), I should know which columns it contains
  18. 18. Quick hints ● Learn how indexes work ○ Google: Federico Razzoli indexes bad practices ● Use pt-duplicate-key-checker, from Percona Toolkit
  19. 19. Data Types
  20. 20. Integer Types ● Don’t use bigger types than necessary ● ...but don’t overoptimise when you are not 100% sure. You’ll hardly see a benefit using TINYINT instead of SMALLINT ● MySQL UNSIGNED is good: the column’s maximum value roughly doubles ● I discourage the use of exotic MySQL syntax like: ○ MEDIUMINT: non-standard, and 3-byte variables don’t exist in nature ○ INT(length) ○ ZEROFILL
  21. 21. Real Numbers ● FLOAT and DOUBLE are fast when aggregating many values ● But they are subject to approximation. Don’t use them for prices, etc ● Instead you can use: ○ DECIMAL ○ INT - Multiply a number by 100, for example ○ DECIMAL is slower if heavy arithmetic is performed on many values ○ But storing a transformed value (price*100) can lead to misunderstandings and bugs
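
The approximation problem is easy to reproduce outside the database. This is a minimal Python sketch, not database code: Python's `float` is a binary double like DOUBLE, and `decimal.Decimal` plays the role of DECIMAL. The helper `format_cents` is hypothetical and only illustrates the conversion step that the price*100 approach requires.

```python
from decimal import Decimal

# Binary floating point cannot represent 0.1 or 0.2 exactly.
float_total = 0.1 + 0.2

# Exact decimal arithmetic, comparable to a DECIMAL column.
decimal_total = Decimal("0.10") + Decimal("0.20")

# Integer cents, comparable to an INT column holding price * 100.
cents_total = 10 + 20


def format_cents(cents):
    """Turn stored integer cents back into a display string."""
    return f"{cents // 100}.{cents % 100:02d}"


print(float_total == 0.3)                # False
print(decimal_total == Decimal("0.30"))  # True
print(format_cents(cents_total))         # 0.30
```

Every reader of an integer-cents column has to remember to apply something like `format_cents`, which is the source of the misunderstandings mentioned in the slide.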
  22. 22. Text Values ● Be sure that VARCHAR columns have adequate size for your data ● In PostgreSQL there is no difference between VARCHAR and TEXT, except that for VARCHAR you specify a max size ● In MySQL TEXT and BLOB columns are stored separately ○ Less data read if you often don’t read those columns ○ More read operations if you always use SELECT * ● CHAR is only good for small fixed-size data. The space saving is tiny.
  23. 23. Temporal Types ● TIMESTAMP and DATETIME are mostly interchangeable ● MySQL YEAR is weird. The meaning of 2-digit values changes over time. Use SMALLINT instead. ● MySQL TIME is apparently weird and useless. But not if you consider it as an interval. (range: -838:59:59 .. 838:59:59) ● PostgreSQL has a proper INTERVAL type, which is surely better ● PostgreSQL lets you specify a time zone when writing each value (TIMESTAMP WITH TIME ZONE); values are converted to UTC for storage ○ Timezones depend on policy, economy and religion. They may vary by 15 mins. Timezones are created, dismissed, and changed. In one case a timezone was changed by skipping a whole calendar day. ○ Never deal with timezones yourself, no one ever succeeded in history. Store all dates as UTC, use an external library for conversion.
  24. 24. ENUM, SET ● MySQL weird types that include a list of allowed string values ● With ENUM, exactly one value from the list is allowed ● With SET, any number of values from the list are allowed ● '' is always allowed, because. ● Specifying the value by index is allowed, so in ENUM('0', '1', '2') the number 2 is stored as '1' ● Adding, dropping and changing values requires an ALTER TABLE ○ And possibly a locking table rebuild
  25. 25. Instead of ENUM CREATE TABLE account ( state ENUM('active', 'suspended') NOT NULL, ... )
  26. 26. Instead of ENUM CREATE TABLE account ( state_id INT UNSIGNED NOT NULL, ... ); CREATE TABLE state ( id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY, state VARCHAR(100) NOT NULL UNIQUE ); INSERT INTO state (state) VALUES ('active'), ('suspended');
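
The practical gain of the lookup table is that a new state is a row, not a schema change. A minimal sketch, assuming SQLite (via Python's `sqlite3`) as a stand-in for MySQL; the `'closed'` state and the `account.id` column are hypothetical additions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE state (
        id    INTEGER PRIMARY KEY,   -- auto-assigned, like AUTO_INCREMENT
        state TEXT NOT NULL UNIQUE
    );
    CREATE TABLE account (
        id       INTEGER PRIMARY KEY,
        state_id INTEGER NOT NULL
    );
    INSERT INTO state (state) VALUES ('active'), ('suspended');
""")

# Adding a new allowed value is a plain INSERT: no ALTER TABLE, no table rebuild.
conn.execute("INSERT INTO state (state) VALUES ('closed')")

closed_id = conn.execute(
    "SELECT id FROM state WHERE state = 'closed'"
).fetchone()[0]
conn.execute("INSERT INTO account (state_id) VALUES (?)", (closed_id,))

# Reading the state name back costs one JOIN.
account_states = conn.execute(
    "SELECT a.id, s.state FROM account a JOIN state s ON s.id = a.state_id"
).fetchall()
print(account_states)  # [(1, 'closed')]
```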
  27. 27. Abusing NULL
  28. 28. NULL anomalies mysql> SELECT NULL = 1 AS a, NULL <> 1 AS b, NULL IS NULL AS c, 1 IS NOT NULL AS d; +------+------+---+---+ | a | b | c | d | +------+------+---+---+ | NULL | NULL | 1 | 1 | +------+------+---+---+ -- This returns TRUE in MySQL: NULL <=> NULL AND 1 <=> 1
  29. 29. Problematic queries These queries will not return rows where year or approved is NULL ● WHERE year != 1994 ● WHERE NOT (year = 1994) ● WHERE year > 2000 ● WHERE NOT (year > 2000) ● WHERE approved != TRUE ● WHERE NOT approved And this returns NULL for rows where age is NULL: SELECT CONCAT(age, ' years old') FROM user ...
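
The silent loss of NULL rows can be reproduced in a few lines. This sketch assumes SQLite (Python's `sqlite3`) rather than MySQL; the three-valued logic shown is standard SQL and behaves the same way. The table contents and the helper `count` are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, year INTEGER)")
conn.executemany(
    "INSERT INTO user (year) VALUES (?)",
    [(1994,), (2001,), (None,)],  # the third row has year = NULL
)


def count(where):
    """Count the rows matching a WHERE clause (trusted input only)."""
    return conn.execute(
        "SELECT COUNT(*) FROM user WHERE " + where
    ).fetchone()[0]


not_equal = count("year != 1994")                    # NULL row is skipped
negated = count("NOT (year = 1994)")                 # NULL row is skipped again
null_aware = count("year != 1994 OR year IS NULL")   # NULL row is included

print(not_equal, negated, null_aware)  # 1 1 2
```

A comparison with NULL yields NULL, and `NOT NULL` is still NULL, so neither form of the condition ever becomes true for the third row.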
  30. 30. Bad Reasons for NULL ● Because columns are NULLable by default ● To indicate that a value doesn’t exist ○ Use a special value instead: '' or -1 or 0 or … ○ But this is not always a bad reason: UNIQUE allows multiple NULLs ● Using your tables as spreadsheets
  31. 31. Spreadsheet Example CREATE TABLE user ( id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY, first_name VARCHAR(100) NOT NULL, last_name VARCHAR(100) NOT NULL, email VARCHAR(100) NOT NULL, -- if a user may have multiple URLs, let’s move them -- to a separate table: -- url { id, user_id, url } url_1 VARCHAR(100), url_2 VARCHAR(100), url_3 VARCHAR(100), url_4 VARCHAR(100), url_5 VARCHAR(100) );
  32. 32. Spreadsheet Example CREATE TABLE user ( id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY, first_name VARCHAR(100) NOT NULL, last_name VARCHAR(100) NOT NULL, email VARCHAR(100) NOT NULL, -- if users may or may not have bank data, -- let’s move it to another table: -- bank { user_id, account_no, account_holder, ... } bank_account_no VARCHAR(50), bank_account_holder VARCHAR(100), bank_iban VARCHAR(100), bank_swift_code VARCHAR(11) -- SWIFT/BIC codes are 8 or 11 characters );
  33. 33. Hierarchies
  34. 34. Category Hierarchies Antipattern: column-per-level TABLE product (id, category_name, subcategory_name, name, price, ..) ----- TABLE category (id, name) TABLE product (id, category_id, subcategory_id, name, price, ...) Possible problems: ● To add or delete a level, we need to add or drop a column ● A subcategory can be erroneously linked to multiple categories ● A category can be erroneously used as subcategory, and vice versa
  35. 35. Category Hierarchies A better way: TABLE category (id, parent_id, name) TABLE product (id, category_id, name, price, ...) Possible problems: ● Circular dependencies (must be prevented at application level)
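
Preventing circular dependencies at application level can be as simple as walking up the ancestors before changing a `parent_id`. A minimal sketch, assuming SQLite through Python's `sqlite3`; the function `would_create_cycle` and the sample categories are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE category ("
    " id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO category (id, parent_id, name) VALUES (?, ?, ?)",
    [(1, None, "sports"), (2, 1, "football"), (3, 2, "wear")],
)


def would_create_cycle(conn, category_id, new_parent_id):
    """Return True if making new_parent_id the parent of category_id closes a loop."""
    seen = set()
    current = new_parent_id
    while current is not None:
        if current == category_id:
            return True   # category_id is its own ancestor-to-be
        if current in seen:
            return True   # the stored data already contains a loop
        seen.add(current)
        row = conn.execute(
            "SELECT parent_id FROM category WHERE id = ?", (current,)
        ).fetchone()
        current = row[0] if row else None
    return False


# 'sports' under 'wear' would loop: wear -> football -> sports -> wear.
print(would_create_cycle(conn, 1, 3))  # True
# 'wear' directly under 'sports' is fine.
print(would_create_cycle(conn, 3, 1))  # False
```

The check issues one query per level, which is acceptable for shallow category trees; it should run in the same transaction as the update so that concurrent changes cannot slip in between.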
  36. 36. Category Networks What if every category can have multiple parents? Antipattern: TABLE category (id, parent_id1, parent_id2, name)
  37. 37. Category Graphs If every category can have multiple parents, correct pattern: TABLE category (id, name) TABLE category_relationship (parent_id, child_id)
  38. 38. Antipattern: Parent List Storing the whole path of parents in a column: TABLE category (id, name, parent_list) INSERT INTO category (parent_list, name) VALUES ('sports/football/wear', 'football shoes'); ● This antipattern is sometimes used because it simplifies certain aspects ● But it overcomplicates other aspects ● Also, until recently MySQL and MariaDB did not support recursive queries, but now they do
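
With recursive queries available, the `parent_id` design no longer needs a stored path: a whole subtree can be read in one statement. The sketch below assumes SQLite via Python's `sqlite3`; `WITH RECURSIVE` is standard SQL, but the sample data and the CTE name `subtree` are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE category ("
    " id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO category (id, parent_id, name) VALUES (?, ?, ?)",
    [
        (1, None, "sports"),
        (2, 1, "football"),
        (3, 2, "wear"),
        (4, 1, "tennis"),
    ],
)

# Start from one category and repeatedly join its children.
descendants = conn.execute(
    """
    WITH RECURSIVE subtree (id, name, depth) AS (
        SELECT id, name, 0 FROM category WHERE id = ?
        UNION ALL
        SELECT c.id, c.name, s.depth + 1
        FROM category c
        JOIN subtree s ON c.parent_id = s.id
    )
    SELECT name, depth FROM subtree ORDER BY depth, name
    """,
    (1,),
).fetchall()

print(descendants)
# [('sports', 0), ('football', 1), ('tennis', 1), ('wear', 2)]
```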
  39. 39. Storing Lists
  40. 40. Tags Column ● Suppose you want to store user-typed tags for posts ● You may be tempted to: CREATE TABLE post ( id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY, tags VARCHAR(200) ); INSERT INTO post (tags, ... ) VALUES ('sunday,venus', ... );
  41. 41. Tags Column ● But what about this query? It also matches 'sunday': SELECT id FROM post WHERE tags LIKE '%sun%'; ● Mmm, maybe this is better: INSERT INTO post (tags, ... ) VALUES (',events,diversity,', ... ); SELECT id FROM post WHERE tags LIKE '%,sun,%'; However, this query cannot take advantage of indexes
  42. 42. Tag Table CREATE TABLE post ( id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY, ... ); CREATE TABLE tag ( post_id INT UNSIGNED, tag VARCHAR(50), PRIMARY KEY (post_id, tag), INDEX (tag) ); It works. Queries will be able to use indexes.
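
A short run shows why the tag table avoids the substring problem of the comma-separated column: tags are compared with plain equality. This is a sketch under assumptions: SQLite via Python's `sqlite3` instead of MySQL, a hypothetical `title` column, and invented sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE post (
        id    INTEGER PRIMARY KEY,
        title TEXT NOT NULL
    );
    CREATE TABLE tag (
        post_id INTEGER NOT NULL,
        tag     TEXT NOT NULL,
        PRIMARY KEY (post_id, tag)
    );
    CREATE INDEX idx_tag ON tag (tag);

    INSERT INTO post (id, title) VALUES (1, 'weekend plans'), (2, 'astronomy');
    INSERT INTO tag (post_id, tag) VALUES
        (1, 'sunday'), (1, 'venus'),
        (2, 'sun'),    (2, 'venus');
""")


def posts_with_tag(tag):
    """Return the ids of posts carrying exactly this tag."""
    rows = conn.execute(
        "SELECT post_id FROM tag WHERE tag = ? ORDER BY post_id", (tag,)
    )
    return [row[0] for row in rows]


sun_posts = posts_with_tag("sun")      # 'sunday' does not match
venus_posts = posts_with_tag("venus")

print(sun_posts)    # [2]
print(venus_posts)  # [1, 2]
```

Because the lookup is an equality on an indexed column, it does not need the leading-wildcard `LIKE` that defeats indexes in the single-column version.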
  43. 43. Tag Array -- PostgreSQL CREATE TABLE post ( id SERIAL PRIMARY KEY, tags TEXT[] ); CREATE INDEX idx_tags ON post USING GIN (tags); -- MySQL 8.0.17+ (multi-valued index) CREATE TABLE post ( id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY, tags JSON DEFAULT (JSON_ARRAY()), INDEX idx_tags ((CAST(tags->'$' AS CHAR(50) ARRAY))) ); -- MariaDB can store JSON arrays, -- but since it cannot index them this solution is not viable
  44. 44. Inheritance And Polymorphism
  45. 45. Not So Different Entities ● Your DB has users, landlords and tenants ● Separate entities with different info ● But sometimes you treat them as one thing ● What to do?
  46. 46. Inheritance ● In the simplest case, they are just subclasses ● For example, landlords and tenants could be types of users ● Common properties are in the parent class -- relational way to represent it: TABLE user (id, first_name, last_name, email) TABLE landlord (id, user_id, vat_number) TABLE tenant (id, user_id, landlord_id) PostgreSQL lets you do this in a more object-oriented way, with Table Inheritance
  47. 47. Different Entities ● But sometimes it’s better to consider them different entities ● Antipattern: Union View CREATE VIEW everyone AS (SELECT id, first_name, last_name FROM landlord) UNION (SELECT id, first_name, last_name FROM tenant) ; This makes some queries less verbose, at the cost of making them potentially very slow
  48. 48. Unicity Across Tables /1 ● But maybe both landlords and tenants have emails, and we want to make sure they are UNIQUE ● Question: is there a practical reason?
  49. 49. Unicity Across Tables /2 ● If it is necessary, you’re thinking about the problem the wrong way ● If emails need to be unique, they are a whole entity, so you’ll guarantee unicity on a single table TABLE landlord (id, first_name, last_name, vat_number) TABLE tenant (id, first_name, last_name, landlord_id) TABLE email (id, email UNIQUE, landlord_id, tenant_id) Bloody hell! The solution initially looks great, but linking emails to landlords or tenants in that way is horrific!
  50. 50. Unicity Across Tables /2bis Why? ● Cannot build foreign keys (I don’t recommend it, but…) ● If in the future we want to link emails to suppliers, employees, etc, we’ll need to add columns to the table
  51. 51. Unicity Across Tables /3 Even if we keep the landlord and tenant tables separated, we can create a superset called person. We decided it’s not a parent class, so it can just have an id column. Every landlord, tenant and email is linked to a person. TABLE landlord (id, person_id, first_name, last_name, vat_number) TABLE tenant (id, person_id, first_name, last_name, landlord_id) TABLE person (id) TABLE email (id, person_id, email UNIQUE)
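
The person superset can be tried out quickly: one UNIQUE constraint on `email` covers landlords and tenants alike. A minimal sketch, assuming SQLite via Python's `sqlite3`; the names, addresses, VAT number and the helper `try_add_email` are all made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY);
    CREATE TABLE landlord (
        id INTEGER PRIMARY KEY, person_id INTEGER NOT NULL,
        first_name TEXT NOT NULL, last_name TEXT NOT NULL, vat_number TEXT
    );
    CREATE TABLE tenant (
        id INTEGER PRIMARY KEY, person_id INTEGER NOT NULL,
        first_name TEXT NOT NULL, last_name TEXT NOT NULL, landlord_id INTEGER
    );
    CREATE TABLE email (
        id INTEGER PRIMARY KEY, person_id INTEGER NOT NULL,
        email TEXT NOT NULL UNIQUE
    );

    INSERT INTO person (id) VALUES (1), (2);
    INSERT INTO landlord (person_id, first_name, last_name, vat_number)
        VALUES (1, 'Ada', 'Rossi', 'IT000');
    INSERT INTO tenant (person_id, first_name, last_name, landlord_id)
        VALUES (2, 'Bo', 'Bianchi', 1);
    INSERT INTO email (person_id, email) VALUES (1, 'ada@example.com');
""")


def try_add_email(person_id, address):
    """Insert an email for a person; return False if the address is already taken."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO email (person_id, email) VALUES (?, ?)",
                (person_id, address),
            )
        return True
    except sqlite3.IntegrityError:
        return False


# The tenant cannot reuse the landlord's address...
duplicate_accepted = try_add_email(2, "ada@example.com")
# ...but can register a different one.
distinct_accepted = try_add_email(2, "bo@example.com")

print(duplicate_accepted, distinct_accepted)  # False True
```

Linking emails to suppliers or employees later only requires giving those rows a `person_id`; the `email` table itself stays unchanged.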
  52. 52. Heterogeneous Rows
  53. 53. Catalog of Products Imagine we have a catalogue of products where: ● Every product has certain common characteristics ● It’s important to be able to run queries on all products ○ SELECT id FROM p WHERE qty = 0; ○ SELECT MAX(price) FROM p GROUP BY vendor; ● Each product type also has a unique set of characteristics
  54. 54. Antipattern: Spreadsheet Table ● Keep all products in the same table ● Add a column for every characteristic that applies to at least one product ● Where a column doesn’t make sense, set to NULL Problems: ● Too many columns and indexes ○ Generally bad for query performance, especially INSERTs ○ Generally bad for operations (repair, backup, restore, ALTER TABLE…) ● Adding/removing a product type means adding/removing a set of columns ○ But in practice columns will hardly be removed and will remain unused ● NULL means both “no value for this product” and “doesn’t apply to this type of products”, leading to endless confusion
  55. 55. Antipattern: Table per Type ● Store products of different types in different tables Problems: ● Metadata become data ○ How to get the list of product types? ● Some queries become overcomplicated ○ Get the ids of out-of-stock products ○ Most expensive product for each vendor
  56. 56. Hybrid ● A single table for characteristics common to all product types ● A separate table per product type, for non-common characteristics Problems: ● Many JOINs ● Adding/removing product types means adding/removing tables
  57. 57. Semi-Structured Data ● A single table for all products ● A regular column for each characteristic common to all product types ● A semi-structured column for all type-specific characteristics ○ JSON, HStore… ○ Not arrays ○ Not CSV ● Proper indexes on semi-structured data (depending on your technology) Problems: ● Still a big table ● Queries on semi-structured data may be complicated and not supported by ORMs
  58. 58. Antipattern: Entity,Attribute,Value TABLE entity (id, name) TABLE attribute (id, entity_id, name) TABLE value (id, attribute_id, value) ● Each product type is an entity ● Each type’s characteristics are stored in attribute ● Each product is a set of values Example: Entity { id: 24, name: "Bed" } Attribute [ { id: 123, entity_id: 24, name: "material" }, ... ] Value [ { id: 999, attribute_id: 123, value: "wood" } ]
  59. 59. Antipattern: Entity,Attribute,Value Problems: ● We JOIN 3 tables every time we want to get a single value! ● All values must be treated as texts ○ Unless we create multiple value tables: int_value, text_value... ○ Which means, even more JOINs
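
Both problems show up as soon as one value is read back. The sketch below rebuilds the three tables from the slide, assuming SQLite via Python's `sqlite3`; the `width_cm` attribute, its value, and the helper `get_value` are hypothetical additions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entity    (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE attribute (id INTEGER PRIMARY KEY, entity_id INTEGER NOT NULL,
                            name TEXT NOT NULL);
    CREATE TABLE "value"   (id INTEGER PRIMARY KEY, attribute_id INTEGER NOT NULL,
                            "value" TEXT NOT NULL);

    INSERT INTO entity (id, name) VALUES (24, 'Bed');
    INSERT INTO attribute (id, entity_id, name) VALUES
        (123, 24, 'material'),
        (124, 24, 'width_cm');
    INSERT INTO "value" (id, attribute_id, "value") VALUES
        (999, 123, 'wood'),
        (1000, 124, '140');
""")


def get_value(entity_name, attribute_name):
    """Fetch one characteristic: three tables are joined for a single value."""
    row = conn.execute(
        """
        SELECT v."value"
        FROM entity e
        JOIN attribute a ON a.entity_id = e.id
        JOIN "value" v   ON v.attribute_id = a.id
        WHERE e.name = ? AND a.name = ?
        """,
        (entity_name, attribute_name),
    ).fetchone()
    return row[0] if row else None


material = get_value("Bed", "material")
width = get_value("Bed", "width_cm")

print(material)     # wood
print(repr(width))  # '140' (a string, not a number)
```

The width comes back as text, so every caller has to know which attributes are numeric and convert them, or the schema grows extra typed value tables and extra JOINs.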
  60. 60. Misc Antipatterns
  61. 61. Names Beyond Comprehension ● I saw the following table names in production: ○ marco2015 # Marco was the table’s creator ○ jan2015 # jan was the month ○ tmp_tmp_tmp_fix ○ tmp_fix_fix_fix # Because symmetry is cool I forgot many other examples because... “Ultimate horror often paralyses memory in a merciful way.” ― H.P. Lovecraft
  62. 62. Data in Metadata ● Antipattern: including data in table names ○ invoice_2020, invoice_2019, invoice_2018… ● Use a year column instead ● If the table is too big, there are other ways to contain the problem (partitioning)
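
With the year stored as data, a report across years is a single GROUP BY instead of a UNION over one table per year. A minimal sketch, assuming SQLite via Python's `sqlite3`; the `amount_cents` column, the index name and the sample invoices are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE invoice (
        id           INTEGER PRIMARY KEY,
        year         INTEGER NOT NULL,
        amount_cents INTEGER NOT NULL
    );
    CREATE INDEX idx_year ON invoice (year);

    INSERT INTO invoice (year, amount_cents) VALUES
        (2018, 1000), (2019, 2500), (2019, 500), (2020, 4000);
""")

# One query covers every year, including years that do not exist yet.
totals = conn.execute(
    "SELECT year, SUM(amount_cents) FROM invoice GROUP BY year ORDER BY year"
).fetchall()

print(totals)  # [(2018, 1000), (2019, 3000), (2020, 4000)]
```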
  63. 63. Bad Names in General ● A name should tell everyone what a table or column is ○ Even to new hires! ○ Even to you… in 5 years from now! ● Otherwise people have to look at other documentation sources ○ ….which typically don’t exist ● Names should follow a standard across all company databases ○ singular/plural, long/short names, ... ● So people don’t have to check exactly what a table or column is called
  64. 64. Thank you for listening! Telegram channel: open_source_databases