Database Design
most common pitfalls
€ whoami
● Federico Razzoli
● Freelance consultant
● Working with databases since 2000
hello@federico-razzoli.com
federico-razzoli.com
● I worked as a consultant for Percona and Ibuildings
(mainly MySQL and MariaDB)
● I worked as a DBA for fast-growing companies like
Catawiki, HumanState, TransferWise
Agenda
We will talk about…
● The most common design bad practices
● Information that is not easy to represent
● Relational model: why?
● Keys and indexes
● Data types
● Abusing NULL
● Hierarchies (trees)
● Lists
● Inheritance & polymorphism
● Heterogeneous rows
● Misc
Criteria
Criteria
● Queries should be fast
● Data structures should be reasonably simple
● Design must be reasonably extendable
Why Relational?
Specific Use Cases
● Some databases are designed for specific use cases
● In those cases, they may work much better than generic technologies
● Using them when not necessary may lead to running too many technologies
● A technology should only be introduced if our company has:
○ Skills
○ Knowledge necessary for troubleshooting
○ Backups
○ High Availability
○ ...
Relational is flexible
With the relational model we:
● Are sure that data is written correctly (transactions)
● Can make sure that data is valid (schema, integrity constraints)
● Design tables with access patterns in mind
● To run a query we initially didn’t consider, most of the time we can just add
an index
Flexibility example
CREATE TABLE user (
id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(100) NOT NULL,
surname VARCHAR(100) NOT NULL,
email VARCHAR(100) NOT NULL UNIQUE
);
SELECT * FROM user WHERE id = 24;
SELECT name, surname FROM user
WHERE email = 'picard@starfleet.earth';
CREATE INDEX idx_surname_name ON user (surname, name);
SELECT name, surname FROM user
WHERE surname LIKE 'B%'
ORDER BY surname, name;
When Relational is not a good fit
● Heterogeneous data (product catalogue)
● Searchable text
● Graphs
● …
However, for simple use cases relational databases include non-relational
features, like:
● JSON type and functions
● Arrays (PostgreSQL)
● Fulltext indexes
● ...
Keys and Indexes
Primary Key
● Column or set of columns that identifies each row (unique, not null)
● Usually you want to create an artificial column for this:
○ id
○ or uuid
Poor Primary Keys
● No primary key!
○ In MySQL this causes many performance problems
○ CDC applications need a way to identify each row
● Wrong columns
○ email
■ An email can change over time
■ An email address can be assigned to another person
■ The primary key is PII!
○ name (eg: city name, product name…)
■ Quite long, especially if it must be UTF-8
■ Certain names can change over time
○ timestamp
■ Multiple rows could be created at the same timestamp!
■ Long
○ ...
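A minimal sketch of the alternative, assuming a hypothetical customer table (not from the slides): an artificial id identifies the row, while email stays unique through a separate index and is free to change.

```sql
-- Hypothetical table, for illustration only
CREATE TABLE customer (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,  -- artificial key, never changes
    email VARCHAR(100) NOT NULL,
    UNIQUE INDEX unq_email (email)               -- uniqueness is still enforced
);

-- Changing the email does not touch any row that references customer.id
UPDATE customer SET email = 'new@example.com' WHERE id = 24;
```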
UNIQUE
● An index whose values are distinct, or NULL
● Could theoretically be a primary key, but it’s not
Poor UNIQUE keys
● Columns whose values will always be distinct, no matter if there is an index or
not
○ Enforcing uniqueness implies extra reads, possibly on disk
● Columns that could have duplicates, but they’re unlikely
○ timestamp
○ (last_name, first_name)
Foreign Keys
● References to another table (user.city_id -> city.id)
● In most cases they are bad for performance
● They create problems for operations (ALTER TABLE)
● In MySQL they are not compatible with some other features
○ They don’t activate triggers
○ Table partitioning
○ Tables not using InnoDB
○ Many bugs
Indexing Bad Practices
● Indexing all columns: it won’t work
● Multi-column indexes in random order
● Indexing columns with few distinct values (eg, boolean)
○ Unless you know what you’re doing
● Indexes contained in other indexes:
idx1 (email), idx2 (email, last_name)
idx (email, id)
UNIQUE unq1 (email), INDEX idx1 (email)
● Non-descriptive index names (like the ones above)
Looking at an index name (EXPLAIN),
I should know which columns it contains
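A sketch of cleaning up a redundant pair like the one above, assuming a hypothetical user table with email and last_name columns (MySQL 5.7+ syntax). The index on (email) is a prefix of the index on (email, last_name), so the shorter one can usually go, and the survivor gets a name that lists its columns.

```sql
-- idx1 (email) is a left prefix of idx2 (email, last_name):
-- queries filtering by email alone can use idx2
ALTER TABLE user DROP INDEX idx1;

-- Rename the remaining index so EXPLAIN output is self-explanatory
ALTER TABLE user RENAME INDEX idx2 TO idx_email_last_name;
```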
Quick hints
● Learn how indexes work
○ Google: Federico Razzoli indexes bad practices
● Use pt-duplicate-key-checker, from Percona Toolkit
Data Types
Integer Types
● Don’t use bigger types than necessary
● ...but don’t overoptimise when you are not 100% sure. You’ll hardly see a
benefit using TINYINT instead of SMALLINT
● MySQL UNSIGNED is good: the column’s maximum value doubles
● I discourage the use of exotic MySQL syntax like:
○ MEDIUMINT: non-standard, and 3-byte variables don’t exist in nature
○ INT(length)
○ ZEROFILL
Real Numbers
● FLOAT and DOUBLE are fast when aggregating many values
● But they are subject to approximation. Don’t use them for prices, etc
● Instead you can use:
○ DECIMAL
○ INT - Multiply a number by 100, for example
● DECIMAL is slower if heavy arithmetic is performed on many values
● But storing a transformed value (price*100) can lead to
misunderstandings and bugs
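A small sketch of the difference, assuming MySQL and a hypothetical product_price table. In MySQL a numeric literal with an exponent is a DOUBLE, while a literal without one is an exact DECIMAL, so the two comparisons below are not evaluated the same way.

```sql
-- Hypothetical table: exact prices with two decimal digits
CREATE TABLE product_price (
    product_id INT UNSIGNED PRIMARY KEY,
    price DECIMAL(10, 2) NOT NULL
);

-- DOUBLE arithmetic: subject to approximation
SELECT 0.1E0 + 0.2E0 = 0.3E0 AS double_equal;

-- DECIMAL arithmetic: exact
SELECT 0.1 + 0.2 = 0.3 AS decimal_equal;
```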
Text Values
● Be sure that VARCHAR columns have adequate size for your data
● In PostgreSQL there is no difference between VARCHAR and TEXT, except
that for VARCHAR you specify a max size
● In MySQL TEXT and BLOB columns are stored separately
○ Less data read if you often don’t read those columns
○ More read operations if you always use SELECT *
● CHAR is only good for small fixed-size data. The space saving is tiny.
Temporal Types
● TIMESTAMP and DATETIME are mostly interchangeable
● MySQL YEAR is weird. The meaning of 2-digit values changes over time. Use
SMALLINT instead.
● MySQL TIME is apparently weird and useless. But not if you consider it as an
interval. (range: -838:59:59 .. 838:59:59)
● PostgreSQL has a proper INTERVAL type, which is surely better
● PostgreSQL accepts a time zone with each value (TIMESTAMP WITH TIME
ZONE); the value is converted to UTC when stored
○ Time zones depend on politics, economy and religion. Offsets may differ by 15
minutes. Time zones are created, dismissed, and changed. In one case a
time zone was changed by skipping a whole calendar day.
○ Never deal with time zones yourself: no one in history has ever succeeded.
Store all dates as UTC, use an external library for conversion.
ENUM, SET
● MySQL weird types that include a list of allowed string values
● With ENUM, exactly one value from the list is allowed
● With SET, any number of values from the list are allowed
● '' is always allowed, because.
● Specifying the value by index is allowed (indexes start at 1), so the number 2 could match the string '1'
● Adding, dropping and changing values requires an ALTER TABLE
○ And possibly a locking table rebuild
Instead of ENUM
CREATE TABLE account (
state ENUM('active', 'suspended') NOT NULL,
...
);
Instead of ENUM
CREATE TABLE account (
state_id INT UNSIGNED NOT NULL,
...
);
CREATE TABLE state (
id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
state VARCHAR(100) NOT NULL UNIQUE
);
INSERT INTO state (state) VALUES ('active'), ('suspended');
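A sketch of how the lookup table is used, assuming the account and state tables above. The id value 3 and the 'closed' state are made up for illustration.

```sql
-- Accounts in a given state: one JOIN on the lookup table
SELECT a.*
FROM account a
INNER JOIN state s ON s.id = a.state_id
WHERE s.state = 'active';

-- A new allowed value is a plain INSERT, not an ALTER TABLE
INSERT INTO state (id, state) VALUES (3, 'closed');
```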
Abusing NULL
NULL anomalies
mysql> SELECT
NULL = 1 AS a,
NULL <> 1 AS b,
NULL IS NULL AS c,
1 IS NOT NULL AS d;
+------+------+---+---+
| a | b | c | d |
+------+------+---+---+
| NULL | NULL | 1 | 1 |
+------+------+---+---+
-- This returns TRUE in MySQL:
NULL <=> NULL AND 1 <=> 1
Problematic queries
These conditions will not match rows where year or approved is NULL
● WHERE year != 1994
● WHERE NOT (year = 1994)
● WHERE year > 2000
● WHERE NOT (year > 2000)
● WHERE approved != TRUE
● WHERE NOT approved
And this expression returns NULL when year is NULL:
SELECT CONCAT(year, ' years old') FROM user ...
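If the NULL rows are wanted, the conditions have to say so explicitly. A sketch in MySQL syntax, assuming a user table with id, year and approved columns (the id column and the 'unknown' placeholder are assumptions):

```sql
-- Rows where year is not 1994, including rows where year is NULL
SELECT id FROM user WHERE year <> 1994 OR year IS NULL;

-- MySQL only: same result with the NULL-safe equality operator
SELECT id FROM user WHERE NOT (year <=> 1994);

-- Rows that are not approved, including rows where approved is NULL
SELECT id FROM user WHERE approved IS NOT TRUE;

-- Replace NULL with a readable default before concatenating
SELECT CONCAT(COALESCE(year, 'unknown'), ' years old') FROM user;
```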
Bad Reasons for NULL
● Because columns are NULLable by default
● To indicate that a value doesn’t exist
○ Use a special value instead: '' or -1 or 0 or …
○ But this is not always a bad reason: UNIQUE allows multiple NULLs
● Using your tables as spreadsheets
Spreadsheet Example
CREATE TABLE user (
id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
first_name VARCHAR(100) NOT NULL,
last_name VARCHAR(100) NOT NULL,
email VARCHAR(100) NOT NULL,
-- if a user may have multiple URLs, let’s move them
-- to a separate table:
-- url { id, user_id, url }
url_1 VARCHAR(100),
url_2 VARCHAR(100),
url_3 VARCHAR(100),
url_4 VARCHAR(100),
url_5 VARCHAR(100)
);
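A sketch of the separate table suggested in the comment above. Column sizes and the index name are assumptions.

```sql
CREATE TABLE url (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    user_id INT UNSIGNED NOT NULL,
    url VARCHAR(100) NOT NULL,
    INDEX idx_user_id (user_id)
);

-- All URLs of one user: any number of them, and no NULL columns
SELECT url FROM url WHERE user_id = 24;
```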
Spreadsheet Example
CREATE TABLE user (
id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
first_name VARCHAR(100) NOT NULL,
last_name VARCHAR(100) NOT NULL,
email VARCHAR(100) NOT NULL,
-- if we may have users’ bank data or not,
-- let’s move them to another table:
-- bank { user_id, account_no, account_holder, ... }
bank_account_no VARCHAR(50),
bank_account_holder VARCHAR(100),
bank_iban VARCHAR(100),
bank_swift_code VARCHAR(11)
);
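A sketch of the bank table hinted at in the comment above, assuming at most one bank record per user. Column sizes are assumptions.

```sql
CREATE TABLE bank (
    user_id INT UNSIGNED PRIMARY KEY,      -- at most one row per user
    account_no VARCHAR(50) NOT NULL,
    account_holder VARCHAR(100) NOT NULL,
    iban VARCHAR(100) NOT NULL,
    swift_code VARCHAR(11) NOT NULL
);

-- Users without bank data simply have no row in bank
SELECT u.id, u.email, b.iban
FROM user u
LEFT JOIN bank b ON b.user_id = u.id;
```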
Hierarchies
Category Hierarchies
Antipattern: column-per-level
TABLE product (id, category_name, subcategory_name, name, price, ..)
-----
TABLE category (id, name)
TABLE product (id, category_id, subcategory_id, name, price, ...)
Possible problems:
● To add or delete a level, we need to add or drop a column
● A subcategory can be erroneously linked to multiple categories
● A category can be erroneously used as subcategory, and vice versa
Category Hierarchies
A better way:
TABLE category (id, parent_id, name)
TABLE product (id, category_id, name, price, ...)
Possible problems:
● Circular dependencies (must be prevented at application level)
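With a parent_id column, a whole subtree can be read with a recursive query. A sketch assuming the category table above and a root category with id 1 (a made-up value). Recursive CTEs need MySQL 8.0+, MariaDB 10.2.2+ or PostgreSQL.

```sql
-- All descendants of category 1, including the category itself
WITH RECURSIVE subtree AS (
    SELECT id, parent_id, name
    FROM category
    WHERE id = 1
    UNION ALL
    SELECT c.id, c.parent_id, c.name
    FROM category c
    INNER JOIN subtree s ON c.parent_id = s.id
)
SELECT id, name FROM subtree;
```

If the data contains a cycle, the recursion does not end on its own: in MySQL the query fails once cte_max_recursion_depth is reached.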
Category Networks
What if every category can have multiple parents?
Antipattern:
TABLE category (id, parent_id1, parent_id2, name)
Category Graphs
If every category can have multiple parents, correct pattern:
TABLE category (id, name)
TABLE category_relationship (parent_id, child_id)
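A sketch of the relationship table, with assumed column types and index names. The composite primary key stores each link only once, and the second index serves lookups from child to parents.

```sql
CREATE TABLE category_relationship (
    parent_id INT UNSIGNED NOT NULL,
    child_id INT UNSIGNED NOT NULL,
    PRIMARY KEY (parent_id, child_id),   -- each parent/child link is stored once
    INDEX idx_child_id (child_id)        -- to find the parents of a category
);

-- Parents of category 42 (made-up id)
SELECT c.id, c.name
FROM category c
INNER JOIN category_relationship r ON r.parent_id = c.id
WHERE r.child_id = 42;
```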
Antipattern: Parent List
Storing the list of ancestors in a column is another antipattern:
TABLE category (id, name, parent_list)
INSERT INTO category (parent_list, name) VALUES
('sports/football/wear', 'football shoes');
● This antipattern is sometimes used because it simplifies certain aspects
● But it overcomplicates other aspects
● Also, until recently MySQL and MariaDB did not support recursive queries,
but now they do
Storing Lists
Tags Column
● Suppose you want to store user-typed tags for posts
● You may be tempted to:
CREATE TABLE post (
id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
tags VARCHAR(200)
);
INSERT INTO post (tags, ... ) VALUES ('sunday,venus', ... );
Tags Column
● But what about this query?
SELECT id FROM post WHERE tags LIKE '%sun%';
● Mmm, maybe this is better:
INSERT INTO post (tags, ... ) VALUES (',events,diversity,', ... );
SELECT id FROM post WHERE tags LIKE '%,sun,%';
However, this query cannot take advantage of indexes
Tag Table
CREATE TABLE post (
id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
...
);
CREATE TABLE tag (
post_id INT UNSIGNED,
tag VARCHAR(50),
PRIMARY KEY (post_id, tag),
INDEX (tag)
);
It works.
Queries will be able to use indexes.
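Two example queries against the tag table above. The tag values are made up.

```sql
-- Posts tagged 'sun': an exact match that can use the index on tag
SELECT post_id FROM tag WHERE tag = 'sun';

-- Posts that have both tags; the primary key (post_id, tag)
-- guarantees that each tag is counted once per post
SELECT post_id
FROM tag
WHERE tag IN ('sun', 'venus')
GROUP BY post_id
HAVING COUNT(*) = 2;
```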
Tag Array
-- PostgreSQL
CREATE TABLE post (
id SERIAL PRIMARY KEY,
tags TEXT[]
);
CREATE INDEX idx_tags ON post USING GIN (tags);
-- MySQL 8.0.17+ (multi-valued index)
CREATE TABLE post (
id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
tags JSON DEFAULT (JSON_ARRAY()),
INDEX idx_tags ((CAST(tags AS CHAR(50) ARRAY)))
);
-- MariaDB can store JSON arrays,
-- but since it cannot index them this solution is not viable
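A sketch of how such columns are typically queried, assuming the post tables above. Whether the index is actually used depends on the server version and on how the index was defined.

```sql
-- PostgreSQL: posts whose tags array contains 'sun' (can use a GIN index)
SELECT id FROM post WHERE tags @> ARRAY['sun'];

-- MySQL 8.0.17+: posts whose JSON array contains 'sun'
SELECT id FROM post WHERE 'sun' MEMBER OF (tags);
```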
Inheritance
And
Polymorphism
Not So Different Entities
● Your DB has users, landlords and tenants
● Separate entities with different info
● But sometimes you treat them as one thing
● What to do?
Inheritance
● In the simplest case, they are just subclasses
● For example, landlords and tenants could be types of users
● Common properties are in the parent class
-- relational way to represent it:
TABLE user (id, first_name, last_name, email)
TABLE landlord (id, user_id, vat_number)
TABLE tenant (id, user_id, landlord_id)
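Reading a "subclass" together with its parent properties is then a single JOIN. A sketch in MySQL syntax, assuming the tables above:

```sql
-- Landlords with the properties they inherit from user
SELECT u.first_name, u.last_name, u.email, l.vat_number
FROM landlord l
INNER JOIN user u ON u.id = l.user_id;
```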
PostgreSQL allows doing this in a more object-oriented way, with Table Inheritance
Different Entities
● But sometimes it’s better to consider them different entities
● Antipattern: Union View
CREATE VIEW everyone AS
(SELECT id, first_name, last_name FROM landlord)
UNION
(SELECT id, first_name, last_name FROM tenant)
;
This makes some queries less verbose, at the cost of making them
potentially very slow
Uniqueness Across Tables /1
● But maybe both landlords and tenants have emails,
and we want to make sure they are UNIQUE
● Question: is there a practical reason?
Uniqueness Across Tables /2
● If it is necessary, you’re thinking about the problem in the wrong way
● If emails need to be unique, they are a whole entity, so you’ll guarantee
uniqueness on a single table
TABLE landlord (id, first_name, last_name, vat_number)
TABLE tenant (id, first_name, last_name, landlord_id)
TABLE email (id, email UNIQUE, landlord_id, tenant_id)
Bloody hell! The solution initially looks great, but linking emails to landlords or
tenants in that way is horrific!
Uniqueness Across Tables /2bis
Why?
● Cannot build foreign keys (I don’t recommend it, but…)
● If in the future we want to link emails to suppliers, employees, etc, we’ll need
to add columns to the table
Uniqueness Across Tables /3
Even if we keep the landlord and tenant tables separated,
we can create a superset called person.
We decided it’s not a parent class, so it can just have an id column.
Every landlord, tenant and email is linked to a person.
TABLE landlord (id, person_id, first_name, last_name, vat_number)
TABLE tenant (id, person_id, first_name, last_name, landlord_id)
TABLE person (id)
TABLE email (id, person_id, email UNIQUE)
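A sketch of the person and email tables, with assumed column types and index names. The single UNIQUE index covers the emails of landlords and tenants alike.

```sql
CREATE TABLE person (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY
);

CREATE TABLE email (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    person_id INT UNSIGNED NOT NULL,
    email VARCHAR(100) NOT NULL,
    UNIQUE INDEX unq_email (email),      -- unique across every kind of person
    INDEX idx_person_id (person_id)
);

-- Emails of tenant 7 (made-up id)
SELECT e.email
FROM tenant t
INNER JOIN email e ON e.person_id = t.person_id
WHERE t.id = 7;
```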
Heterogeneous Rows
Catalog of Products
Imagine we have a catalogue of products where:
● Every product has certain common characteristics
● It’s important to be able to run queries on all products
○ SELECT id FROM p WHERE qty = 0;
○ SELECT MAX(price) FROM p GROUP BY vendor;
● Each product type also has a unique set of characteristics
Antipattern: Spreadsheet Table
● Keep all products in the same table
● Add a column for every characteristic that applies to at least one product
● Where a column doesn’t make sense, set to NULL
Problems:
● Too many columns and indexes
○ Generally bad for query performance, especially INSERTs
○ Generally bad for operations (repair, backup, restore, ALTER TABLE…)
● Adding/removing a product type means adding/removing a set of columns
○ But in practice columns will hardly be removed and will remain unused
● NULL means both “no value for this product” and “doesn’t apply to this type of
products”, leading to endless confusion
Antipattern: Table per Type
● Store products of different types in different tables
Problems:
● Data becomes metadata
○ How to get the list of product types?
● Some queries become overcomplicated
○ Get the ids of out-of-stock products
○ Most expensive product for each vendor
Hybrid
● A single table for characteristics common to all product types
● A separate table per product type, for non-common characteristics
Problems:
● Many JOINs
● Adding/removing product types means adding/removing tables
Semi-Structured Data
● A single table for all products
● A regular column for each column common to all product types
● A semi-structured column for all type-specific characteristics
○ JSON, HStore…
○ Not arrays
○ Not CSV
● Proper indexes on semi-structured data (depending on your technology)
Problems:
● Still a big table
● Queries on semi-structured data may be complicated and not supported by
ORMs
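A sketch of this layout in MySQL 5.7+ syntax. Column names and the "material" attribute are assumptions; the generated column is one common way to index a single key of the JSON document.

```sql
CREATE TABLE product (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    type VARCHAR(50) NOT NULL,
    vendor VARCHAR(100) NOT NULL,
    price DECIMAL(10, 2) NOT NULL,
    qty INT UNSIGNED NOT NULL,
    attributes JSON NOT NULL,            -- type-specific characteristics
    -- value extracted from the JSON document, so that it can be indexed
    material VARCHAR(50) GENERATED ALWAYS AS (attributes->>'$.material') VIRTUAL,
    INDEX idx_material (material)
);

-- Queries on common characteristics stay simple
SELECT id FROM product WHERE qty = 0;

-- A type-specific characteristic, served by the generated column
SELECT id FROM product WHERE material = 'wood';
```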
Antipattern: Entity, Attribute, Value
TABLE entity (id, name)
TABLE attribute (id, entity_id, name)
TABLE value (id, attribute_id, value)
● Each product type is an entity
● Each type’s characteristics are stored in attribute
● Each product is a set of values
Example:
Entity { id: 24, name: "Bed" }
Attribute [ { id: 123, entity_id: 24, name: "material" }, ... ]
Value [ { id: 999, attribute_id: 123, value: "wood" } ]
Antipattern: Entity, Attribute, Value
Problems:
● We JOIN 3 tables every time we want to get a single value!
● All values must be treated as texts
○ Unless we create multiple value tables: int_value, text_value...
○ Which means, even more JOINs
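To make the cost concrete, here is a sketch of reading one characteristic with the three tables above, using the "Bed" / "material" example (MySQL syntax):

```sql
-- One characteristic, three tables
SELECT v.value
FROM entity e
INNER JOIN attribute a ON a.entity_id = e.id
INNER JOIN value v ON v.attribute_id = a.id
WHERE e.name = 'Bed'
  AND a.name = 'material';
```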
Misc Antipatterns
Names Beyond Comprehension
● I saw the following table names in production:
○ marco2015 # Marco was the table’s creator
○ jan2015 # jan was the month
○ tmp_tmp_tmp_fix
○ tmp_fix_fix_fix # Because symmetry is cool
I forgot many other examples because...
“Ultimate horror often paralyses memory in a merciful way.”
― H.P. Lovecraft
Data in Metadata
● Include data in table names
○ invoice_2020, invoice_2019, invoice_2018…
● Use a year column instead
● If the table is too big, there are other ways to contain the problem
(partitioning)
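A sketch of one invoice table partitioned by year, in MySQL syntax. The column names (invoice_year, amount) and partition names are assumptions.

```sql
CREATE TABLE invoice (
    id INT UNSIGNED AUTO_INCREMENT,
    invoice_year SMALLINT UNSIGNED NOT NULL,
    amount DECIMAL(10, 2) NOT NULL,
    -- MySQL requires the partitioning column in every unique key
    PRIMARY KEY (id, invoice_year)
)
PARTITION BY RANGE (invoice_year) (
    PARTITION p2018 VALUES LESS THAN (2019),
    PARTITION p2019 VALUES LESS THAN (2020),
    PARTITION p2020 VALUES LESS THAN (2021),
    PARTITION pmax VALUES LESS THAN MAXVALUE
);

-- Only the partition that holds 2019 needs to be read
SELECT SUM(amount) FROM invoice WHERE invoice_year = 2019;
```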
Bad Names in General
● A name should tell everyone what a table or column is
○ Even to new hires!
○ Even to you… in 5 years from now!
● Otherwise people have to look at other documentation sources
○ ….which typically don’t exist
● Names should follow a standard across all company databases
○ singular/plural, long/short names, ...
● So people don’t have to check what a table or column is called exactly
Thank you for listening!
federico-razzoli.com/services
Telegram channel:
open_source_databases
