This document discusses various MySQL performance optimization techniques, including:
- Choosing between the InnoDB and MyISAM storage engines, with InnoDB generally recommended due to its transactional capabilities and row-level locking.
- Selecting optimal data types to minimize storage size and improve indexing and query performance.
- Considering whether to normalize or denormalize database schemas based on query patterns to reduce the need for joins or minimize data duplication respectively.
- Using summary/cache tables to pre-aggregate data and improve performance of analytical queries that involve expensive joins across multiple tables.
- Understanding the EXPLAIN output to analyze indexes used, table access methods, and ways to improve queries by adding appropriate indexes.
MySQL Performance Optimization
1. MySQL Performance
Optimization
Part I
Abhijit Mondal
Software Engineer at HolidayIQ
2. Contents
InnoDB or MyISAM or is there something better ?
Choosing optimal data types
Normalization vs. Denormalization
Cache and Summary Tables
Explaining “EXPLAIN”
3. InnoDB vs. MyISAM
● “You should use InnoDB for your tables unless you have a compelling need to use a
different engine” - High Performance MySQL by Baron Schwartz, Peter Zaitsev and Vadim Tkachenko
● InnoDB :
Pros-
1. Row-level locking, enabling insert and update queries to scale. The whole
table is not locked when one client writes to selected rows; only those rows (and any
gaps between them, to prevent “phantom rows”) are locked.
2. Clustering by primary key for faster lookups and ordering.
3. High concurrency.
4. Transactional, crash-safe, better online backup capability.
5. Adaptive Hash index construction from B Tree indexes for faster
lookups from main memory.
Cons-
1. Slower writes (insert, update queries).
2. Slower BLOB handling.
3. COUNT(*) queries require full table scans.
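A hedged sketch of how the engine choice is expressed in practice (the table and column names here are illustrative, not from the deck):

```sql
-- Create a table explicitly on InnoDB (the default engine since MySQL 5.5).
CREATE TABLE review (
    review_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    user_id   INT UNSIGNED NOT NULL,
    review    TEXT
) ENGINE=InnoDB;

-- Check which engine an existing table uses.
SHOW TABLE STATUS LIKE 'review';

-- Convert an existing MyISAM table to InnoDB.
ALTER TABLE review ENGINE=InnoDB;
```

Note that ALTER TABLE ... ENGINE rewrites the whole table, so converting a large table should be scheduled carefully.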
4. InnoDB vs. MyISAM
● MyISAM :
Pros-
1. Faster reads and writes for small to medium sized tables.
2. COUNT(*) queries are fast: a separate field keeps track of the number of
rows.
3. Better for full-text searching, though InnoDB can now be paired with Sphinx
for full-text search.
Cons-
1. Non-transactional; risk of data loss during crashes.
2. Table-level locking: the entire table is locked for reads and writes,
though rows can still be inserted while a SELECT query is being processed.
3. Insert and update queries do not scale; concurrency issues.
● Memory engine for temporary tables: hash indexes for faster SELECT queries on
temporary tables. All data is stored in memory and is lost after a server restart. Example
usage – mapping cities/attractions to regions/countries, caching data, temporary
summary tables for joins.
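A minimal sketch of the Memory-engine usage described above (table and column names are illustrative):

```sql
-- A small lookup table held entirely in memory.
-- MEMORY tables use HASH indexes by default, which suit equality lookups.
CREATE TABLE city_region (
    city_id   SMALLINT UNSIGNED NOT NULL,
    region_id SMALLINT UNSIGNED NOT NULL,
    PRIMARY KEY (city_id) USING HASH
) ENGINE=MEMORY;
```

Since the contents vanish on restart, such tables must be repopulated when the server comes back up.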
5. Choosing the optimal data type
● Always choose the smallest data type that is large enough for the largest value it
will represent. Smaller data types take up less space in memory and in the CPU cache.
● Given a choice between integer and character types, prefer integers: character
sets and collation rules make character comparisons more complicated.
● Unless there is a requirement to store NULL values in a field, always declare it NOT
NULL. NULL values make index construction, index statistics and value comparisons
more complicated. They also require more space: when a nullable column is indexed,
it requires an extra byte per entry. InnoDB handles NULL better (only a single bit)
than MyISAM.
● TIMESTAMP vs. DATETIME: TIMESTAMP takes half as much space (4 bytes) as
DATETIME (8 bytes) and also has auto updating feature.
● Use UNSIGNED integer types for AUTO_INCREMENT primary key fields
(unless negative integers are explicitly required). For storing cities in India (around
1000 cities) use SMALLINT UNSIGNED, which takes values from 0 to 65535, enough
to hold all the cities in India, whereas INT (as per the current implementation) uses
32 bits compared to SMALLINT's 16 bits.
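The points above can be combined into one sketch (table and column names are assumptions for illustration):

```sql
-- Smallest types that fit the data: SMALLINT UNSIGNED covers ~1000 cities,
-- and TIMESTAMP is 4 bytes and can auto-update on every modification.
CREATE TABLE city (
    city_id    SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, -- 0..65535
    city_name  VARCHAR(64) NOT NULL,
    updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
                         ON UPDATE CURRENT_TIMESTAMP
);
```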
6. Choosing the optimal data type
● VARCHAR vs. CHAR : VARCHAR is a variable-length data type while CHAR is
fixed-length. For shorter strings VARCHAR saves space, but updated rows
may grow or shrink depending on the new value. VARCHAR uses 1 extra byte to
store the length of the value if the maximum length is 255 bytes or less, else it uses 2
additional bytes. VARCHAR is therefore best suited for columns that are not updated
frequently, since every update requires a dynamic size adjustment.
● VARCHAR is suitable for storing city/state/country/region/attraction names, as these
values are rarely updated. CHAR may be more suitable for storing MD5 hashes of
passwords (fixed length) or user names/activities (updated and inserted frequently).
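A brief sketch of the CHAR/VARCHAR split suggested above (table and column names are illustrative):

```sql
CREATE TABLE account (
    user_name    CHAR(16)    NOT NULL, -- short, updated/inserted frequently
    password_md5 CHAR(32)    NOT NULL, -- an MD5 hex digest is always 32 chars
    city_name    VARCHAR(64) NOT NULL  -- variable length, rarely updated
);
```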
7. Choosing the optimal data type
● ENUM : for storing string values with a small, fixed sample space, use the
ENUM data type. E.g. gender (M or F), is_active (1 or 0), day of week, etc.
Create table activity(id int unsigned not null auto_increment primary key,
activity varchar(20), day_of_week enum('sun','mon','tue','wed','thu','fri','sat'));
ENUM values are stored as integers (1–2 bytes) in the table, hence comparisons are
faster and take less space.
But joins between an ENUM column and a VARCHAR or CHAR column are less
efficient, as the ENUM must first be converted to a string before the comparison is done.
● BLOB and TEXT fields cannot be fully indexed; only a prefix of the value can be indexed.
● Use SET to combine many true/false values into a single column.
Create table test1(perms set('can_read','can_write','can_delete'));
Insert into test1(perms) values('can_read,can_delete');
Select perms from test1 where find_in_set('can_delete',perms);
● Identifier columns used in table joins should have the same data type, to improve
performance by avoiding type conversion.
Select count(*) from destination join attractions using(destinationid, active,
countryid);
8. Normalization vs. Denormalization
● Normalization :
Pros :
1. Normalized updates are usually faster than denormalized updates.
2. No duplicate data so there is less data to change.
3. Tables are usually smaller so they fit better in memory and perform better.
4. Lack of redundant data means less need for GROUP BY or DISTINCT
queries.
Cons :
1. JOINs are required to retrieve values from normalized tables. This is usually
expensive, and the same query would have benefited from an index on a denormalized table.
● E.g. find users and their reviews where the review was given between 4th March and
30th June, ordered by the user's age. An expensive join is required.
SELECT u.user_name, r.review from user u join review r using(user_id) where
r.date_reviewed between '2012-03-04' and '2012-06-30' order by u.age limit 100;
9. Normalization vs. Denormalization
● Denormalization :
Pros :
1. No JOINs are required for denormalized data. Even a full table scan without
an index is still faster than a join whose data doesn't fit into memory.
Cons :
1. Duplicate data issues arise. A denormalized table has large rows that are
almost identical except for a single column; this happens when a many-to-many
relation is mapped in a single table.
2. Inconsistencies in data may arise during updates, and updates are
expensive.
● E.g. find users and their reviews where the review was given between 4th March and
30th June, ordered by the user's age. An index on (date_reviewed, age) will greatly
increase the performance of this query.
SELECT user_name, review from user_review where date_reviewed between
'2012-03-04' and '2012-06-30' order by age limit 100;
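The composite index mentioned in this example could be created as follows (column names taken from the query above):

```sql
-- Composite index from the example, used to resolve the date-range filter.
ALTER TABLE user_review ADD INDEX date_age_index (date_reviewed, age);
```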
10. Normalization vs. Denormalization
● When the same tables are joined frequently in queries, it is better to denormalize one of
the tables by duplicating data from the other. Inserts, updates and deletes can be
kept consistent by creating triggers on one of them. For example, in the case of the user and
reviews tables, copy review, review_id and date_reviewed from reviews into user, then
create triggers for insert, update and delete on the reviews table.
● DELIMITER #
CREATE TRIGGER `after_insert_in_reviews` after insert on reviews
FOR EACH ROW BEGIN
INSERT INTO user(user_id, review_id, review, date_reviewed)
values(NEW.user_id, NEW.review_id, NEW.review, NEW.date_reviewed);
END#
DELIMITER ;
● DELIMITER #
CREATE TRIGGER `after_delete_in_reviews` after delete on reviews
FOR EACH ROW BEGIN
DELETE FROM user where review_id=OLD.review_id;
END#
DELIMITER ;
11. Summary and Cache Tables
● Consider the situations:
1. There are 3 tables: user, reviews and destination. We want to analyze the
number of reviews for all destinations in a particular city, for a particular user age
range, grouped by destination in one case and by user gender in the other. So we
write the two queries as:
● SELECT destination.destname, count(review.review_id) as review_count from user
join reviews join destination where user.age between 20 and 30 and
user.userid=reviews.user_id and reviews.destination_id=destination.destid and
destination.city='Bangalore' group by destination.destid;
● SELECT user.gender, count(review.review_id) as review_count from user join
reviews join destination where user.age between 20 and 30 and
user.userid=reviews.user_id and reviews.destination_id=destination.destid and
destination.city='Bangalore' group by user.gender;
● Instead of doing expensive joins on 3 large tables every time only the grouping of the
summarized data differs, we can create a summary table and update it periodically using a
cron job.
12. Summary and Cache Tables
● CREATE table user_rev_dest_summary SELECT * from user join reviews join
destination where user.userid=reviews.user_id and
reviews.destination_id=destination.destid;
● ALTER table user_rev_dest_summary add index city_index(age, city);
● SELECT dest_name,count(review_id) as review_count from
user_rev_dest_summary where age between 20 and 30 and city='Bangalore' group
by destid;
● SELECT gender,count(review_id) as review_count from user_rev_dest_summary
where age between 20 and 30 and city='Bangalore' group by gender;
● Using the summary table, query performance is greatly improved, but if the
user, destination or reviews tables are updated frequently our summary data may
become stale. So we need to decide at what interval to update the summary table.
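One hedged way to do the periodic rebuild without serving a half-built table is to rebuild into a shadow table and swap it in atomically with RENAME TABLE (a common pattern; table names follow the example above):

```sql
-- Rebuild the summary into a shadow table, then swap atomically.
CREATE TABLE user_rev_dest_summary_new LIKE user_rev_dest_summary;
INSERT INTO user_rev_dest_summary_new
    SELECT * FROM user JOIN reviews JOIN destination
    WHERE user.userid = reviews.user_id
      AND reviews.destination_id = destination.destid;
RENAME TABLE user_rev_dest_summary     TO user_rev_dest_summary_old,
             user_rev_dest_summary_new TO user_rev_dest_summary;
DROP TABLE user_rev_dest_summary_old;
```

Queries see either the old or the new summary, never a partially loaded one.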
14. Explaining “Explain”
● EXPLAIN output columns : Important columns are type, possible_keys, key, rows
and Extra.
● EXPLAIN extended select dest.`Destination_name`, attr.`attractionid`,
attr.`attractionname` from destination as dest,attractions as attr,hotels_by_locality
as hl where dest.`Destination_id`=attr.`destinationid` and dest.`CountryID`='1' and
dest.`other_destination`='0' and attr.`active`='1' and hl.`typeid`=attr.`attractionid`;
● Types of “type” : From best to worst
1. const - The table has at most one matching row, which is read at the start of
the query. const tables are very fast because they are read only once.
const is used when you compare all parts of a PRIMARY KEY or UNIQUE index
to constant values.
SELECT * FROM attractions WHERE attraction_id=8385;
15. Explaining “Explain”
● Types of “type” : contd.
2. eq_ref - One row is read from this table for each combination of rows from
the previous tables. It is used when all parts of an index are used by the join and the
index is a PRIMARY KEY or UNIQUE NOT NULL index.
SELECT * from resort join city using(CityID); (CityId is primary key of city).
3. ref - All rows with matching index values are read from this table for each
combination of rows from the previous tables.
SELECT * from resort join city using(StateID); (index on city.StateId but many
rows in city having same state id).
4. fulltext - The join is performed using a FULLTEXT index.
5. range - Only rows that are in a given range are retrieved, using an index to
select the rows. The key column in the output row indicates which index is used.
SELECT * from reviews where date_reviewed between '2012-06-30' and
'2012-08-07'; (N.B. the index on date_reviewed cannot be used if the column is
wrapped in a function such as DATE()).
16. Explaining “Explain”
● Types of “type” : contd.
6. index - This join type is the same as ALL, except that only the index tree is
scanned. This usually is faster than ALL because the index file usually is smaller
than the data file.
SELECT StateID from resort; (covering index on StateID).
7. ALL - A full table scan is done for each combination of rows from the
previous tables. Avoid this by adding index to the appropriate table.
● The common “Extra” 's :
1. Using filesort - MySQL must do an extra pass to find out how to retrieve the
rows in sorted order. The sort is done by going through all rows according to the
join type and storing the sort key and pointer to the row for all rows that match the
WHERE clause.
SELECT resort.Location from resort order by resort.StateID; (no index usable for
ordering by StateID).
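A hedged sketch of removing the filesort, assuming the resort table from the example: give the ORDER BY an index it can walk in order.

```sql
-- With an index whose leftmost column is StateID, rows can be read in
-- StateID order directly from the index, so EXPLAIN no longer shows
-- "Using filesort" ((StateID, Location) also makes it a covering index).
ALTER TABLE resort ADD INDEX state_loc_index (StateID, Location);
EXPLAIN SELECT Location FROM resort ORDER BY StateID;
```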
17. Explaining “Explain”
● The common “Extra” 's : contd.
2. Using index - The column information is retrieved from the table using only
information in the index tree without having to do an additional seek to read the
actual row. This strategy can be used when the query uses only columns that are
part of a single index. (covering indexes).
SELECT resort.StateID from resort order by resort.Destination_id; (index
on (Destination_id, StateID): StateID is read from the index alone, already
ordered by Destination_id).
3. Using temporary - To resolve the query, MySQL needs to create a temporary
table to hold the result. This typically happens if the query contains GROUP BY
and ORDER BY clauses that list columns differently.
4. Using where - A WHERE clause is used to restrict which rows to match
against the next table or send to the client. Even if you are using an index for all
parts of a WHERE clause, you may see Using where if the column can be NULL.
SELECT resort.Location from resort where StateID IS NOT NULL;
18. References
● High Performance MySQL by Baron Schwartz, Peter Zaitsev and Vadim
Tkachenko.
● http://net.tutsplus.com/tutorials/other/top-20-mysql-best-practices/
● http://www.mysqlperformanceblog.com/2009/01/12/should-you-move-from-myisam-to-in
● http://www.techrepublic.com/blog/10things/10-ways-to-screw-up-your-database-design/18
Thank You