MySQL Performance Optimization, Part I
Abhijit Mondal, Software Engineer at HolidayIQ

Contents
● InnoDB or MyISAM, or is there something better?
● Choosing optimal data types
● Normalization vs. Denormalization
● Cache and Summary Tables
● Explaining “EXPLAIN”

InnoDB vs. MyISAM
● “You should use InnoDB for your tables unless you have a compelling need to use a different engine” - High Performance MySQL by Baron Schwartz, Peter Zaitsev and Vadim Tkachenko
● InnoDB :
  Pros -
  1. Row-level locking, which lets insert and update queries scale. When one client writes to selected rows the whole table is not locked; only those rows are locked (plus, at the default REPEATABLE READ isolation level, the gaps between them, which prevents “phantom rows”).
  2. Clustering by primary key for faster primary key lookups and ordering.
  3. High concurrency.
  4. Transactional, crash-safe, better online backup capability.
  5. Adaptive hash index, built in memory on top of frequently used B-Tree index pages, for faster lookups.
  Cons -
  1. Writes (insert, update queries) can be slower than in MyISAM.
  2. Slower BLOB handling.
  3. COUNT(*) without a WHERE clause has to scan an index or the whole table, because InnoDB does not store an exact row count.

InnoDB vs. MyISAM (contd.)
● MyISAM :
  Pros -
  1. Faster reads and writes for small to medium sized tables.
  2. COUNT(*) without a WHERE clause is fast, because the engine keeps a separate counter of the number of rows.
  3. Better for full-text searching. InnoDB tables can be paired with an external engine such as Sphinx, and InnoDB supports FULLTEXT indexes natively from MySQL 5.6.
  Cons -
  1. Non-transactional; data can be lost or corrupted during crashes.
  2. Table-level locking. Reads take a shared lock and writes take an exclusive lock on the entire table, although rows can still be appended while a SELECT query is running (concurrent inserts).
  3. Insert and update queries do not scale under concurrency.
● MEMORY engine for temporary tables : Hash indexes for faster lookups. All data is stored in memory, so the rows (but not the table definition) are lost after a server restart. Example usage - mapping cities/attractions to regions/countries, caching data, temporary summary tables for joins.

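As a rough sketch of how the engine choice shows up in practice, the statements below create one table per engine, list the engine of every table in the current schema, and convert a table to InnoDB. The table and column names (`booking`, `booking_log`, `city_region_map`) are hypothetical and are not taken from the deck.

```sql
-- Hypothetical tables, for illustration only.

-- Transactional table with row-level locking.
CREATE TABLE booking (
    booking_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    user_id    INT UNSIGNED NOT NULL,
    created_at TIMESTAMP    NOT NULL DEFAULT CURRENT_TIMESTAMP
) ENGINE = InnoDB;

-- Non-transactional table with table-level locking.
CREATE TABLE booking_log (
    log_id  INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    message VARCHAR(255) NOT NULL
) ENGINE = MyISAM;

-- In-memory lookup table; rows disappear on server restart.
CREATE TABLE city_region_map (
    city_id   SMALLINT UNSIGNED NOT NULL,
    region_id SMALLINT UNSIGNED NOT NULL,
    PRIMARY KEY (city_id) USING HASH
) ENGINE = MEMORY;

-- Which engine does each table in the current schema use?
SELECT TABLE_NAME, ENGINE
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = DATABASE();

-- Convert an existing table; this rebuilds the whole table.
ALTER TABLE booking_log ENGINE = InnoDB;
```

Because the `ALTER TABLE` rebuilds the table, it is best tried on a copy first when the table is large.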
Choosing the optimal data type
● Always choose the smallest data type that is large enough for the largest value it has to represent. Smaller data types take up less space on disk, in memory and in the CPU cache.
● Given a choice between an integer and a character type, choose the integer: character sets and collations (sorting rules) make character comparisons more complicated.
● Unless a column really needs to store NULL, declare it NOT NULL. NULL values make index construction, index statistics and value comparisons more complicated, and they require more space. When a nullable column is indexed, it requires an extra byte per entry. InnoDB handles NULL better (a single bit) than MyISAM.
● TIMESTAMP vs. DATETIME : TIMESTAMP takes half as much space (4 bytes) as DATETIME (8 bytes) and can be updated automatically whenever the row changes. From MySQL 5.6.4 DATETIME takes 5 bytes plus fractional-seconds storage. TIMESTAMP only covers the years 1970 to 2038.
● Use UNSIGNED integer types for AUTO_INCREMENT primary key fields (unless negative integers are explicitly required). For storing cities in India (around 1000 cities) use SMALLINT UNSIGNED: its range of 0 to 65535 is enough to hold all the cities in India, and it uses 16 bits where INT would use 32 bits.

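The rules above can be sketched as a single table definition. The table `city_lookup` and its columns are hypothetical; the sizes in the comments are the documented storage sizes of each integer type.

```sql
-- Hypothetical lookup table for roughly 1000 Indian cities.
CREATE TABLE city_lookup (
    city_id    SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,  -- 2 bytes, range 0..65535
    state_id   TINYINT UNSIGNED  NOT NULL,                 -- 1 byte, range 0..255
    city_name  VARCHAR(50)       NOT NULL,                 -- NOT NULL: no NULL bookkeeping
    is_active  TINYINT UNSIGNED  NOT NULL DEFAULT 1,
    -- TIMESTAMP refreshed automatically on every change to the row.
    updated_at TIMESTAMP NOT NULL
               DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    PRIMARY KEY (city_id)
) ENGINE = InnoDB;

-- Before shrinking an existing column, check the values it actually holds.
SELECT MIN(city_id) AS smallest, MAX(city_id) AS largest
FROM city_lookup;
```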
Choosing the optimal data type (contd.)
● VARCHAR vs. CHAR : VARCHAR is a variable-length data type while CHAR is fixed-length. VARCHAR saves space when most values are much shorter than the maximum length, but a row may grow or shrink when it is updated. VARCHAR uses 1 extra byte to store the length of the value if the column's maximum length is 255 bytes or less, otherwise it uses 2 extra bytes. VARCHAR therefore suits columns that are not updated frequently, since every update that changes the length requires the row size to be adjusted.
● VARCHAR is suitable for storing city/state/country/region/attraction names, as these values are not updated much. CHAR may be suitable for storing MD5 password hashes (fixed length) or user names/activities (updated and inserted frequently).

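A minimal sketch of the CHAR and VARCHAR choices described above, assuming a hypothetical `user_account` table. The length-prefix comments hold for character sets of up to 4 bytes per character.

```sql
-- Hypothetical table, for illustration only.
CREATE TABLE user_account (
    user_id      INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    user_name    CHAR(20)     NOT NULL,  -- fixed width, updated often
    password_md5 CHAR(32)     NOT NULL,  -- an MD5 hex digest is always 32 characters
    home_city    VARCHAR(50)  NOT NULL,  -- at most 200 bytes: 1-byte length prefix
    about_me     VARCHAR(1000)           -- more than 255 bytes: 2-byte length prefix
) ENGINE = InnoDB;

-- CHAR_LENGTH counts characters, LENGTH counts bytes.
SELECT CHAR_LENGTH(MD5('secret')) AS digest_chars,
       LENGTH(MD5('secret'))      AS digest_bytes;
```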
Choosing the optimal data type (contd.)
● ENUM : For string values drawn from a small, fixed set, use the ENUM data type. E.g. gender (M or F), is_active (1 or 0), day of week etc.
  CREATE TABLE activity (id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, activity VARCHAR(20), day_of_week ENUM('sun','mon','tue','wed','thu','fri','sat'));
  ENUM values are stored as small integers (1 or 2 bytes, depending on the number of values), so they take less space and comparisons are faster. But joins between an ENUM column and a VARCHAR or CHAR column are less efficient, because the ENUM has to be converted to a string before the comparison is done.
● BLOB and TEXT columns cannot be indexed in full; an index on them must specify a prefix length.
● Use SET to combine many true/false values into a single column.
  CREATE TABLE test1 (perms SET('can_read','can_write','can_delete'));
  INSERT INTO test1 (perms) VALUES ('can_read,can_delete');
  SELECT perms FROM test1 WHERE FIND_IN_SET('can_delete', perms);
● Columns used to join tables should have the same data type in both tables, which improves performance by avoiding type conversion.
  SELECT COUNT(*) FROM destination JOIN attractions USING (destinationid, active, countryid);

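To illustrate the last two points, here is a small sketch with matching join column types and an index on a TEXT column. The `_demo` tables, the `description` column and the prefix length of 50 are assumptions made for the example.

```sql
-- Hypothetical tables: the join column has the same type on both sides.
CREATE TABLE destination_demo (
    destinationid INT UNSIGNED NOT NULL PRIMARY KEY,
    description   TEXT
) ENGINE = InnoDB;

CREATE TABLE attractions_demo (
    attractionid  INT UNSIGNED NOT NULL PRIMARY KEY,
    destinationid INT UNSIGNED NOT NULL,  -- same type as destination_demo.destinationid
    KEY idx_destination (destinationid)
) ENGINE = InnoDB;

-- A TEXT column can only be indexed through a prefix, here the first 50 characters.
ALTER TABLE destination_demo ADD INDEX idx_description (description(50));

-- No type conversion is needed to compare the two join columns.
SELECT COUNT(*)
FROM destination_demo
JOIN attractions_demo USING (destinationid);
```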
Normalization vs. Denormalization
● Normalization :
  Pros :
  1. Normalized updates are usually faster than denormalized updates.
  2. No duplicate data, so there is less data to change.
  3. Tables are usually smaller, so they fit better in memory and perform better.
  4. Lack of redundant data means less need for GROUP BY or DISTINCT queries.
  Cons :
  1. JOINs are required to retrieve values from normalized tables. They are usually expensive, and a query that filters on one table and sorts on another cannot be served by a single index, as it could on a denormalized table.
● E.g. find the users and their reviews where the review was given between 4th March and 30th June, ordered by the user's age. An expensive join is required.
  SELECT u.user_name, r.review FROM user u JOIN reviews r USING (user_id) WHERE r.date_reviewed BETWEEN '2012-03-04' AND '2012-06-30' ORDER BY u.age LIMIT 100;

Normalization vs. Denormalization (contd.)
● Denormalization :
  Pros :
  1. No JOINs are required for denormalized data. Even a full table scan without an index can be faster than a join that does not fit in memory.
  Cons :
  1. Duplicate data. A denormalized table has large rows that are almost identical except for a single column. This happens when a many-to-many relation is mapped in a single table.
  2. Updates are expensive and may leave the data inconsistent.
● E.g. find the users and their reviews where the review was given between 4th March and 30th June, ordered by the user's age. An index on (date_reviewed, age) lets MySQL serve the date filter from one index on one table, although ORDER BY age may still need a filesort because date_reviewed is a range condition.
  SELECT user_name, review FROM user_review WHERE date_reviewed BETWEEN '2012-03-04' AND '2012-06-30' ORDER BY age LIMIT 100;

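A minimal sketch of the denormalized table behind the query above. The deck does not give the definition of `user_review`, so the column types and the index name `idx_date_age` are assumptions.

```sql
-- Assumed shape of the denormalized table; column types are guesses.
CREATE TABLE user_review (
    review_id     INT UNSIGNED     NOT NULL PRIMARY KEY,
    user_id       INT UNSIGNED     NOT NULL,
    user_name     VARCHAR(50)      NOT NULL,  -- duplicated from the user table
    age           TINYINT UNSIGNED NOT NULL,  -- duplicated from the user table
    review        TEXT             NOT NULL,
    date_reviewed DATE             NOT NULL,
    KEY idx_date_age (date_reviewed, age)
) ENGINE = InnoDB;

-- Check which index the optimizer picks and whether a filesort remains.
EXPLAIN
SELECT user_name, review
FROM user_review
WHERE date_reviewed BETWEEN '2012-03-04' AND '2012-06-30'
ORDER BY age
LIMIT 100;
```

In the EXPLAIN output, the key column shows whether `idx_date_age` was chosen and the Extra column shows whether the sort still needs a filesort.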
Normalization vs. Denormalization (contd.)
● When the same tables are joined frequently in queries, it can be better to denormalize one of the tables by duplicating data from the other. Inserts, updates and deletes can be kept consistent by creating triggers on the source table. E.g. in the case of the user and reviews tables, copy review, review_id and date_reviewed from reviews to user. Then create triggers for insert, update and delete on the reviews table.
● DELIMITER $$
  CREATE TRIGGER `after_insert_in_reviews` AFTER INSERT ON reviews
  FOR EACH ROW
  BEGIN
    INSERT INTO user (user_id, review_id, review, date_reviewed) VALUES (NEW.user_id, NEW.review_id, NEW.review, NEW.date_reviewed);
  END$$
  DELIMITER ;
● DELIMITER $$
  CREATE TRIGGER `after_delete_in_reviews` AFTER DELETE ON reviews
  FOR EACH ROW
  BEGIN
    DELETE FROM user WHERE review_id = OLD.review_id;
  END$$
  DELIMITER ;

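The slide mentions an update trigger but shows only the insert and delete triggers. A possible counterpart is sketched below, assuming the copied columns in user are named review_id, review and date_reviewed as described above. The trigger name is hypothetical.

```sql
DELIMITER $$
-- Keep the copied review columns in user in step with changes to reviews.
CREATE TRIGGER `after_update_in_reviews` AFTER UPDATE ON reviews
FOR EACH ROW
BEGIN
    UPDATE user
    SET review        = NEW.review,
        date_reviewed = NEW.date_reviewed
    WHERE review_id = OLD.review_id;
END$$
DELIMITER ;
```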
Summary and Cache Tables
● Consider this situation: there are 3 tables for user, reviews and destination. We want to count the reviews for all destinations in a particular city, written by users in a particular age range, grouped in one case by destination and in another case by user gender. So we write the two queries as:
● SELECT destination.destname, COUNT(reviews.review_id) AS review_count FROM user JOIN reviews ON user.userid = reviews.user_id JOIN destination ON reviews.destination_id = destination.destid WHERE user.age BETWEEN 20 AND 30 AND destination.city = 'Bangalore' GROUP BY destination.destid;
● SELECT user.gender, COUNT(reviews.review_id) AS review_count FROM user JOIN reviews ON user.userid = reviews.user_id JOIN destination ON reviews.destination_id = destination.destid WHERE user.age BETWEEN 20 AND 30 AND destination.city = 'Bangalore' GROUP BY user.gender;
● Instead of repeating an expensive join on 3 large tables every time, when only the grouping of the summary differs, we can create a summary table and update it periodically using a cron job.

Summary and Cache Tables (contd.)
● CREATE TABLE user_rev_dest_summary SELECT u.userid, u.age, u.gender, r.review_id, d.destid, d.destname, d.city FROM user u JOIN reviews r ON u.userid = r.user_id JOIN destination d ON r.destination_id = d.destid;
● ALTER TABLE user_rev_dest_summary ADD INDEX city_index (city, age);
● SELECT destname, COUNT(review_id) AS review_count FROM user_rev_dest_summary WHERE age BETWEEN 20 AND 30 AND city = 'Bangalore' GROUP BY destid;
● SELECT gender, COUNT(review_id) AS review_count FROM user_rev_dest_summary WHERE age BETWEEN 20 AND 30 AND city = 'Bangalore' GROUP BY gender;
● With the summary table the queries no longer need a join, but if the user, destination or reviews tables are updated frequently the summary data may become stale. So we need to decide at what interval to rebuild the summary table.

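One possible way to do the periodic rebuild is to fill a shadow copy and swap it in with a single RENAME TABLE, so that readers never see a half-built summary. This is a sketch, not the author's method. It assumes the summary table holds exactly the seven columns listed below, and the `_new` and `_old` table names are hypothetical.

```sql
-- Clean up leftovers from an earlier, interrupted run.
DROP TABLE IF EXISTS user_rev_dest_summary_new, user_rev_dest_summary_old;

-- Empty copy with the same columns and indexes as the live summary table.
CREATE TABLE user_rev_dest_summary_new LIKE user_rev_dest_summary;

-- Rebuild the summary from the three source tables.
INSERT INTO user_rev_dest_summary_new
    (userid, age, gender, review_id, destid, destname, city)
SELECT u.userid, u.age, u.gender, r.review_id, d.destid, d.destname, d.city
FROM user u
JOIN reviews r     ON u.userid = r.user_id
JOIN destination d ON r.destination_id = d.destid;

-- Swap the tables in one atomic statement, then drop the stale copy.
RENAME TABLE user_rev_dest_summary     TO user_rev_dest_summary_old,
             user_rev_dest_summary_new TO user_rev_dest_summary;

DROP TABLE user_rev_dest_summary_old;
```

The script can be run from the cron job at whatever interval the data can tolerate being stale.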
Explaining “EXPLAIN”
● EXPLAIN output columns : The important columns are type, possible_keys, key, rows and Extra.
● EXPLAIN EXTENDED SELECT dest.`Destination_name`, attr.`attractionid`, attr.`attractionname` FROM destination AS dest, attractions AS attr, hotels_by_locality AS hl WHERE dest.`Destination_id` = attr.`destinationid` AND dest.`CountryID` = 1 AND dest.`other_destination` = 0 AND attr.`active` = 1 AND hl.`typeid` = attr.`attractionid`;
● Types of “type”, from best to worst :
  1. const - The table has at most one matching row, which is read at the start of the query. const tables are very fast because they are read only once. const is used when you compare all parts of a PRIMARY KEY or UNIQUE index to constant values.
     SELECT * FROM attractions WHERE attractionid = 8385; (attractionid is the primary key of attractions).

Explaining “EXPLAIN” (contd.)
● Types of “type” : contd.
  2. eq_ref - One row is read from this table for each combination of rows from the previous tables. It is used when all parts of an index are used by the join and the index is a PRIMARY KEY or UNIQUE NOT NULL index.
     SELECT * FROM resort JOIN city USING (CityID); (CityID is the primary key of city).
  3. ref - All rows with matching index values are read from this table for each combination of rows from the previous tables.
     SELECT * FROM resort JOIN city USING (StateID); (index on city.StateID, but many rows in city have the same StateID).
  4. fulltext - The join is performed using a FULLTEXT index.
  5. range - Only rows that are in a given range are retrieved, using an index to select the rows. The key column in the output row indicates which index is used.
     SELECT * FROM reviews WHERE date_reviewed BETWEEN '2012-06-30' AND '2012-08-07'; (P.S. an index on date_reviewed is not used when the column is wrapped in a function such as DATE(date_reviewed)).

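The note about DATE() can be illustrated by comparing two forms of the same filter. The sketch assumes that date_reviewed is a DATETIME column with an ordinary index on it.

```sql
-- Wrapping the column in a function hides it from the index on date_reviewed.
EXPLAIN
SELECT * FROM reviews
WHERE DATE(date_reviewed) = '2012-06-30';

-- The same day written as a half-open range on the bare column,
-- which leaves the index usable for a range scan.
EXPLAIN
SELECT * FROM reviews
WHERE date_reviewed >= '2012-06-30'
  AND date_reviewed <  '2012-07-01';
```

Comparing the type and key columns of the two EXPLAIN results shows whether the rewrite lets the optimizer use the index.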
Explaining “EXPLAIN” (contd.)
● Types of “type” : contd.
  6. index - This join type is the same as ALL, except that only the index tree is scanned. This is usually faster than ALL because the index file is usually smaller than the data file.
     SELECT StateID FROM resort; (covering index on StateID).
  7. ALL - A full table scan is done for each combination of rows from the previous tables. Avoid this by adding an index to the appropriate table.
● The common “Extra”s :
  1. Using filesort - MySQL must do an extra pass to find out how to retrieve the rows in sorted order. The sort is done by going through all rows according to the join type and storing the sort key and a pointer to the row for all rows that match the WHERE clause.
     SELECT resort.Location FROM resort ORDER BY resort.StateID; (no index returns the rows in StateID order cheaply, e.g. the index on StateID does not cover Location).

Explaining “EXPLAIN” (contd.)
● The common “Extra”s : contd.
  2. Using index - The column information is retrieved using only information in the index tree, without an additional seek to read the actual row. This strategy can be used when the query uses only columns that are part of a single index (a covering index).
     SELECT resort.StateID FROM resort ORDER BY resort.Destination_id; (index on (Destination_id, StateID); the rows are read in Destination_id order and StateID is taken from the index alone).
  3. Using temporary - To resolve the query, MySQL needs to create a temporary table to hold the result. This typically happens if the query contains GROUP BY and ORDER BY clauses that list columns differently.
  4. Using where - A WHERE clause is used to restrict which rows to match against the next table or send to the client. Even if you are using an index for all parts of a WHERE clause, you may see Using where if the column can be NULL.
     SELECT resort.Location FROM resort WHERE StateID IS NOT NULL;

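As a hedged sketch of how these Extra values can be explored, the statements below add a composite index and re-check two queries on the resort table. The index name is hypothetical, and the example assumes that Location is a short VARCHAR column (a TEXT column would need a prefix length and could not be covered).

```sql
-- Hypothetical index: rows in StateID order, with Location included.
ALTER TABLE resort ADD INDEX idx_state_location (StateID, Location);

-- Both the sort order and the selected column are now available from one index.
EXPLAIN
SELECT Location
FROM resort
ORDER BY StateID;

-- GROUP BY and ORDER BY on different expressions: a candidate for
-- "Using temporary" and "Using filesort" in the Extra column.
EXPLAIN
SELECT StateID, COUNT(*) AS resorts
FROM resort
GROUP BY StateID
ORDER BY resorts DESC;
```

Running each EXPLAIN before and after the `ALTER TABLE` shows how the Extra column changes on a given data set.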
References
● High Performance MySQL by Baron Schwartz, Peter Zaitsev and Vadim Tkachenko.
● http://net.tutsplus.com/tutorials/other/top-20-mysql-best-practices/
● http://www.mysqlperformanceblog.com/2009/01/12/should-you-move-from-myisam-to-in
● http://www.techrepublic.com/blog/10things/10-ways-to-screw-up-your-database-design/18

Thank You