2. case 1: Bad release. Overview.
- Symptoms:
- significant load increase,
- slowing operations.
- Often unpredictable:
- we don't know where the problem occurs.
- Emergency:
- problem must be found and resolved ASAP.
3. case 1: Bad release. Troubleshoot.
- Outside the database
- top, sysstat, etc...
- nagios/zabbix/...
- Inside the database
- pgbadger/loganalyze/etc...
- pg_stat_statements
4. case 1: Bad release. Outside the database.
- top:
- cpu usage, load average, swapping, iowait.
- sysstat:
- disk utilization (iostat),
- resource consumption (sar).
5. case 1: Bad release. Outside the database.
- Okmeter:
- online monitoring service,
- rich feature/plugin set,
- postgresql good support.
6. case 1: Bad release. Inside the database.
- Log analyze (pgBadger)
- huge logs
- read log before report creating
- a lot of time need tobuild report
- pg_stat_statements (contrib)
- small storage footprint,
- quick and flexible reports.
7. case 1: Bad release. Inside the database.
- query_stat_total.sql
- https://github.com/PostgreSQL-Consulting/pg-
utils
8. case 1: Bad release. Query #1.
SELECT
p.id,
p.rating
FROM posts p
LEFT JOIN complaints com ON (com.post_id = p.id AND com.user_id= ?)
WHERE p.is_deleted IS FALSE
AND com.is_hide IS NOT TRUE
AND p.type_id != ?
ORDER BY p.rating DESC LIMIT ?;
9. case 1: Bad release. Query #1. JOIN -> (NOT) EXISTS
SELECT
p.id,
p.rating
FROM posts p
WHERE p.is_deleted IS FALSE
AND p.type_id != ?
AND NOT EXISTS (SELECT 1 FROM complaints com
WHERE com.post_id = p.id
AND user_id = ? AND is_hide = true)
ORDER BY p.rating DESC LIMIT ?;
10. case 1: Bad release. Query #2.
SELECT * FROM tags WHERE (tags.title ilike ?)
Trigram Index.
CREATE INDEX tags_title_trigram_key on tags using gin(title gin_trgm_ops);
11. case 1: Bad release. Query #3.
SELECT post.*
FROM post
JOIN domain ON post.domain_id = domain.id
LEFT OUTER JOIN domain_acl
ON domain_acl.domain_id = domain.id
AND domain_acl.user_id = ?
WHERE post.deleted = ?
AND post.domain_id IN (?, ?, ?, ?, ?, ?, ?, ?)
AND ((domain.flags & ?) = ?
OR (domain_acl.acl & ?) = ?)
AND post.id NOT IN (?, ?)
ORDER BY post.last_activity DESC LIMIT ? OFFSET ?
12. case 1: Bad release. Query #3. Index Only Scan
SELECT *
FROM post
WHERE id IN (SELECT post.id
FROM post
JOIN domain ON post.domain_id = domain.id
LEFT OUTER JOIN domain_acl
ON domain_acl.domain_id = domain.id
AND domain_acl.user_id = ?
WHERE post.deleted = ?
AND post.domain_id IN (?, ?, ?, ?, ?, ?, ?, ?)
AND ((domain.flags & ?) = ? OR (domain_acl.acl & ?) = ?)
AND post.id NOT IN (?, ?)
ORDER BY post.last_activity DESC LIMIT ? OFFSET ?)
order by post.last_activity DESC
CREATE INDEX post_domain_id_last_activity_id_deleted_partial
ON post USING btree (domain_id, last_activity, id, deleted) where deleted = 0;
13. case 1: Bad release. Query #4.
SELECT *
FROM "group"
WHERE ("group".group_vislvl_content >= ?)
AND (group_main_domain_id IS NULL OR group_main_domain_id IN(?,?))
AND ("group".obj_pics_count +
"group".group_persons_count +
"group".group_blog_posts_count +
"group".group_wiki_count >= ?)
AND "group".group_is_demo = ?
AND "group".obj_status_did = ?
ORDER BY "group".group_persons_count desc, "group".obj_created asc
LIMIT ?;
14. case 1: Bad release. Query #4. Partial Index
CREATE INDEX group_special2_key
ON "group" USING btree (group_persons_count DESC, obj_created)
WHERE ("group".obj_pics_count + "group".group_persons_count +
"group".group_blog_posts_count + "group".group_wiki_count >= 1);
15. case 2: More app servers... We need more...
- project grow
- load increasing
- add more app servers
- more apps -> more db connections
16. case 2: More app servers... We need more...
- too much db connections are bad
- high resource contention
- os overhead (memory, locks, forks)
17. case 2: More app servers... We need more...
- pgbouncer
- lightweight connection pooler
- stable, simple, fast (libevent)
- use pgbouncer between apps and database
18. case 2: More app servers... We need more...
- simple test: without pgbouncer
- pgbench -C -c 32 -T 300 -U postgres shopdb
transaction type: TPC-B (sort of), scaling factor: 128, query mode: simple,
number of clients: 32, number of threads: 1, duration: 300 s
number of transactions actually processed: 253628
latency average: 37.851 ms
tps = 845.403711 (including connections establishing)
tps = 15320.442789 (excluding connections establishing)
19. case 2: More app servers... We need more...
- simple test: with pgbouncer
- pgbench -C -c 32 -T 300 -U postgres shopdb
transaction type: TPC-B (sort of), scaling factor: 128, query mode: simple,
number of clients: 32, number of threads: 1, duration: 300 s
number of transactions actually processed: 2689931
latency average: 3.569 ms
tps = 8966.389025 (including connections establishing)
tps = 19225.431659 (excluding connections establishing)
20. case 2: More app servers... We need more...
- total: 300 seconds with 32 clients on 8-core server
- latency: 37.8ms vs. 3.5ms
- total transactions: 253628 vs. 2689931
- tps: 15320 vs. 19225