The document provides three steps to optimize slow-running database queries:
1. Know your data structure and how it will be queried
2. Understand your different use cases and filter/structure data accordingly
3. Use the EXPLAIN command and indexes to tune queries by reducing joins, sorting data optimally, and limiting the number of queries
2. About me
• Oscar Westra van Holthe – Kind
• Software developer at bol.com since 2012
• You may know me from:
• Connecting retailers via SDD
• Retailer invoicing
• Topspin
• Measurements 2.0
4. The problem: data traffic jam
Symptoms:
• Your queries are slow
• Your DB connection times out on the query
• …
5. Relational Databases
• For OLTP, nothing beats a relational database
• PostgreSQL query planner does an awesome job
• The root cause is always that the DB is doing too much
6. Tool: the EXPLAIN command
• Tells you how the database will execute your query
• Tells you the associated costs
cost=42.78..812.82 rows=16532 width=32
• start-up cost (the time before the output can begin)
• total cost, assuming the plan node is run to completion,
i.e. there's no limit clause or similar.
• est. number of rows output by this plan node
• average width of rows output by this plan node (in bytes).
• Costs have arbitrary scale, but lower is better
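As a minimal sketch of using the command (table and numbers are illustrative, not from the talk), prefix any query with EXPLAIN:

```sql
-- Hypothetical query; output numbers vary with your data and statistics.
EXPLAIN SELECT * FROM order_line WHERE ol_w_id = 1;

-- Sample output:
--   Seq Scan on order_line  (cost=0.00..5289.00 rows=32734 width=70)
--     Filter: (ol_w_id = 1)
```

EXPLAIN ANALYZE additionally executes the query and reports actual times and row counts alongside the estimates.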
7. Execution plan: the menu
• Scans:
• Sequential (full table access)
• Index (range)
• Index only (range; skips the table itself)
• Bitmap (reads the index first, then fetches matching table pages)
• Joins:
• Hash (builds a hash table from the smaller input, probes it with the larger)
• Nested loop (for each outer row, looks up matching rows in the inner input)
• Merge (zips two sorted inputs together on the join key)
• Sorting:
• External merge (disk), quicksort (memory), top-N heapsort (memory + limit), or none
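As an illustration (hypothetical table and index), the planner picks a scan type based on how selective the filter is:

```sql
-- Hypothetical schema, for illustration only.
CREATE INDEX idx_orders_customer ON orders (customer_id);

-- Selective filter: likely an index (or bitmap) scan.
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

-- No filter: a sequential scan reads the whole table.
EXPLAIN SELECT * FROM orders;
```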
8. What if EXPLAIN tells you…
Hash Join (cost=22896.89..54208.53 rows=330801 width=1239)
Hash Cond: (order_line.ol_o_id = oorder.o_id)
-> Nested Loop (cost=8853.68..27149.42 rows=32734 width=542)
-> Seq Scan on warehouse (cost=0.00..1.01 rows=1 width=85)
Filter: (w_id = 1)
-> Merge Join (cost=8853.68..26821.07 rows=32734 width=457)
Merge Cond: (order_line.ol_i_id = item.i_id)
-> Merge Join (cost=8852.66..22503.03 rows=32734 width=385)
Merge Cond: (stock.s_i_id = order_line.ol_i_id)
-> Index Scan using pk_stock on stock (cost=0.00..12910.70 rows=100000 width=315)
Index Cond: (s_w_id = 1)
-> Materialize (cost=8852.63..9261.81 rows=32734 width=70)
-> Sort (cost=8852.63..8934.47 rows=32734 width=70)
Sort Key: order_line.ol_i_id
-> Bitmap Heap Scan on order_line (cost=843.82..5053.83 rows=32734 width=70)
Recheck Cond: ((ol_w_id = 1) AND (ol_d_id = 1))
-> Bitmap Index Scan on pk_order_line (cost=0.00..835.64 rows=32734 width=0)
Index Cond: ((ol_w_id = 1) AND (ol_d_id = 1))
-> Index Scan using pk_item on item (cost=0.00..3659.26 rows=100000 width=72)
-> Hash (cost=11040.12..11040.12 rows=29767 width=697)
-> Hash Join (cost=3743.15..11040.12 rows=29767 width=697)
Hash Cond: (oorder.o_d_id = district.d_id)
-> Merge Join (cost=3741.90..10629.58 rows=29767 width=606)
Merge Cond: ((customer.c_d_id = oorder.o_d_id) AND (customer.c_id = oorder.o_c_id))
-> Index Scan using pk_customer on customer (cost=0.00..6215.00 rows=30000 width=564)
Index Cond: (c_w_id = 1)
-> Materialize (cost=3741.90..4116.90 rows=30000 width=42)
-> Sort (cost=3741.90..3816.90 rows=30000 width=42)
Sort Key: oorder.o_d_id, oorder.o_c_id
-> Seq Scan on oorder (cost=0.00..636.00 rows=30000 width=42)
Filter: (o_w_id = 1)
-> Hash (cost=1.12..1.12 rows=10 width=91)
-> Seq Scan on district (cost=0.00..1.12 rows=10 width=91)
Filter: (d_w_id = 1)
9. Three steps to improve
1. Know your data
2. Know your use cases
3. Tune the query
11. Know your use cases
• Filter on a minimal number of tables
• Denormalize immutable data to reduce joins
• Make data immutable if appropriate
(and change your use cases accordingly)
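A sketch of the denormalization idea, with made-up tables: if an item's name never changes after creation, copying it into the order line removes a join from every read.

```sql
-- Normalized: every read joins order_line to item.
SELECT ol.ol_qty, i.i_name
FROM order_line ol
JOIN item i ON i.i_id = ol.ol_i_id;

-- Denormalized: the immutable item name is copied at insert time,
-- so reads need no join (hypothetical column ol_i_name).
SELECT ol_qty, ol_i_name
FROM order_line;
```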
12. Tune your queries
• Create an index for every query/table pair
• Order index columns by how they are used:
• Filter columns (WHERE + JOIN) first
• Then GROUP BY columns
• ORDER BY columns last (if appropriate)
• Notes:
• Smaller indices perform better
(sometimes you actually need near-duplicate indices)
• Too many indices degrade (write) performance
→ limit the number of queries if you can
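A sketch of the column-ordering rule with an invented query: filter columns come first, then the GROUP BY / ORDER BY columns.

```sql
-- Hypothetical query to support:
--   SELECT o_d_id, count(*)
--   FROM oorder
--   WHERE o_w_id = 1
--   GROUP BY o_d_id
--   ORDER BY o_d_id;

-- Matching index: the filter column first, then the column
-- used for both GROUP BY and ORDER BY.
CREATE INDEX idx_oorder_w_d ON oorder (o_w_id, o_d_id);
```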
13. Takeaway
• The best way to optimize queries is to do less
• Know your data & use cases
14. Resources
• PostgreSQL documentation on “Using EXPLAIN”:
https://www.postgresql.org/docs/current/static/using-explain.html
• In-depth explanation of a single execution plan:
https://robots.thoughtbot.com/reading-an-explain-analyze-query-plan
16. What if that’s not enough?
• You’ve optimized your use cases
• You’ve optimized your data for read performance
• You’ve optimized your queries
• And it’s still not good enough…
• Then it’s time for a ’70s-era mainframe big data solution
→ but that fundamentally changes your use cases!
• Not querying when you need it, but:
• Batching / asynchronous → run the query ahead of using its results
• Streaming → query continuously
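In PostgreSQL, the batching approach can be sketched with a materialized view: the expensive query runs ahead of time, and reads hit the precomputed result (all names below are made up).

```sql
-- Precompute an expensive aggregation (hypothetical tables).
CREATE MATERIALIZED VIEW daily_order_totals AS
SELECT o_w_id, date_trunc('day', o_entry_d) AS day, count(*) AS orders
FROM oorder
GROUP BY o_w_id, date_trunc('day', o_entry_d);

-- Reads become cheap lookups against the precomputed result.
SELECT * FROM daily_order_totals WHERE o_w_id = 1;

-- Refresh asynchronously, e.g. from a nightly batch job.
REFRESH MATERIALIZED VIEW daily_order_totals;
```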