The document provides three steps to optimize slow-running database queries:
1. Know your data structure and how it will be queried
2. Understand your different use cases and filter/structure data accordingly
3. Use the EXPLAIN command and indexes to tune queries by reducing joins, sorting data optimally, and limiting the number of queries
2. About me
• Oscar Westra van Holthe – Kind
• Software developer at bol.com since 2012
• You may know me from:
• Connecting retailers via SDD
• Retailer invoicing
• Topspin
• Measurements 2.0
4. The problem: data traffic jam
Symptoms:
• Your queries are slow
• Your DB connection times out on the query
• …
5. Relational Databases
• For OLTP, nothing beats a relational database
• PostgreSQL query planner does an awesome job
• The root cause is always that the DB is doing too much
6. Tool: the EXPLAIN command
• Tells you how the database will execute your query
• Tells you the associated costs
cost=42.78..812.82 rows=16532 width=32
• start-up cost (the time before the output can begin)
• total cost, assuming the plan node is run to completion,
i.e. there's no limit clause or similar.
• est. number of rows output by this plan node
• average width of rows output by this plan node (in bytes).
• Costs have arbitrary scale, but lower is better
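As a minimal sketch of using the command (table and numbers are illustrative, not from the talk), prefix any query with EXPLAIN:

```sql
-- Hypothetical query; output numbers vary with your data and statistics.
EXPLAIN SELECT * FROM order_line WHERE ol_w_id = 1;

-- Sample output:
--   Seq Scan on order_line  (cost=0.00..5289.00 rows=32734 width=70)
--     Filter: (ol_w_id = 1)
```

EXPLAIN ANALYZE additionally executes the query and reports actual times and row counts alongside the estimates.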
7. Execution plan: the menu
• Scans:
• Sequential (full table access)
• Index (range)
• Index only (range; skips the table itself)
• Bitmap (reads the index first, then fetches matching table pages)
• Joins:
• Hash (builds a hash table from the smaller input, probes it with the larger)
• Nested loop (for each outer row, looks up matching rows in the inner input)
• Merge (zips two sorted inputs together on the join key)
• Sorting:
• External merge (disk), quicksort (memory), top-N heapsort (memory + limit), or none
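As an illustration (hypothetical table and index), the planner picks a scan type based on how selective the filter is:

```sql
-- Hypothetical schema, for illustration only.
CREATE INDEX idx_orders_customer ON orders (customer_id);

-- Selective filter: likely an index (or bitmap) scan.
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

-- No filter: a sequential scan reads the whole table.
EXPLAIN SELECT * FROM orders;
```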
8. What if EXPLAIN tells you…
Hash Join (cost=22896.89..54208.53 rows=330801 width=1239)
Hash Cond: (order_line.ol_o_id = oorder.o_id)
-> Nested Loop (cost=8853.68..27149.42 rows=32734 width=542)
-> Seq Scan on warehouse (cost=0.00..1.01 rows=1 width=85)
Filter: (w_id = 1)
-> Merge Join (cost=8853.68..26821.07 rows=32734 width=457)
Merge Cond: (order_line.ol_i_id = item.i_id)
-> Merge Join (cost=8852.66..22503.03 rows=32734 width=385)
Merge Cond: (stock.s_i_id = order_line.ol_i_id)
-> Index Scan using pk_stock on stock (cost=0.00..12910.70 rows=100000 width=315)
Index Cond: (s_w_id = 1)
-> Materialize (cost=8852.63..9261.81 rows=32734 width=70)
-> Sort (cost=8852.63..8934.47 rows=32734 width=70)
Sort Key: order_line.ol_i_id
-> Bitmap Heap Scan on order_line (cost=843.82..5053.83 rows=32734 width=70)
Recheck Cond: ((ol_w_id = 1) AND (ol_d_id = 1))
-> Bitmap Index Scan on pk_order_line (cost=0.00..835.64 rows=32734 width=0)
Index Cond: ((ol_w_id = 1) AND (ol_d_id = 1))
-> Index Scan using pk_item on item (cost=0.00..3659.26 rows=100000 width=72)
-> Hash (cost=11040.12..11040.12 rows=29767 width=697)
-> Hash Join (cost=3743.15..11040.12 rows=29767 width=697)
Hash Cond: (oorder.o_d_id = district.d_id)
-> Merge Join (cost=3741.90..10629.58 rows=29767 width=606)
Merge Cond: ((customer.c_d_id = oorder.o_d_id) AND (customer.c_id = oorder.o_c_id))
-> Index Scan using pk_customer on customer (cost=0.00..6215.00 rows=30000 width=564)
Index Cond: (c_w_id = 1)
-> Materialize (cost=3741.90..4116.90 rows=30000 width=42)
-> Sort (cost=3741.90..3816.90 rows=30000 width=42)
Sort Key: oorder.o_d_id, oorder.o_c_id
-> Seq Scan on oorder (cost=0.00..636.00 rows=30000 width=42)
Filter: (o_w_id = 1)
-> Hash (cost=1.12..1.12 rows=10 width=91)
-> Seq Scan on district (cost=0.00..1.12 rows=10 width=91)
Filter: (d_w_id = 1)
9. Three steps to improve
1. Know your data
2. Know your use cases
3. Tune the query
11. Know your use cases
• Filter on a minimal number of tables
• Denormalize immutable data to reduce joins
• Make data immutable if appropriate
(and change your use cases accordingly)
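A sketch of the denormalization idea, with made-up tables: if an item's name never changes after creation, copying it into the order line removes a join from every read.

```sql
-- Normalized: every read joins order_line to item.
SELECT ol.ol_qty, i.i_name
FROM order_line ol
JOIN item i ON i.i_id = ol.ol_i_id;

-- Denormalized: the immutable item name is copied at insert time,
-- so reads need no join (hypothetical column ol_i_name).
SELECT ol_qty, ol_i_name
FROM order_line;
```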
12. Tune your queries
• Create an index for every query/table pair
• Order index columns by how they are used:
• Filter columns (WHERE + JOIN) first
• Then GROUP BY columns
• ORDER BY columns last (if appropriate)
• Notes:
• Smaller indices perform better
(sometimes you actually need near-duplicate indices)
• Too many indices degrade (write) performance
→ limit the number of queries if you can
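A sketch of the column-ordering rule with an invented query: filter columns come first, then the GROUP BY / ORDER BY columns.

```sql
-- Hypothetical query to support:
--   SELECT o_d_id, count(*)
--   FROM oorder
--   WHERE o_w_id = 1
--   GROUP BY o_d_id
--   ORDER BY o_d_id;

-- Matching index: the filter column first, then the column
-- used for both GROUP BY and ORDER BY.
CREATE INDEX idx_oorder_w_d ON oorder (o_w_id, o_d_id);
```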
13. Takeaway
• The best way to optimize queries is to do less
• Know your data & use cases
14. Resources
• PostgreSQL documentation on “Using EXPLAIN”:
https://www.postgresql.org/docs/current/static/using-explain.html
• In-depth explanation of a single execution plan:
https://robots.thoughtbot.com/reading-an-explain-analyze-query-plan
16. What if that’s not enough?
• You’ve optimized your use cases
• You’ve optimized your data for read performance
• You’ve optimized your queries
• And it’s still not good enough…
• Then it’s time for a ’70s-era mainframe big data solution
→ but that fundamentally changes your use cases!
• Not querying when you need it, but:
• Batching / asynchronous → run the query ahead of using its results
• Streaming → query continuously
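In PostgreSQL, the batching approach can be sketched with a materialized view: the expensive query runs ahead of time, and reads hit the precomputed result (all names below are made up).

```sql
-- Precompute an expensive aggregation (hypothetical tables).
CREATE MATERIALIZED VIEW daily_order_totals AS
SELECT o_w_id, date_trunc('day', o_entry_d) AS day, count(*) AS orders
FROM oorder
GROUP BY o_w_id, date_trunc('day', o_entry_d);

-- Reads become cheap lookups against the precomputed result.
SELECT * FROM daily_order_totals WHERE o_w_id = 1;

-- Refresh asynchronously, e.g. from a nightly batch job.
REFRESH MATERIALIZED VIEW daily_order_totals;
```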