2. Agenda
• Introduction
• Sample Architecture
• The optimizer and execution plans
• Examples of single table processing
• Examples of Join processing
3. Scaling Databases
• Scaling – expanding a system to support more data / sessions.
• Best scalability – linear, predictable.
• Scale-up (bigger server) vs. Scale-out (more servers)
• Scaling up – easier, but limited, expensive
• Most common scale-out strategy – Sharding
• Spreading the data (rows in a table) across many independent nodes
• Each node has a different subset of the data – Shared Nothing
• Processing sharded data across a shared-nothing cluster is also called
Massively Parallel Processing (MPP)
• MPP databases have existed since the 1980s (ex: Teradata) and became popular in
the analytic space in the 2000s (ex: Netezza, Greenplum, Vertica)
• Open source examples over Hadoop – Hive(*), Impala
4. Sample MPP database architecture
• [Diagram] A SQL client talks to the master node. The master node holds the data dictionary, the sessions and the optimizer; the data lives on shards 001, 002, 003, … nnn.
• Flow: 1. SQL (client → master); 2. Execution plan (built by the master); 3. Parallel SQL execution (on all shards); 4. Results (shards → master); 5. Results (master → client).
• Each table – distributed across all shards
5. Processing – Analytical vs. Operational
• With MongoDB – most operations involve a single document
• With SQL – most operations involve processing many rows, likely
across all shards
• Example: sum of sales per day per store
• Also, SQL is more expressive – it has a rich set of complex operations
(joining, aggregating, sorting etc)
• A database optimizer builds an execution plan:
• The access path per table (full scan, index scan etc)
• The order of the joins
• The type of each join (multiple algorithms)
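The optimizer's job can be illustrated with a toy cost model. The tables are from the slides, but the access-path costs and the cost formula below are entirely invented for illustration; a real optimizer uses statistics and far richer models.

```python
# Toy sketch of cost-based planning: enumerate the alternatives and keep
# the cheapest plan. All costs here are made up for illustration.
from itertools import permutations

# Pretend per-table access paths and their costs.
access_cost = {"calls": ("seq scan", 100.0), "subscribers": ("seq scan", 10.0)}

def plan_cost(order):
    # Crude model: the cost of the scans, plus a penalty for building the
    # hash table on the first input of the hash join.
    scan = sum(access_cost[t][1] for t in order)
    build_penalty = access_cost[order[0]][1] * 0.5
    return scan + build_penalty

best = min(permutations(access_cost), key=plan_cost)
# The cheaper table wins the build side of the hash join.
assert best[0] == "subscribers"
```

Even this toy version shows the three decisions listed above: which access path each table gets, in which order the tables are joined, and (implicitly, via the penalty) which side of the join each table takes.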
6. Execution Plan – Sample Table
• Syntax and execution plans are based on Greenplum – but the lessons are general.
• We’ll start with a simple, single table, with no indexes.
• It holds data about calls
• CREATE TABLE calls
    (subscriber_id integer,
     call_date     date,
     call_length   integer)
  DISTRIBUTED BY (subscriber_id);
• We can control the sharding key (distribution key) – this will later enable
some join optimizations.
• Row Placement: shard number = hash(subscriber_id) mod (# of shards)
• Generally, we want data to be spread equally across all shards (no skew)
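The row-placement rule and the no-skew goal can be sketched in a few lines of Python. The shard count and the synthetic subscriber ids are illustrative, not Greenplum internals.

```python
# Minimal sketch of the placement rule above:
# shard number = hash(subscriber_id) mod (number of shards).
from collections import Counter

N_SHARDS = 8

def placement(subscriber_id, n_shards=N_SHARDS):
    # Deterministic: the same subscriber always lands on the same shard.
    return hash(subscriber_id) % n_shards

# Spread 10,000 synthetic subscriber ids and measure skew: with an even
# spread, every shard holds close to the 1,250-row average.
counts = Counter(placement(sid) for sid in range(10_000))
assert len(counts) == N_SHARDS
assert max(counts.values()) - min(counts.values()) <= 1
```

A badly chosen key (say, a city code with one huge city) would concentrate rows on a few shards, and the slowest shard would then dominate every parallel step.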
7. Single Table Execution - Plan 1
• EXPLAIN SELECT * FROM calls
WHERE call_date BETWEEN '2013/11/01' AND '2013/11/30';
•
QUERY PLAN
-------------------------------------------------
Gather Motion n:1
-> Seq Scan on calls
Filter: call_date >= '2013-11-01'::date AND
call_date <= '2013-11-30'::date
• Sequential Scan – a full scan of each table shard
• Filter – applied during the scan
• Gather Motion – moving the result set of each shard to the master
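The three plan nodes map directly onto a simple simulation. The data and function names below are invented for illustration; each inner list stands for the rows local to one shard.

```python
# Toy simulation of the plan: per-shard sequential scan with the filter
# applied during the scan, then an n:1 "gather motion" to the master.
import datetime as dt

# One list per shard; each tuple is (subscriber_id, call_date, call_length).
shards = [
    [(1, dt.date(2013, 11, 5), 120), (2, dt.date(2013, 10, 1), 30)],
    [(3, dt.date(2013, 11, 30), 45)],
]

def seq_scan_with_filter(rows, lo, hi):
    # Full scan of the shard's rows; no index is involved.
    return [r for r in rows if lo <= r[1] <= hi]

def gather_motion(per_shard_results):
    # n:1 motion: every shard streams its result set to the master.
    return [row for result in per_shard_results for row in result]

result = gather_motion(
    seq_scan_with_filter(s, dt.date(2013, 11, 1), dt.date(2013, 11, 30))
    for s in shards
)
assert len(result) == 2  # only the two November calls survive the filter
```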
8. Single Table Execution - Plan 2
• EXPLAIN SELECT call_date, count(*)
FROM calls
WHERE call_length <= 60
GROUP BY call_date;
• Challenge – do the group by in parallel
• General case - could be millions or billions of groups
• Challenge – the rows for each group are distributed across all shards
• Conclusion – the processes in the shards need to communicate
9. Single Table Execution - Plan 2
• [Diagram] Process Group 1 – local Scan, Filter and Aggregation on each shard (001 … nnn)
• Re-distributing (streaming) the result set over the cluster network (n:n)
• Process Group 2 – final Aggregation of each group
• Send the final results to the master
10. Single Table Execution - Plan 2
• QUERY PLAN
-------------------------------------------------
Gather Motion n:1
-> HashAggregate
Group By: calls.call_date
-> Redistribute Motion n:n
Hash Key: calls.call_date
-> HashAggregate
Group By: calls.call_date
-> Seq Scan on calls
Filter: call_length <= 60
• HashAggregate – aggregates the rows group by group, using a hash table on the grouping key
• Redistribute Motion – redistributes the data across the shards, to a new set of
processes
• Each row of the partial result set is sent to shard number = hash(call_date) mod (# of shards)
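The two-phase aggregation is easy to simulate. The per-shard data below is invented; each list holds the call_date of the rows that survived the local filter on that shard.

```python
# Sketch of the parallel GROUP BY: phase 1 aggregates locally on each
# shard, a redistribute motion routes each partial group by its hash,
# and phase 2 merges the partials - all rows of a group meet on one shard.
from collections import Counter, defaultdict

N_SHARDS = 3
shards = [
    ["2013-11-01", "2013-11-02", "2013-11-01"],  # filtered rows, shard 1
    ["2013-11-01", "2013-11-02"],                # shard 2
    ["2013-11-02"],                              # shard 3
]

# Phase 1: local HashAggregate on each shard.
partials = [Counter(rows) for rows in shards]

# Redistribute Motion n:n - route each partial group by hash(call_date).
inboxes = defaultdict(list)
for partial in partials:
    for call_date, cnt in partial.items():
        inboxes[hash(call_date) % N_SHARDS].append((call_date, cnt))

# Phase 2: final HashAggregate - sum the partial counts per group.
final = Counter()
for rows in inboxes.values():
    for call_date, cnt in rows:
        final[call_date] += cnt

assert final == {"2013-11-01": 3, "2013-11-02": 3}
```

The local phase matters: it shrinks what crosses the network from one row per call to one row per (shard, group) pair.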
11. Single Table Execution - Plan 3
• EXPLAIN SELECT call_date, count(*)
FROM calls
WHERE call_length <= 60
GROUP BY call_date
ORDER BY call_date;
• QUERY PLAN
-------------------------------------------------
Gather Motion n:1
Merge Key: call_date
-> Sort
Sort Key: partial_aggregation.call_date
-> HashAggregate
Group By: calls.call_date
-> Redistribute Motion n:n
Hash Key: calls.call_date
-> HashAggregate
Group By: calls.call_date
-> Seq Scan on calls
Filter: call_length <= 60
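The new pieces here are the Sort below the gather and the Merge Key on the gather itself: each shard sorts its own final groups, and the master merges the already-sorted streams instead of re-sorting everything. A sketch with invented data (after the n:n redistribution by call_date, each date lives on exactly one shard, so the streams are disjoint):

```python
# Sorted gather: each shard ships its groups pre-sorted by call_date,
# and the master performs an n-way merge (the "Merge Key" in the plan).
import heapq

per_shard_sorted = [
    [("2013-11-01", 10), ("2013-11-04", 7)],  # shard 1, sorted locally
    [("2013-11-02", 5)],                      # shard 2
    [("2013-11-03", 2)],                      # shard 3
]

merged = list(heapq.merge(*per_shard_sorted))
assert [d for d, _ in merged] == [
    "2013-11-01", "2013-11-02", "2013-11-03", "2013-11-04"
]
```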
12. Execution Plan – A Second Table
• Let’s add a second table so we can have some joins.
• It holds details of each subscriber
• CREATE TABLE subscribers
    (subscriber_id        integer,
     subscriber_city_code integer)
  DISTRIBUTED BY (subscriber_id);
• To start with, both tables have the same distribution key
• So, all the rows of any specific subscriber, from both tables, will be hosted
on the same shard.
• We can leverage this knowledge in our algorithm
• Later we will see what happens if this is not the case
13. Simple Join 1 – Same Distribution Key
• EXPLAIN SELECT s.subscriber_id, s.subscriber_city_code,
c.call_date, c.call_length
FROM calls c JOIN subscribers s
ON(c.subscriber_id = s.subscriber_id)
WHERE s.subscriber_city_code = 4;
• QUERY PLAN
-------------------------------------------------
Gather Motion n:1
-> Hash Join
Hash Cond: c.subscriber_id = s.subscriber_id
-> Seq Scan on calls c
-> Hash
-> Seq Scan on subscribers s
Filter: subscriber_city_code = 4
• Hash Join – joins two tables
• First table is processed, result set is hashed (based on the join key)
• Second table is scanned, joined to the first using hash lookups
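The build/probe mechanics map onto a few lines of Python. The rows below are invented; because both tables share the distribution key, this join runs entirely locally on each shard, with no motion below the join.

```python
# Minimal hash-join sketch matching the plan: build a hash table on the
# filtered subscribers rows, then probe it while scanning calls.
subscribers = [(1, 4), (2, 7), (3, 4)]   # (subscriber_id, city_code)
calls = [
    (1, "2013-11-01", 60),
    (2, "2013-11-01", 30),
    (3, "2013-11-02", 45),
]

# Build side: scan subscribers, apply the filter, hash on the join key.
build = {}
for sid, city in subscribers:
    if city == 4:                        # Filter: subscriber_city_code = 4
        build[sid] = city

# Probe side: scan calls and look each row up in the hash table.
joined = [(sid, build[sid], d, length)
          for sid, d, length in calls if sid in build]

assert joined == [(1, 4, "2013-11-01", 60), (3, 4, "2013-11-02", 45)]
```

Note the optimizer put the filtered (smaller) subscribers set on the build side, keeping the hash table small.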
14. Simple Join 2 – Same Distribution Key
• EXPLAIN SELECT c.call_date, s.subscriber_city_code,
count (*), sum(c.call_length)
FROM calls c JOIN subscribers s
ON (c.subscriber_id = s.subscriber_id)
WHERE s.subscriber_city_code IN (9,99,999)
AND call_date BETWEEN '2012/01/04' AND '2012/01/06'
GROUP BY 1,2
ORDER BY c.call_date, sum(c.call_length) DESC;
• Nothing new – just a mix of all we’ve seen
15. Simple Join 2 – Same Distribution Key
QUERY PLAN
-------------------------------------------------
Gather Motion n:1
Merge Key: call_date, sum
-> Sort
Sort Key: partial_aggregation.call_date, sum
-> HashAggregate
Group By: c.call_date, s.subscriber_city_code
-> Redistribute Motion n:n
Hash Key: c.call_date, s.subscriber_city_code
-> HashAggregate
Group By: c.call_date, s.subscriber_city_code
-> Hash Join
Hash Cond: c.subscriber_id = s.subscriber_id
-> Seq Scan on calls c
Filter: call_date >= '2012-01-04'::date AND
call_date <= '2012-01-06'::date
-> Hash
-> Seq Scan on subscribers s
Filter: subscriber_city_code =
ANY ('{9,99,999}'::integer[])
16. Simple Join 1 – Different Distribution Key
• What if the subscriber table was distributed differently?
• ALTER TABLE subscribers
SET DISTRIBUTED BY(subscriber_city_code);
• Now our data about subscribers is mixed
• The set of subscribers in shard 1 of the calls table is not the same as in shard 1 of the subscribers table
• How to run Simple Join 1 query from before?
• Now, there has to be some shuffling of data over the network
• To minimize the work, it is better to shuffle the smaller table over the network
• Since the join key of the calls table is the same as its distribution key (subscriber_id), we
can send each row from the subscribers result set directly to the right shard.
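The redistribution step can be sketched as follows. The shard count and rows are invented; the point is that each filtered subscribers row is re-sent to hash(subscriber_id) mod n, the shard that already holds that subscriber's calls, after which the hash join runs locally as before.

```python
# Sketch of the Redistribute Motion: route each filtered subscribers row
# by the JOIN key (subscriber_id), not its current distribution key.
N_SHARDS = 2

def target_shard(subscriber_id):
    # Same placement rule the calls table was distributed with.
    return hash(subscriber_id) % N_SHARDS

# Filtered subscribers rows, currently living on arbitrary shards.
filtered_subscribers = [(1, 4), (3, 4), (4, 4)]

inboxes = {i: [] for i in range(N_SHARDS)}
for sid, city in filtered_subscribers:
    inboxes[target_shard(sid)].append((sid, city))   # network shuffle

# Every row landed on the shard where calls rows with the same
# subscriber_id already live, so the join needs no further motion.
assert all(target_shard(sid) == shard
           for shard, rows in inboxes.items() for sid, _ in rows)
```

Shuffling only the small, filtered side keeps the network cost proportional to the smaller input, which is exactly the "shuffle the smaller table" rule above.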
17. Simple Join 1 – Different Distribution Key
• EXPLAIN SELECT s.subscriber_id, s.subscriber_city_code,
c.call_date, c.call_length
FROM calls c JOIN subscribers s
ON(c.subscriber_id = s.subscriber_id)
WHERE s.subscriber_city_code = 4;
• Same query as Simple Join 1!
• QUERY PLAN
-------------------------------------------------
Gather Motion n:1
-> Hash Join
Hash Cond: c.subscriber_id = s.subscriber_id
-> Seq Scan on calls c
-> Hash
-> Redistribute Motion 1:n
Hash Key: s.subscriber_id
-> Seq Scan on subscribers s
Filter: subscriber_city_code = 4
18. Simple Join 2 – Different Distribution Key
• EXPLAIN SELECT c.call_date, s.subscriber_city_code,
count (*), sum(c.call_length)
FROM calls c JOIN subscribers s
ON (c.subscriber_id = s.subscriber_id)
WHERE s.subscriber_city_code IN (9,99,999)
AND call_date BETWEEN '2012/01/04' AND '2012/01/06'
GROUP BY 1,2
ORDER BY c.call_date, sum(c.call_length) DESC;
• Same query as Simple Join 2 – just different distribution
19. Simple Join 2 – Different Distribution Key
QUERY PLAN
-------------------------------------------------
Gather Motion n:1
Merge Key: call_date, sum
-> Sort
Sort Key: partial_aggregation.call_date, sum
-> HashAggregate
Group By: c.call_date, s.subscriber_city_code
-> Redistribute Motion n:n
Hash Key: c.call_date, s.subscriber_city_code
-> HashAggregate
Group By: c.call_date, s.subscriber_city_code
-> Hash Join
Hash Cond: c.subscriber_id = s.subscriber_id
-> Seq Scan on calls c
Filter: call_date >= '2012-01-04'::date AND
call_date <= '2012-01-06'::date
-> Hash
-> Redistribute Motion n:n
Hash Key: s.subscriber_id
-> Seq Scan on subscribers s
Filter: subscriber_city_code =
ANY ('{9,99,999}'::integer[])
20. Teasers
• EXPLAIN SELECT * FROM calls
ORDER BY call_length DESC
LIMIT 10;
(Easy - top 10 calls by length)
• EXPLAIN SELECT call_date, count(*)
FROM calls WHERE call_length <= 60
GROUP BY call_date
HAVING count(*) >= 1000000
ORDER BY call_date;
(Easy – all days with at least a million short calls – HAVING clause)
• EXPLAIN SELECT call_date, count(distinct subscriber_id)
FROM calls GROUP BY call_date;
(Hard – per day, the number of subscribers with calls)
• EXPLAIN SELECT call_date,
count(distinct subscriber_id),
count(distinct call_length)
FROM calls GROUP BY call_date;
(Very Hard – two DISTINCT aggregations)