This talk shows, how business intelligence can be performed with PostgreSQL. It features advanced SQL including aggregations, analytics as well as more sophisticated joins such as LATERAL.
In addition to that loading data as well as optimizing storage are discussed.
3. Scope of this training
Importing data
Simple aggregations
Windowing and analytics
Analyzing time series
Managing incomplete data
Writing custom aggregates
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
5. Loading data
Things to consider when importing data
There are many ways to import data
Avoid mini-transactions for performance reasons
In case of large data sets speed is a major issue
There is a life after importing
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
6. Importing a simple data set
A simple data structure . . .
test=# CREATE TABLE t_test (a int, b int);
CREATE TABLE
Let us add 10.000 rows now . . .
INSERT INTO t_test VALUES (1, 2);
...
INSERT INTO t_test VALUES (1, 2);
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
7. Using one transaction to import things
BEGIN;
INSERT INTO t_test VALUES (1, 2);
...
INSERT INTO t_test VALUES (1, 2);
COMMIT;
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
8. Observations
Performance can vary depending on hardware
Longer transactions can be WAYS faster
PostgreSQL has to flush every transaction to disk
Most of the time is burned by flushing
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
9. Changing durability requirements
Performance will sky rocket . . .
SET synchronous_commit TO off;
INSERT INTO t_test VALUES (1, 2);
...
INSERT INTO t_test VALUES (1, 2);
The reason is that PostgreSQL does not have to flush every
transaction anymore.
Trading “durability” for performance
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
10. Use bulk loads
Loading single rows is usually a bad idea
Use COPY to do bulk loading
COPY can load data A LOT faster than INSERT due to
significantly smaller overhead
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
11. A simple COPY
This time 10 million lines are imported: (10.000 rows are not
enough)
COPY t_test FROM stdin;
1 2
1 2
...
.
Note the performance difference (rows per second). There is no
need to check column lists, existence of table, etc. anymore ->
higher throughput
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
12. COPY: Observations to be made
In the default configuration (= checkpoint segments = 3) you
will see a steady up and down of I/O speed
This is caused by checkpoints happening in the background
Data has to go to the transaction log to “repair” data files in
case of a crash
Performance is limited by writing data twice
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
13. COPY: Using a transaction log bypass
Writing to the transaction log can be avoided in some cases:
BEGIN;
TRUNCATE t_test;
COPY t_test FROM stdin;
...
COMMIT;
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
14. COPY: Why the bypass works
TRUNCATE will schedule the removal of the data file on
COMMIT
COPY will start writing to a new data file
Concurrency is not an issue because TRUNCATE locks the
table
PostgreSQL can take the old or the new data file on COMMIT
or ROLLBACK
no need to actually repair a data file anymore
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
15. COPY: More on WAL-bypassing
BEGIN;
CREATE TABLE ...
WAL bypassing only works if you are not using streaming
replication
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
16. Freezing rows
Before you import data, run . . .
ALTER TABLE t_test
SET (autovacuum_enabled = off);
Compare the timing of the first
SELECT count(id) FROM t_test;
with the second one.
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
17. Observations
The first run is a lot slower
During the first run, writes will happen
No more writes from the second run on
PostgreSQL sets bits in the background
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
18. The purpose of hint bits
On first write PostgreSQL checks if a row can be seen by
everybody.
This bit is set to make sure that PostgreSQL does not have to
go through expensive visibility checks next time.
This is an issue for big data sets
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
19. Fixing after-import performance
To set hint bits straight away do a
test=# COPY t_test FROM ’/tmp/file.txt’ FREEZE;
Be careful. It only works in case . . .
ERROR: cannot perform FREEZE because the table
was not created or truncated in the
current subtransaction
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
20. VACUUM and hint bits
Bits can be set on an entire block as well (not just on rows)
VACUUM will set those hint bits
However, block bits will not speed things up as much as
row-level bits usually
For a heavily used read-only database system vacuuming data
can actually make sense (not to reclaim space)
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
21. Importing some test data:
Here is some test data:
CREATE TABLE t_oil (country text,
year int,
production int);
COPY t_oil FROM PROGRAM
’curl www.cybertec.at/secret/oil.txt’;
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
23. Basic aggregation
test=# SELECT country, avg(production)
FROM t_oil
GROUP BY 1;
country | avg
---------------+-----------------------
USA | 9141.3478260869565217
Saudi Arabien | 7641.8260869565217391
(2 rows)
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
24. GROUP BY is needed for aggregates
A GROUP BY clause is needed because otherwise groups
cannot be built:
test=# SELECT country, avg(production) FROM t_oil;
ERROR: column "t_oil.country" must appear in the
GROUP BY clause or be used in an
aggregate function
LINE 1: SELECT country, avg(production) FROM t_oil;
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
25. HAVING: Filtering on aggregated data
test=# SELECT country, avg(production)
FROM t_oil
GROUP BY 1
HAVING avg(production) > 8000;
country | avg
---------+-----------------------
USA | 9141.3478260869565217
(1 row)
NOTE that an alias is not allowed in a HAVING clause
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
27. The purpose of windowing
An analogy:
Your car is not a valuable car because it is good
It is valuable because it is better than the ones driven by your
friends
This is what windowing does: The current row in relation to all
rows in the reference group
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
28. Windowing vs GROUP BY
GROUP BY has been designed to reduce the amount of data
and turn it into aggregated values
Windowing is used to compare values and put them into
relation.
Windowing is used along with aggregate functions (e.g. sum,
count, avg, min, max, . . . )
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
29. A simple aggregate: Average values
SELECT *, avg(production) OVER () FROM t_oil ;
country | year | production | avg
---------------+------+------------+---------------
USA | 1965 | 9014 | 8391.58695652
USA | 1966 | 9579 | 8391.58695652
USA | 1967 | 10219 | 8391.58695652
USA | 1968 | 10600 | 8391.58695652
USA | 1969 | 10828 | 8391.58695652
...
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
30. What does the result mean?
‘Give me all rows and the average “over” all rows in the table’
Logically it is the same as . . .
SELECT *, (SELECT avg(production) FROM t_oil) AS avg
FROM t_oil;
However, subselects can be very nasty if the task is a more
complex one
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
31. OVER()-clauses can define order
Calculate max production up to a certain point
SELECT *, max(production) OVER (ORDER BY year)
FROM t_oil
WHERE country = ’Saudi Arabien’;
Saudi Arabia is a so called ‘swing producer’.
Note that max stays up even if production declines
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
32. OVER()-clauses can form groups
Averages for each country
SELECT *, avg(production)
OVER (PARTITION BY country)
FROM t_oil;
country | year | production | avg
---------------+------+------------+---------------
Saudi Arabien | 1965 | 2219 | 7641.82608695
Saudi Arabien | 1966 | 2615 | 7641.82608695
...
USA | 1965 | 9014 | 9141.34782608
USA | 1966 | 9579 | 9141.34782608
...
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
33. Forming groups
Data is split into groups
Each row shows the average of all rows in its group
Note that we got one group (= window) per country
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
34. OVER() can contain order and groups
SELECT *, max(production)
OVER (PARTITION BY country ORDER BY year)
FROM t_oil;
In this case we get the maximum up to a given point
This is done for each country
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
35. Abstracting window-clauses
SELECT *,
min(production) OVER (w),
max(production) OVER (w),
count(production) OVER (w)
FROM t_oil
WINDOW w AS (PARTITION BY country ORDER BY year)
the same clause can be used for many columns
many window-clauses may exist (w, w2, w3, etc.)
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
36. rank() and dense rank()
Data can be ranked according to some order
In case of duplicates
rank gives 1, 2, 2, 2, 5
dense rank gives 1, 2, 2, 2, 3
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
37. Moving rows: lead
ORDER BY defines into which direction to “move” the row
the number defines the offset
SELECT *, lag(production, 1) OVER (ORDER BY year)
FROM t_oil WHERE country = ’USA’;
country | year | production | lag
---------+------+------------+-------
USA | 1965 | 9014 |
USA | 1966 | 9579 | 9014
USA | 1967 | 10219 | 9579
USA | 1968 | 10600 | 10219
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
38. Calculating the change in production
Very easy thing to do now
SELECT *, production - lag(production, 1)
OVER (ORDER BY year)
FROM t_oil WHERE country = ’USA’;
country | year | production | ?column?
---------+------+------------+----------
USA | 1965 | 9014 |
USA | 1966 | 9579 | 565
USA | 1967 | 10219 | 640
USA | 1968 | 10600 | 381
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
39. lead is the opposite of lag
lag is the same as ‘lead(. . . , -1)’
lag pushes elements down
lead pushes elemens up
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
40. moving entire rows
SELECT *, lag(t_oil, 1) OVER (ORDER BY year)
FROM t_oil WHERE country = ’USA’;
country | year | production | lag
---------+------+------------+------------------
USA | 1965 | 9014 |
USA | 1966 | 9579 | (USA,1965,9014)
USA | 1967 | 10219 | (USA,1966,9579)
USA | 1968 | 10600 | (USA,1967,10219)
the composite type can then be disected using a subselect for
the current query
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
41. works for more than just one column
SELECT *, lag((year, production), 1)
OVER (ORDER BY year)
FROM t_oil WHERE country = ’USA’;
country | year | production | lag
---------+------+------------+--------------
USA | 1965 | 9014 |
USA | 1966 | 9579 | (1965,9014)
USA | 1967 | 10219 | (1966,9579)
USA | 1968 | 10600 | (1967,10219)
this is the perfect foundation to build custom aggregates to
solve complex problems
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
42. Splitting data into equal parts
ntile can split your data into n equally sized blocks
ntile(4) will therefore give you a nice quantile distribution
order is needed to achieve that
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
43. Here is how it works . . .
SELECT year, production, ntile(4)
OVER (ORDER BY production)
FROM t_oil WHERE country = ’USA’ ORDER BY 3, 2 DESC;
year | production | ntile
------+------------+-------
2000 | 7733 | 1
1999 | 7731 | 1
...
1966 | 9579 | 2
1989 | 9159 | 2
...
1972 | 11185 | 4
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
44. Work can proceed from there
SELECT ntile, min(production), max(production)
FROM ( SELECT year, production, ntile(4)
OVER (ORDER BY production)
FROM t_oil WHERE country = ’USA’) AS x
GROUP BY 1 ORDER BY 1
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
46. Moving averages
More sophisticated frame-clauses are needed
The average is done for 2 years = current + previous one
SELECT *, avg(production) OVER (ORDER BY year ROWS
BETWEEN 1 PRECEDING AND 0 FOLLOWING)
FROM t_oil WHERE country = ’Saudi Arabien’;
country | year | production | avg
---------------+------+------------+------------
Saudi Arabien | 1965 | 2219 | 2219.0000
Saudi Arabien | 1966 | 2615 | 2417.0000
Saudi Arabien | 1967 | 2825 | 2720.0000
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
48. Combining data
To combine data we need some more import data
CREATE TABLE t_president
(name text,
start_year int,
end_year int,
party text);
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
49. Some input data
A list of all American presidents and their presidency
test=# COPY t_president FROM PROGRAM
’curl www.cybertec.at/secret/president.txt’;
COPY 9
The format is not too nice for analysis
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
50. Input data: American presidents
SELECT * FROM t_president ;
name | start_year | end_year | party
-------------------+------------+----------+------------
Lyndon B. Johnson | 1963 | 1969 | Democrat
Richard M. Nixon | 1969 | 1974 | Republican
Gerald Ford | 1974 | 1977 | Republican
Jimmy Carter | 1977 | 1981 | Democrat
Ronald W. Reagan | 1981 | 1989 | Republican
George H. W. Bush | 1989 | 1993 | Republican
Bill Clinton | 1993 | 2001 | Democrat
George W. Bush | 2001 | 2009 | Republican
Barack Obama | 2009 | 2017 | Democrat
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
51. The challenge: Adjust the format
LATERAL can come to the rescue
SELECT name, party, year
FROM t_president AS x,
LATERAL (SELECT * FROM
generate_series(x.start_year, x.end_year - 1)
AS year) AS y
LIMIT 8;
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
52. The output is:
name | party | year
-------------------+------------+------
Lyndon B. Johnson | Democrat | 1963
Lyndon B. Johnson | Democrat | 1964
Lyndon B. Johnson | Democrat | 1965
Lyndon B. Johnson | Democrat | 1966
Lyndon B. Johnson | Democrat | 1967
Lyndon B. Johnson | Democrat | 1968
Richard M. Nixon | Republican | 1969
Richard M. Nixon | Republican | 1970
Richard M. Nixon | Republican | 1971
Richard M. Nixon | Republican | 1972
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
53. Which party is better for oil?
The following way to solve the problem is definitely not the
only one.
There might be other factors than the party of the president
when it comes to this kind of data.
Keep in mind: It is just an SQL exercise
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
54. Putting things together (1)
CREATE VIEW v AS
WITH b AS (
SELECT name, party, year
FROM t_president AS x,
LATERAL (SELECT * FROM generate_series(
x.start_year,
x.end_year - 1) AS year) AS y)
SELECT a.*, party,
production - lag(production, 1)
OVER (ORDER BY a.year) AS lag
FROM t_oil AS a, b
WHERE a.year = b.year AND country = ’USA’;
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
55. What we got so far
SELECT * FROM v;
country | year | production | party | lag
---------+------+------------+------------+------
USA | 1965 | 9014 | Democrat |
USA | 1966 | 9579 | Democrat | 565
USA | 1967 | 10219 | Democrat | 640
USA | 1968 | 10600 | Democrat | 381
USA | 1969 | 10828 | Republican | 228
USA | 1970 | 11297 | Republican | 469
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
56. Making use of NULL
Remember NULL is ignored by aggregate functions
We can use that to do ‘partial counts’
SELECT party, lag,
CASE WHEN lag > 0 THEN 1 END AS up,
CASE WHEN lag < 0 THEN 1 END AS down
FROM v
ORDER BY year;
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
58. We can move on from there easily
SELECT party,
count(CASE WHEN lag > 0 THEN 1 END) AS up,
count(CASE WHEN lag < 0 THEN 1 END) AS down
FROM v
GROUP BY party;
party | up | down
------------+----+------
Democrat | 9 | 8
Republican | 10 | 18
(2 rows)
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
60. Preparing our sample data
test=# UPDATE t_oil
SET production = NULL
WHERE year IN (1998, 1999)
AND country = ’USA’ RETURNING *;
country | year | production
---------+------+------------
USA | 1998 |
USA | 1999 |
(2 rows)
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
61. Challenges ahead
How can we make lead and lag work again?
How can we fill the gaps?
How can we control our behavior in a more efficient way?
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
62. Turning to frame-clauses once again
One idea is to just use the average of some previous values
However, you might also want to turn to interpolation or
outright guess work
A custom aggregate might help
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
63. A ‘lazy’ idea
Creating an array with some historic values
Applying a function on this array
SELECT year, production, array_agg(production)
OVER (ORDER BY year ROWS BETWEEN 3 PRECEDING
AND 0 FOLLOWING)
FROM t_oil
WHERE country = ’USA’;
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
65. Applying a function
A simple function could look like this:
SELECT avg(x)
FROM unnest(’{8295,8269,NULL,NULL}’::int4[]) AS x;
avg
-----------------------
8282.0000000000000000
(1 row)
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
66. A query could therefore look like this
SELECT *, (SELECT avg(x) FROM unnest(array_agg) AS x)
FROM (SELECT year, production, array_agg(production)
OVER (ORDER BY year ROWS BETWEEN 3 PRECEDING
AND 0 FOLLOWING)
FROM t_oil WHERE country = ’USA’) AS y
OFFSET 32 LIMIT 4;
year | production | array_agg | avg
------+------------+-----------------------+-------------
1997 | 8269 | {8389,8322,8295,8269} | 8318.750000
1998 | | {8322,8295,8269,NULL} | 8295.333333
1999 | | {8295,8269,NULL,NULL} | 8282.000000
2000 | 7733 | {8269,NULL,NULL,7733} | 8001.000000
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
67. Defining an aggregate
Defining an aggregate is really the more desirable way
It is ways more clean
CREATE AGGREGATE is your friend
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
68. A simple example
the aggregate can be created like this:
CREATE FUNCTION my_final(int[]) RETURNS numeric AS
$$
SELECT avg(x) FROM unnest($1) AS x;
$$ LANGUAGE sql;
CREATE AGGREGATE artificial_avg(int) (
SFUNC = array_append,
STYPE = int[],
INITCOND = ’{}’,
FINALFUNC = my_final
);
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
69. Using our new aggregate
SELECT year, production, artificial_avg(production)
OVER (ORDER BY year
ROWS BETWEEN 3 PRECEDING AND 0 FOLLOWING)
FROM t_oil WHERE country = ’USA’;
the aggregate can be used just like any other aggregate in the
system
Hans-J¨urgen Sch¨onig
www.postgresql-support.de
71. Thank you for your attention
Cybertec Sch¨onig & Sch¨onig GmbH
Gr¨ohrm¨uhlgasse 26
A-2700 Wiener Neustadt
www.postgresql-support.de
Hans-J¨urgen Sch¨onig
www.postgresql-support.de