@martin_loetzsch
Dr. Martin Loetzsch
PostgreSQL Meetup Berlin May 2018
Data Integration with PostgreSQL
All the data of the company in one place


Data is
the single source of truth
cleaned up & validated
documented
embedded into the organisation
Integration of different domains





















Main challenges
Consistency & correctness
Changeability
Complexity
Transparency
!2
Data warehouse = integrated data
@martin_loetzsch
Nowadays required for running a business
[DWH diagram: sources (application databases, events, CSV files, APIs) → DWH (orders, users, products, price histories, emails, clicks, operational events, …) → consumers (reporting, CRM, marketing, search, pricing, …)]
!3
Which technology?
@martin_loetzsch
PostgreSQL as a data integration engine
@martin_loetzsch
!4
Target of computation


CREATE TABLE m_dim_next.region (

region_id SMALLINT PRIMARY KEY,

region_name TEXT NOT NULL UNIQUE,

country_id SMALLINT NOT NULL,

country_name TEXT NOT NULL,

_region_name TEXT NOT NULL

);



Do computation and store result in table


WITH raw_region
AS (SELECT DISTINCT
country,

region

FROM m_data.ga_session

ORDER BY country, region)



INSERT INTO m_dim_next.region
SELECT
row_number()
OVER (ORDER BY country, region ) AS region_id,

CASE WHEN (SELECT count(DISTINCT country)
FROM raw_region r2
WHERE r2.region = r1.region) > 1
THEN region || ' / ' || country
ELSE region END AS region_name,
dense_rank() OVER (ORDER BY country) AS country_id,
country AS country_name,
region AS _region_name
FROM raw_region r1;

INSERT INTO m_dim_next.region
VALUES (-1, 'Unknown', -1, 'Unknown', 'Unknown');

Speedup subsequent transformations


SELECT util.add_index(
'm_dim_next', 'region',
column_names := ARRAY ['_region_name', 'country_name',
'region_id']);



SELECT util.add_index(
'm_dim_next', 'region',
column_names := ARRAY ['country_id', 'region_id']);



ANALYZE m_dim_next.region;
!5
Leave data in Postgres, run SQL queries
@martin_loetzsch
Tables as (intermediate) results of processing steps
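

util.add_index is not built into Postgres and is not shown in the talk; a minimal sketch of what such a helper could look like (hypothetical implementation, signature inferred from the calls above)


CREATE FUNCTION util.add_index(schemaname TEXT, tablename TEXT,
                               column_names TEXT[])
RETURNS VOID AS $$
BEGIN
  -- build and run a plain CREATE INDEX over the given columns
  EXECUTE 'CREATE INDEX ON ' || schemaname || '.' || tablename
          || ' (' || array_to_string(column_names, ', ') || ')';
END
$$ LANGUAGE plpgsql;
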
At the beginning of a data integration pipeline


DROP SCHEMA IF EXISTS m_dim_next CASCADE;

CREATE SCHEMA m_dim_next;



DROP SCHEMA IF EXISTS m_tmp CASCADE;
CREATE SCHEMA m_tmp;





Do all transformations


CREATE TABLE m_tmp.foo

AS SELECT n
FROM generate_series(0, 10000) n;

CREATE TABLE m_dim_next.foo
AS SELECT
n,
n + 1
FROM m_tmp.foo;

-- ..

At the end of pipeline, after checks


DROP SCHEMA IF EXISTS m_dim CASCADE;
ALTER SCHEMA m_dim_next RENAME TO m_dim;











Build m_dim_next schema while users are seeing m_dim
- Data shown in frontend is always consistent
- When ETL breaks, data just gets old

Explicit operations determine structure and content of DWH
- Very developer / git friendly
- Requirement: Must be possible to rebuild DWH from sources
- Add incremental loading / processing when there is a pain
!6
Drop & rebuild
@martin_loetzsch
Don’t bother with migrations
Renaming / dropping schemas requires exclusive locks


CREATE FUNCTION util.cancel_queries_on_schema(schema TEXT)
RETURNS BOOLEAN AS $$
SELECT pg_cancel_backend(pid)
FROM
(SELECT DISTINCT pid
FROM pg_locks
JOIN pg_database ON database = pg_database.oid
JOIN pg_class ON pg_class.oid = relation
JOIN pg_namespace ON relnamespace = pg_namespace.oid
WHERE datname = current_database() AND nspname = schema
AND pid != pg_backend_pid()) t;
$$ LANGUAGE SQL;





First rename, then drop


CREATE FUNCTION util.replace_schema(schemaname TEXT,
replace_with TEXT)
RETURNS VOID AS $$
BEGIN
PERFORM util.cancel_queries_on_schema(schemaname);
IF EXISTS(SELECT *
FROM information_schema.schemata s
WHERE s.schema_name = schemaname)
THEN
EXECUTE 'ALTER SCHEMA ' || schemaname
|| ' RENAME TO ' || schemaname || '_old;';
END IF;
EXECUTE 'ALTER SCHEMA ' || replace_with
|| ' RENAME TO ' || schemaname || ';';

-- again, for good measure
PERFORM util.cancel_queries_on_schema(schemaname);

EXECUTE 'DROP SCHEMA IF EXISTS ' || schemaname
|| '_old CASCADE;';
END;
$$ LANGUAGE 'plpgsql'









Nice one-liner


SELECT util.replace_schema('m_dim', 'm_dim_next');
!7
Atomic & robust schema switches
@martin_loetzsch
Transactional DDL FTW
Each pipeline maintains its own schemas
- Never write to schemas of other pipelines
- Only read from “public” schemas of other pipelines



Project A schema naming conventions
mp_dim: public tables of pipeline my-pipeline
mp_dim_next: schema for building the next version of mp_dim
mp_tmp: temporary tables that are not considered public
mp_data: tables that are not always dropped (for incremental loading)



Optional: copy “public” schema from ETL db to frontend db
!8
Schemas & pipelines
@martin_loetzsch
Sometimes confusing for analysts
128 GB machine, highly overcommitted memory


max_connections = 200

temp_buffers = 2GB
work_mem = 2GB







Postgres gracefully restarts when it can’t allocate more memory


LOG: unexpected EOF on client connection with an open transaction
LOG: server process (PID 17697) was terminated by signal 9: Killed
DETAIL: Failed process was running: SELECT
m_tmp.index_tmp_touchpoint();
LOG: terminating any other active server processes
WARNING: terminating connection because of crash of another server
process
DETAIL: The postmaster has commanded this server process to roll
back the current transaction and exit, because another server
process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database
and repeat your command.
DROP SCHEMA x CASCADE;


max_locks_per_transaction = 4000











Speed up IO by crippling WAL


wal_level = minimal
fsync = off
synchronous_commit = off
full_page_writes = off
wal_buffers = -1
No protection against hardware failure!
!9
Tuning Postgres for drop & rebuild
@martin_loetzsch
Not recommended for any other use case
Running queries


cat query.sql 

| PGOPTIONS=--client-min-messages=warning psql --no-psqlrc 

--set ON_ERROR_STOP=on etl_db



echo "SELECT s.my_function()" 

| PGOPTIONS=--client-min-messages=warning psql --no-psqlrc 

--set ON_ERROR_STOP=on etl_db









Loading files


unzip -p data.csv 

| python mapper_script.py 

| PGOPTIONS=--client-min-messages=warning psql --no-psqlrc 

--set ON_ERROR_STOP=on etl_db 
--command="COPY s.target_table FROM STDIN"

Loading from other databases


cat source-query.sql 

| mysql --skip-column-names source_db 

| PGOPTIONS=--client-min-messages=warning psql --no-psqlrc 

--set ON_ERROR_STOP=on etl_db 
--command="COPY s.target_table FROM STDIN
WITH NULL AS 'NULL'"











Don’t use ORMS / client libraries

Use “/usr/bin/env bash -o pipefail” as shell
!10
Main interface to Postgres: psql
@martin_loetzsch
Hassle-free & fast
Consistency & correctness
@martin_loetzsch
!11
It’s easy to make mistakes during ETL


DROP SCHEMA IF EXISTS s CASCADE; CREATE SCHEMA s;



CREATE TABLE s.city (
city_id SMALLINT,
city_name TEXT,
country_name TEXT
);

INSERT INTO s.city VALUES
(1, 'Berlin', 'Germany'),
(2, 'Budapest', 'Hungary');



CREATE TABLE s.customer (
customer_id BIGINT,
city_fk SMALLINT
);

INSERT INTO s.customer VALUES
(1, 1),
(1, 2),
(2, 3);

Customers per country?


SELECT
country_name,
count(*) AS number_of_customers
FROM s.customer JOIN s.city 

ON customer.city_fk = s.city.city_id
GROUP BY country_name;



Back up all assumptions about data by constraints


ALTER TABLE s.city ADD PRIMARY KEY (city_id);
ALTER TABLE s.city ADD UNIQUE (city_name);
ALTER TABLE s.city ADD UNIQUE (city_name, country_name);


ALTER TABLE s.customer ADD PRIMARY KEY (customer_id);
[23505] ERROR: could not create unique index "customer_pkey"
Detail: Key (customer_id)=(1) is duplicated.

ALTER TABLE s.customer ADD FOREIGN KEY (city_fk)
REFERENCES s.city (city_id);
[23503] ERROR: insert or update on table "customer" violates
foreign key constraint "customer_city_fk_fkey"
Detail: Key (city_fk)=(3) is not present in table "city"
!12
Referential consistency
@martin_loetzsch
Only very little overhead, will save your ass
[Schema diagram (2017-10-18-dwh-schema-pav.svg): customer (customer_id, first_order_fk, favourite_product_fk, lifetime_revenue), product (product_id, revenue_last_6_months), order (order_id, processed_order_id, customer_fk, product_fk, revenue)]
Never repeat “business logic”


SELECT sum(total_price) AS revenue
FROM os_data.order
WHERE status IN ('pending', 'accepted', 'completed',

'proposal_for_change');




SELECT CASE WHEN (status <> 'started'
AND payment_status = 'authorised'
AND order_type <> 'backend')
THEN order_id END AS processed_order_fk
FROM os_data.order;



SELECT (last_status = 'pending') :: INTEGER AS is_unprocessed
FROM os_data.order;









Refactor pipeline
Create separate task that computes everything we know about an order
Usually difficult in real life











Load → preprocess → transform → flatten-fact
!13
Computational consistency
@martin_loetzsch
Requires discipline
load-product load-order load-customer
preprocess-product preprocess-order preprocess-customer
transform-product transform-order transform-customer
flatten-product-fact flatten-order-fact flatten-customer-fact
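
A minimal sketch of such a preprocess step that computes everything we know about an order in one place (the table name os_dim_next.processed_order and the exact flags are assumptions for illustration, not the original code)


CREATE TABLE os_dim_next.processed_order AS
SELECT
  order_id,
  -- the "processed" definition lives here and only here
  CASE WHEN status <> 'started'
            AND payment_status = 'authorised'
            AND order_type <> 'backend'
       THEN order_id END                  AS processed_order_fk,
  (last_status = 'pending') :: INTEGER    AS is_unprocessed
FROM os_data."order";  -- "order" needs quoting, it is a reserved word

Downstream transformations then join on these columns instead of repeating the CASE expressions.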
CREATE FUNCTION m_tmp.normalize_utm_source(TEXT)
RETURNS TEXT AS $$
SELECT
CASE
WHEN $1 LIKE '%.%' THEN lower($1)
WHEN $1 = '(direct)' THEN 'Direct'
WHEN $1 LIKE 'Untracked%' OR $1 LIKE '(%)'
THEN $1
ELSE initcap($1)
END;
$$ LANGUAGE SQL IMMUTABLE;

CREATE FUNCTION util.norm_phone_number(phone_number TEXT)
RETURNS TEXT AS $$
BEGIN
phone_number := TRIM(phone_number);
phone_number := regexp_replace(phone_number, '\(0\)', '');
phone_number
:= regexp_replace(phone_number, '[^[:digit:]]', '', 'g');
phone_number
:= regexp_replace(phone_number, '^(\+49|0049|49)', '0');
phone_number := regexp_replace(phone_number, '^(00)', '');
phone_number := COALESCE(phone_number, '');
RETURN phone_number;
END;
$$ LANGUAGE PLPGSQL IMMUTABLE;

CREATE FUNCTION m_tmp.compute_ad_id(id BIGINT, api m_tmp.API)
RETURNS BIGINT AS $$
-- creates a collision free ad id from an id in a source system
SELECT ((CASE api
WHEN 'adwords' THEN 1
WHEN 'bing' THEN 2
WHEN 'criteo' THEN 3
WHEN 'facebook' THEN 4
WHEN 'backend' THEN 5
END) * 10 ^ 18) :: BIGINT + id
$$ LANGUAGE SQL IMMUTABLE;





CREATE FUNCTION pv.date_to_supplier_period_start(INTEGER)
RETURNS INTEGER AS $$
-- this maps all dates to either an integer which is included
-- in lieferantenrabatt.period_start or
-- null (meaning we don't have a lieferantenrabatt for it)
SELECT
CASE
WHEN $1 >= 20170501 THEN 20170501
WHEN $1 >= 20151231 THEN 20151231
ELSE 20151231
END;
$$ LANGUAGE SQL IMMUTABLE;
!14
When not possible: use functions
@martin_loetzsch
Almost no performance overhead
Check for “lost” rows


SELECT util.assert_equal(
'The order items fact table should contain all order items',
'SELECT count(*) FROM os_dim.order_item',
'SELECT count(*) FROM os_dim.order_items_fact');







Check consistency across cubes / domains


SELECT util.assert_almost_equal(
'The number of first orders should be the same in '

|| 'orders and marketing touchpoints cube',
'SELECT count(net_order_id)
FROM os_dim.order
WHERE _net_order_rank = 1;',

'SELECT (SELECT sum(number_of_first_net_orders)
FROM m_dim.acquisition_performance)
/ (SELECT count(*)
FROM m_dim.performance_attribution_model)',
1.0
);

Check completeness of source data


SELECT util.assert_not_found(
'Each adwords campaign must have the attribute "Channel"',
'SELECT DISTINCT campaign_name, account_name
FROM aw_tmp.ad
JOIN aw_dim.ad_performance ON ad_fk = ad_id
WHERE attributes->>''Channel'' IS NULL
AND impressions > 0
AND _date > now() - INTERVAL ''30 days''');



Check correctness of redistribution transformations


SELECT util.assert_almost_equal_relative(
'The cost of non-converting touchpoints must match the '
|| 'redistributed customer acquisition and reactivation cost',
'SELECT sum(cost)
FROM m_tmp.cost_of_non_converting_touchpoints;',
'SELECT
(SELECT sum(cost_per_touchpoint * number_of_touchpoints)
FROM m_tmp.redistributed_customer_acquisition_cost)
+ (SELECT sum(cost_per_touchpoint * number_of_touchpoints)
FROM m_tmp.redistributed_customer_reactivation_cost);',
0.00001);
!15
Data consistency checks
@martin_loetzsch
Makes changing things easy
Execute queries and compare results

CREATE FUNCTION util.assert(description TEXT, query TEXT)
RETURNS BOOLEAN AS $$
DECLARE
succeeded BOOLEAN;
BEGIN
EXECUTE query INTO succeeded;
IF NOT succeeded THEN RAISE EXCEPTION 'assertion failed:
# % #
%', description, query;
END IF;
RETURN succeeded;
END
$$ LANGUAGE 'plpgsql';







CREATE FUNCTION util.assert_almost_equal_relative(
description TEXT, query1 TEXT,
query2 TEXT, percentage DECIMAL)
RETURNS BOOLEAN AS $$
DECLARE
result1 NUMERIC;
result2 NUMERIC;
succeeded BOOLEAN;
BEGIN
EXECUTE query1 INTO result1;
EXECUTE query2 INTO result2;
EXECUTE 'SELECT abs(' || result2 || ' - ' || result1 || ') / '
|| result1 || ' < ' || percentage INTO succeeded;
IF NOT succeeded THEN RAISE WARNING '%
assertion failed: abs(% - %) / % < %
%: (%)
%: (%)', description, result2, result1, result1, percentage,
result1, query1, result2, query2;
END IF;
RETURN succeeded;
END
$$ LANGUAGE 'plpgsql';
!16
Consistency check functions
@martin_loetzsch
Also: assert_not_found, assert_equal_table, assert_smaller_than_or_equal
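
util.assert_almost_equal (used on the previous slide) is not shown; a sketch of how it could look, following the same conventions but with an absolute instead of a relative tolerance (hypothetical implementation)


CREATE FUNCTION util.assert_almost_equal(description TEXT, query1 TEXT,
                                         query2 TEXT, delta DECIMAL)
RETURNS BOOLEAN AS $$
DECLARE
  result1 NUMERIC;
  result2 NUMERIC;
  succeeded BOOLEAN;
BEGIN
  EXECUTE query1 INTO result1;
  EXECUTE query2 INTO result2;
  -- compare with an absolute tolerance
  succeeded := abs(result2 - result1) <= delta;
  IF NOT succeeded THEN RAISE WARNING '%
assertion failed: abs(% - %) <= %
%: (%)
%: (%)', description, result2, result1, delta,
          result1, query1, result2, query2;
  END IF;
  RETURN succeeded;
END
$$ LANGUAGE 'plpgsql';
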
Yes, unit tests
SELECT util.assert_value_equal('test_german_number_with_country_prefix', util.norm_phone_number('00491234'), '01234');
SELECT util.assert_value_equal('test_german_number_not_to_be_confused_with_country_prefix', util.norm_phone_number('0491234'), '0491234');
SELECT util.assert_value_equal('test_non_german_number_with_plus', util.norm_phone_number('+44 1234'), '441234');
SELECT util.assert_value_equal('test_german_number_with_prefix_and_additional_zero', util.norm_phone_number('+49 (0)1234'), '01234');
SELECT util.assert_value_equal('test__trim', util.norm_phone_number(' 0491234 '), '0491234');
SELECT util.assert_value_equal('test_number_with_leading_wildcard_symbol', util.norm_phone_number('*+436504834933'), '436504834933');
SELECT util.assert_value_equal('test_NULL', util.norm_phone_number(NULL), '');
SELECT util.assert_value_equal('test_empty', util.norm_phone_number(''), '');
SELECT util.assert_value_equal('test_wildcard_only', util.norm_phone_number('*'), '');
SELECT util.assert_value_equal('test_foreign_number_with_two_leading_zeroes', util.norm_phone_number('*00436769553701'), '436769553701');
SELECT util.assert_value_equal('test_domestic_number_with_trailing_letters', util.norm_phone_number('017678402HORST'), '017678402');
SELECT util.assert_value_equal('test_domestic_number_with_leading_letters', util.norm_phone_number('HORST017678402'), '017678402');
SELECT util.assert_value_equal('test_domestic_number_with_letters_in_between', util.norm_phone_number('0H1O7R6S7T8402'), '017678402');
SELECT util.assert_value_equal('test_german_number_with_country_prefix_and_leading_letters',
util.norm_phone_number('HORST00491234'), '01234');
SELECT util.assert_value_equal('test_german_number_not_to_be_confused_with_country_prefix_and_leading_letters',
util.norm_phone_number('HORST0491234'), '0491234');
SELECT util.assert_value_equal('test_non_german_number_with_plus_and_leading_letters', util.norm_phone_number('HORST+44 1234'), '441234');
SELECT util.assert_value_equal('test_german_number_with_prefix_and_additional_zero_and_leading_letters',
util.norm_phone_number('HORST+49 (0)1234'), '01234');
!17
Unit tests
@martin_loetzsch
People enter horrible telephone numbers into websites
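
util.assert_value_equal is not shown either; a minimal sketch (hypothetical implementation)


CREATE FUNCTION util.assert_value_equal(description TEXT, actual TEXT,
                                        expected TEXT)
RETURNS BOOLEAN AS $$
BEGIN
  -- IS DISTINCT FROM also treats NULL vs. non-NULL as a mismatch
  IF actual IS DISTINCT FROM expected THEN
    RAISE EXCEPTION 'assertion failed: %
expected %, got %', description, expected, actual;
  END IF;
  RETURN TRUE;
END
$$ LANGUAGE 'plpgsql';
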
Contribution margin 3a
SELECT order_item_id,
((((((COALESCE(item_net_price, 0)::REAL
+ COALESCE(net_shipping_revenue, 0)::REAL)
- ((COALESCE(item_net_purchase_price, 0)::REAL
+ COALESCE(alcohol_tax, 0)::REAL)
+ COALESCE(import_tax, 0)::REAL))
- (COALESCE(net_fulfillment_costs, 0)::REAL
+ COALESCE(net_payment_costs, 0)::REAL))
- COALESCE(net_return_costs, 0)::REAL)
- ((COALESCE(item_net_price, 0)::REAL
+ COALESCE(net_shipping_revenue, 0)::REAL)
- ((((COALESCE(item_net_price, 0)::REAL
+ COALESCE(item_tax_amount, 0)::REAL)
+ COALESCE(gross_shipping_revenue, 0)::REAL)
- COALESCE(voucher_gross_amount, 0)::REAL)
* (1 - ((COALESCE(item_tax_amount, 0)::REAL
+ (COALESCE(gross_shipping_revenue, 0)::REAL
- COALESCE(net_shipping_revenue, 0)::REAL))
/ NULLIF(((COALESCE(item_net_price, 0)::REAL
+ COALESCE(item_tax_amount, 0)::REAL)
+ COALESCE(gross_shipping_revenue,
0)::REAL), 0))))))
- COALESCE(goodie_cost_per_item, 0)::REAL) :: DOUBLE PRECISION
AS "Contribution margin 3a"
FROM dim.sales_fact;
Use schemas between reporting and database
Mondrian
LookerML
your own
Or: Pre-compute metrics in database
!18
Semantic consistency
@martin_loetzsch
Changing the meaning of metrics across all dashboards needs to be easy
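
One way to pre-compute metrics in the database: define each metric once in a view and point all dashboards at it (the schema name report and the simplified formula below are assumptions for illustration)


CREATE SCHEMA IF NOT EXISTS report;

CREATE VIEW report.sales AS
SELECT
  order_item_id,
  -- single place that defines "net revenue"; dashboards only read this column
  COALESCE(item_net_price, 0) + COALESCE(net_shipping_revenue, 0)
    AS "Net revenue"
FROM dim.sales_fact;

Changing the metric then means changing one view, not every dashboard.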
Divide & conquer strategies
@martin_loetzsch
!19
Create two tables, fill with 10M values


DROP SCHEMA IF EXISTS s CASCADE; CREATE SCHEMA s;




CREATE TABLE s.table_1 (
user_id BIGINT,
some_number INTEGER);


INSERT INTO s.table_1
SELECT n, n FROM generate_series(1, 10000000) n;




CREATE TABLE s.table_2 AS SELECT * FROM s.table_1;

Join both tables


EXPLAIN ANALYZE
SELECT *
FROM s.table_1
JOIN s.table_2 USING (user_id);



Let’s assume 18 seconds is slow


Merge Join (rows=10000000)

Merge Cond: (table_1.user_id = table_2.user_id)

-> Sort (rows=10000000)

Sort Key: table_1.user_id

Sort Method: external merge Disk: 215032kB

-> Seq Scan on table_1 (rows=10000000)

-> Materialize (rows=10000000)

-> Sort (rows=10000000)

Sort Key: table_2.user_id

Sort Method: external merge Disk: 254144kB

-> Seq Scan on table_2 (rows=10000000)

Planning time: 0.700 ms

Execution time: 18278.520 ms
!20
Joining large tables can be slow
@martin_loetzsch
Think of: join all user touchpoints with all user orders
Create table_1 with 5 partitions


DROP SCHEMA IF EXISTS s CASCADE; CREATE SCHEMA s;


CREATE TABLE s.table_1 (
user_id BIGINT NOT NULL,
user_chunk SMALLINT NOT NULL,
some_number INTEGER NOT NULL);




CREATE TABLE s.table_1_0 (CHECK (user_chunk = 0))
INHERITS (s.table_1);


CREATE TABLE s.table_1_1 (CHECK (user_chunk = 1))
INHERITS (s.table_1);


CREATE TABLE s.table_1_2 (CHECK (user_chunk = 2))
INHERITS (s.table_1);


CREATE TABLE s.table_1_3 (CHECK (user_chunk = 3))
INHERITS (s.table_1);


CREATE TABLE s.table_1_4 (CHECK (user_chunk = 4))
INHERITS (s.table_1);

Insert directly into partition (don’t use triggers etc.)


INSERT INTO s.table_1_0
SELECT n, n % 5, n from generate_series(0, 10000000, 5) n;


Automate insertion


CREATE FUNCTION s.insert_data(table_name TEXT, chunk SMALLINT)
RETURNS VOID AS $$
BEGIN
EXECUTE 'INSERT INTO ' || table_name || '_' || chunk
|| ' SELECT n, n % 5, n from generate_series('
|| chunk || ', 10000000, 5) n';
EXECUTE 'ANALYZE ' || table_name || '_' || chunk;
END;
$$ LANGUAGE plpgsql;

Run in parallel (depends on ETL framework)


SELECT s.insert_data('s.table_1', 1 :: SMALLINT);
SELECT s.insert_data('s.table_1', 2 :: SMALLINT);
SELECT s.insert_data('s.table_1', 3 :: SMALLINT);
SELECT s.insert_data('s.table_1', 4 :: SMALLINT);
!21
Splitting data in chunks / partitions
@martin_loetzsch
user chunk = user_id % 5
Build DDL statement and execute it


CREATE FUNCTION s.create_table_partitions(schemaname TEXT,
tablename TEXT,
key_column TEXT,
keys TEXT[])
RETURNS VOID AS $$
DECLARE key TEXT;
BEGIN
FOREACH KEY IN ARRAY keys LOOP
IF
NOT EXISTS(SELECT 1
FROM information_schema.tables t
WHERE t.table_schema = schemaname
AND t.table_name = tablename || '_' || key)
THEN
EXECUTE 'CREATE TABLE ' || schemaname || '.' || tablename
|| '_' || key || ' ( CHECK (' || key_column
|| ' = ' || key || ') ) INHERITS (' || schemaname
|| '.' || tablename || ');';
END IF;
END LOOP;
END
$$ LANGUAGE plpgsql;

Create table_2


CREATE TABLE s.table_2 (
user_id BIGINT NOT NULL,
user_chunk SMALLINT NOT NULL,
some_number INTEGER NOT NULL
);


SELECT s.create_table_partitions(
's', 'table_2', 'user_chunk',
(SELECT array(SELECT n :: TEXT FROM generate_series(0, 4) n)));









Run in parallel (depends on ETL framework):


SELECT s.insert_data('s.table_2', 0 :: SMALLINT);
SELECT s.insert_data('s.table_2', 1 :: SMALLINT);
SELECT s.insert_data('s.table_2', 2 :: SMALLINT);
SELECT s.insert_data('s.table_2', 3 :: SMALLINT);
SELECT s.insert_data('s.table_2', 4 :: SMALLINT);
!22
Automating partition management
@martin_loetzsch
Also great: pg_partman, Postgres 10 declarative partitioning
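
The same chunking with Postgres 10 declarative partitioning, as an alternative to the INHERITS + CHECK constraint setup above (a sketch)


CREATE TABLE s.table_2 (
  user_id     BIGINT NOT NULL,
  user_chunk  SMALLINT NOT NULL,
  some_number INTEGER NOT NULL
) PARTITION BY LIST (user_chunk);

CREATE TABLE s.table_2_0 PARTITION OF s.table_2 FOR VALUES IN (0);
CREATE TABLE s.table_2_1 PARTITION OF s.table_2 FOR VALUES IN (1);
CREATE TABLE s.table_2_2 PARTITION OF s.table_2 FOR VALUES IN (2);
CREATE TABLE s.table_2_3 PARTITION OF s.table_2 FOR VALUES IN (3);
CREATE TABLE s.table_2_4 PARTITION OF s.table_2 FOR VALUES IN (4);
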
Join table_1 and table_2 for chunk 0


EXPLAIN ANALYZE
SELECT
user_id,
0 AS user_chunk,
table_1.some_number + table_2.some_number AS sum
FROM s.table_1
JOIN s.table_2 USING (user_id)
WHERE table_1.user_chunk = 0 AND table_2.user_chunk = 0;

Similar to original join (18 seconds), but almost 5x faster


Merge Join
Merge Cond: (table_1.user_id = table_2.user_id)
-> Sort (rows=2000001)
Sort Key: table_1.user_id
Sort Method: external merge Disk: 43008kB
-> Append (rows=2000001)
-> Seq Scan on table_1 (rows=0)
Filter: (user_chunk = 0)
-> Seq Scan on table_1_0 (rows=2000001)
Filter: (user_chunk = 0)
-> Sort (rows=2000001)
Sort Key: table_2.user_id
Sort Method: external sort Disk: 58664kB
-> Append (rows=2000001)
-> Seq Scan on table_2 (rows=0)
Filter: (user_chunk = 0)
-> Seq Scan on table_2_0 (rows=2000001)
Filter: (user_chunk = 0)
Planning time: 5.171 ms
Execution time: 3901.282 ms
!23
Querying chunked data
@martin_loetzsch
No chunk restriction on table_2


EXPLAIN ANALYZE
SELECT
user_id,
0 AS user_chunk,
table_1.some_number + table_2.some_number AS sum
FROM s.table_1
JOIN s.table_2 USING (user_id)
WHERE table_1.user_chunk = 0;

Append operator quite costly


Merge Join (rows=2000001)
Merge Cond: (table_1.user_id = table_2.user_id)
-> Sort (rows=2000001)
Sort Key: table_1.user_id
Sort Method: external merge Disk: 43008kB
-> Append (rows=2000001)
-> Seq Scan on table_1 (rows=0)
Filter: (user_chunk = 0)
-> Seq Scan on table_1_0 (rows=2000001)
Filter: (user_chunk = 0)
-> Materialize (rows=10000001)
-> Sort (rows=10000001)
Sort Key: table_2.user_id
Sort Method: external merge Disk: 254144kB
-> Append (rows=10000001)
-> Seq Scan on table_2 (rows=0)
-> Seq Scan on table_2_0 (rows=2000001)
-> Seq Scan on table_2_1 (rows=2000000)
-> Seq Scan on table_2_2 (rows=2000000)
-> Seq Scan on table_2_3 (rows=2000000)
-> Seq Scan on table_2_4 (rows=2000000)
Planning time: 0.371 ms
Execution time: 11654.858 ms
!24
Common mistake
@martin_loetzsch
Check for sequence scans on complete partitioned tables
Create partitioned table for target of computation


CREATE TABLE s.table_3 (
user_id BIGINT NOT NULL,
user_chunk SMALLINT NOT NULL,
sum INTEGER NOT NULL);

SELECT s.create_table_partitions(
's', 'table_3', 'user_chunk',
(SELECT array(SELECT n :: TEXT FROM generate_series(0, 4) n)));

A function for the computation


CREATE FUNCTION s.join_table_1_and_table_2(chunk SMALLINT)
RETURNS SETOF s.table_3 AS $$
BEGIN
RETURN QUERY
SELECT
user_id,
chunk,
table_1.some_number + table_2.some_number AS sum
FROM s.table_1 JOIN s.table_2 USING (user_id)
WHERE table_1.user_chunk = chunk AND table_2.user_chunk = chunk;
END
$$ LANGUAGE plpgsql;
Insert from a separate function


CREATE OR REPLACE FUNCTION s.insert_table_3(chunk INTEGER)
RETURNS VOID AS $$
BEGIN
EXECUTE 'INSERT INTO s.table_3_' || chunk
|| ' (SELECT * FROM s.join_table_1_and_table_2('
|| chunk || '::SMALLINT))';
END
$$ LANGUAGE plpgsql;











Run in parallel (depends on ETL framework)


SELECT s.insert_table_3(0);
SELECT s.insert_table_3(1);
SELECT s.insert_table_3(2);
SELECT s.insert_table_3(3);
SELECT s.insert_table_3(4);
!25
Using functions for chunked processing
@martin_loetzsch
Our best practices, depends on ETL framework
[Pipeline DAG (real-world example): read-ga-session, map-ga-visitor, preprocess-adwords-ad, preprocess-criteo-campaign, preprocess-facebook-ad-set, transform-device, transform-host, transform-landing-page, transform-region, update-campaign-tree, preprocess-touchpoint, preprocess-ad, preprocess-campaign-tree, map-manual-cost, transform-ad-attribute, transform-ad, transform-acquisition-performance, transform-reactivation-performance, count-converting-touchpoints, transform-marketing-cost, compute-first-touchpoint, create-performance-attribution-model, read-manual-cost, flatten-touchpoints-fact, flatten-marketing-cost-fact, compute-redistributed-customer-acquisition-cost, compute-redistributed-customer-reactivation-cost, collect-non-converting-cost, compute-touchpoint-cost, transform-touchpoint]
Choose chunks depending on things that are processed together
Use secondary indexes on tables that are processed in different chunks
Re-chunk on chunk “borders”



Real world example


CREATE FUNCTION m_tmp.insert_dim_touchpoint(visitor_chunk SMALLINT)
RETURNS VOID AS $$
DECLARE day_chunk SMALLINT;
BEGIN
FOR day_chunk IN (SELECT util.get_all_chunks())
LOOP
EXECUTE 'INSERT INTO m_dim_next.touchpoint_' || visitor_chunk
|| ' (SELECT * FROM m_tmp.transform_touchpoint('
|| visitor_chunk || '::SMALLINT, ' || day_chunk ||
'::SMALLINT))';
END LOOP;
END
$$ LANGUAGE plpgsql;
!26
Chunking all the way down
@martin_loetzsch
Takes discipline
[Chunking dimensions: day, day_chunk, visitor_chunk, attribution_model]
Make changing chunk size easy


CREATE FUNCTION s.get_number_of_chunks()
RETURNS SMALLINT AS $$
SELECT 5 :: SMALLINT;
$$ LANGUAGE SQL IMMUTABLE;


CREATE FUNCTION s.get_all_chunks()
RETURNS SETOF INTEGER AS $$
SELECT generate_series(0, s.get_number_of_chunks() - 1);
$$ LANGUAGE SQL IMMUTABLE;


CREATE FUNCTION s.compute_chunk(x BIGINT)
RETURNS SMALLINT AS $$
SELECT coalesce(abs(x) % s.get_number_of_chunks(), 0) :: SMALLINT;
$$ LANGUAGE SQL IMMUTABLE;
Inline chunk computation


DROP FUNCTION s.compute_chunk( BIGINT );


DO $$
BEGIN
EXECUTE '
CREATE FUNCTION s.compute_chunk(x BIGINT)
RETURNS SMALLINT AS $_$
SELECT coalesce(abs(x) % ' || s.get_number_of_chunks() || ',
0)::SMALLINT;
$_$ LANGUAGE SQL IMMUTABLE;';
END
$$;
!27
Configurable chunking
@martin_loetzsch
Integrate with ETL framework
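
Usage (with the chunk count of 5 defined above)


SELECT s.compute_chunk(1234567 :: BIGINT);   -- 2
SELECT s.compute_chunk(-3 :: BIGINT);        -- 3 (abs() keeps chunks non-negative)
SELECT s.compute_chunk(NULL :: BIGINT);      -- 0 (coalesce catches NULL ids)
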
Becoming friends with the query planner
@martin_loetzsch
!28
Join a table with 10M values with a table with 10K values


DROP SCHEMA IF EXISTS s CASCADE; CREATE SCHEMA s;




CREATE TABLE s.table_1 AS
SELECT a
FROM generate_series(1, 10000) a;


CREATE TABLE s.table_2 AS
SELECT a
FROM generate_series(1, 10000000) a;


CREATE INDEX table_2__a ON s.table_2 (a);




EXPLAIN ANALYZE
SELECT *
FROM s.table_2
JOIN s.table_1 USING (a);

Index on table_2 is not used


Merge Join (rows=573750000) (actual rows=10000)
Merge Cond: (table_1.a = table_2.a)
-> Sort (rows=11475) (actual rows=10000)
Sort Key: table_1.a
Sort Method: quicksort Memory: 853kB
-> Seq Scan on table_1 (rows=11475) (actual rows=10000)
-> Materialize (rows=10000000) (actual rows=10001)
-> Sort (rows=10000000) (actual rows=10001)
Sort Key: table_2.a
Sort Method: external merge Disk: 136816kB
-> Seq Scan on table_2 (rows=10000000)
(actual rows=10000000)
Planning time: 2.173 ms
Execution time: 3071.127 ms



The query planner didn’t know that table_2 has much more
distinct values than table_1
Statistics about cardinality and distribution of data are collected by
asynchronous autovacuum daemon
!29
Nondeterministic behavior?
@martin_loetzsch
Queries sometimes don’t terminate when run in ETL, but look fine when run separately
Manually collect data statistics before using new tables


DROP SCHEMA IF EXISTS s CASCADE; CREATE SCHEMA s;




CREATE TABLE s.table_1 AS
SELECT a
FROM generate_series(1, 10000) a;


CREATE TABLE s.table_2 AS
SELECT a
FROM generate_series(1, 10000000) a;


CREATE INDEX table_2__a ON s.table_2 (a);




ANALYZE s.table_1;
ANALYZE s.table_2;




EXPLAIN ANALYZE
SELECT *
FROM s.table_2
JOIN s.table_1 USING (a);

> 300 x faster


Merge Join (rows=10000) (actual rows=10000)
Merge Cond: (table_2.a = table_1.a)
-> Index Only Scan using table_2__a on table_2 (rows=9999985)
(actual rows=10001)
Heap Fetches: 10001
-> Sort (rows=10000) (actual rows=10000)
Sort Key: table_1.a
Sort Method: quicksort Memory: 853kB
-> Seq Scan on table_1 (rows=10000) (rows=10000)
Planning time: 1.759 ms
Execution time: 9.287 ms
!30
Analyze all the things
@martin_loetzsch
Always analyze newly created tables
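
A possible helper in the spirit of util.add_index: create a table from a query and analyze it right away (hypothetical, not from the talk)


CREATE FUNCTION util.create_table_as(schemaname TEXT, tablename TEXT,
                                     query TEXT)
RETURNS VOID AS $$
BEGIN
  EXECUTE 'CREATE TABLE ' || schemaname || '.' || tablename
          || ' AS ' || query;
  -- collect statistics immediately so the planner knows the new table
  EXECUTE 'ANALYZE ' || schemaname || '.' || tablename;
END
$$ LANGUAGE plpgsql;
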
Add to postgresql.conf
session_preload_libraries = 'auto_explain'
auto_explain.log_min_duration = 10
auto_explain.log_nested_statements = true
auto_explain.log_verbose = true
auto_explain.log_analyze = true
debug_pretty_print = on
log_lock_waits = on
Carefully manage log file size
!31
Auto-explain everything
@martin_loetzsch
tail -f /path/to/postgres.log
Columnar storage in PostgreSQL
@martin_loetzsch
!32
!33
Postgres extension cstore_fdw
@martin_loetzsch
Quite hassle-free since version 1.6
Initialize extension


CREATE EXTENSION IF NOT EXISTS cstore_fdw;


DO $$
BEGIN
IF NOT (SELECT exists(SELECT 1
FROM pg_foreign_server
WHERE srvname = 'cstore_server'))
THEN CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;
END IF;
END
$$;



Create fact table


DROP SCHEMA IF EXISTS s CASCADE; CREATE SCHEMA s;


CREATE FOREIGN TABLE s.fact (
fact_id BIGSERIAL,
a TEXT,
b TEXT
)
SERVER cstore_server OPTIONS (compression 'pglz');
Insert 10M rows


INSERT INTO s.fact
SELECT
n,
('a ' || 1 + (4 * random()) :: INTEGER) AS a,
('b ' || 1 + (4 * random()) :: INTEGER) AS b
FROM generate_series(0, 10000000) n
ORDER BY a, b;







8% of original size


SELECT pg_size_pretty(pg_table_size('s.fact'));


37 MB
!34
Save disk space with cstore tables
@martin_loetzsch
Optimized Row Columnar (ORC) format
Typical OLAP query


EXPLAIN ANALYZE
SELECT
a,
count(*) AS count
FROM s.fact
WHERE b = 'b 1'
GROUP BY a;



5x faster than without cstore


HashAggregate (rows=5)
Group Key: a
-> Foreign Scan on fact (rows=1248701)
Filter: (b = 'b 1'::text)
Rows Removed by Filter: 41299
CStore File: /postgresql/9.6/cstore_fdw/836136/849220
CStore File Size: 39166772
Planning time: 10.057 ms
Execution time: 257.107 ms

Highly selective query


EXPLAIN ANALYZE
SELECT
count(*) AS count
FROM s.fact
WHERE a = 'a 1'
AND b = 'b 1';





Really fast


Aggregate (rows=1)
-> Foreign Scan on fact (rows=156676)
Filter: ((a = 'a 1'::text) AND (b = 'b 1'::text))
Rows Removed by Filter: 13324
CStore File: /postgresql/9.6/cstore_fdw/836136/849220
CStore File Size: 39166772
Planning time: 5.492 ms
Execution time: 31.689 ms
!35
Main operation: foreign scan
@martin_loetzsch
Aggregation pushdown coming soon
Consider PostgreSQL for your next ETL project
@martin_loetzsch
!36
Focus on the complexity of data
rather than the complexity of technology
@martin_loetzsch
!37
!38
We open sourced our BI infrastructure
@martin_loetzsch
MIT licensed, heavily relies on PostgreSQL
Avoid click-tools
hard to debug
hard to change
hard to scale with team size/ data complexity / data volume 

Data pipelines as code
SQL files, python & shell scripts
Structure & content of data warehouse are result of running code 

Easy to debug & inspect
Develop locally, test on staging system, then deploy to production
!39
Make changing and testing things easy
@martin_loetzsch
Apply standard software engineering best practices
Megabytes → plain scripts
Petabytes → Apache Airflow
In between → Mara
Example pipeline


pipeline = Pipeline(id='demo', description='A small pipeline ..')

pipeline.add(
    Task(id='ping_localhost', description='Pings localhost',
         commands=[RunBash('ping -c 3 localhost')]))
sub_pipeline = Pipeline(id='sub_pipeline', description='Pings ..')
for host in ['google', 'amazon', 'facebook']:
    sub_pipeline.add(
        Task(id=f'ping_{host}', description=f'Pings {host}',
             commands=[RunBash(f'ping -c 3 {host}.com')]))
sub_pipeline.add_dependency('ping_amazon', 'ping_facebook')
sub_pipeline.add(Task(id='ping_foo', description='Pings foo',
                      commands=[RunBash('ping foo')]),
                 upstreams=['ping_amazon'])
pipeline.add(sub_pipeline, upstreams=['ping_localhost'])
pipeline.add(Task(id='sleep', description='Sleeps for 2 seconds',
                  commands=[RunBash('sleep 2')]),
             upstreams=['sub_pipeline'])
!40
ETL pipelines as code
@martin_loetzsch
Pipeline = list of tasks with dependencies between them. Task = list of commands
Execute query


ExecuteSQL(sql_file_name="preprocess-ad.sql")
cat app/data_integration/pipelines/facebook/preprocess-ad.sql 
| PGTZ=Europe/Berlin PGOPTIONS=--client-min-messages=warning 
psql --username=mloetzsch --host=localhost --echo-all 
--no-psqlrc --set ON_ERROR_STOP=on kfz_dwh_etl

Read file


ReadFile(file_name="country_iso_code.csv",
compression=Compression.NONE,
target_table="os_data.country_iso_code",
mapper_script_file_name="read-country-iso-codes.py",
delimiter_char=";")
cat "dwh-data/country_iso_code.csv" 
| .venv/bin/python3.6 "app/data_integration/pipelines/load_data/
read-country-iso-codes.py" 
| PGTZ=Europe/Berlin PGOPTIONS=--client-min-messages=warning 
psql --username=mloetzsch --host=localhost --echo-all 
--no-psqlrc --set ON_ERROR_STOP=on kfz_dwh_etl 
--command="COPY os_data.country_iso_code FROM STDIN WITH CSV
DELIMITER AS ';'"

Copy from other databases


Copy(sql_file_name="pdm/load-product.sql", source_db_alias="pdm",
target_table="os_data.product",
replace={"@@db@@": "K24Pdm", "@@dbschema@@": "ps",
"@@client@@": "kfzteile24 GmbH"})

cat app/data_integration/pipelines/load_data/pdm/load-product.sql 
| sed "s/@@db@@/K24Pdm/g;s/@@dbschema@@/ps/g;s/@@client@@/
kfzteile24 GmbH/g" 
| sed 's/$/$/g;s/$/$/g' | (cat && echo ';') 
| (cat && echo ';
go') 
| sqsh -U ***** -P ******* -S ******* -D K24Pdm -m csv 
| PGTZ=Europe/Berlin PGOPTIONS=--client-min-messages=warning 
psql --username=mloetzsch --host=localhost --echo-all 
--no-psqlrc --set ON_ERROR_STOP=on kfz_dwh_etl 
--command="COPY os_data.product FROM STDIN WITH CSV HEADER"
!41
Shell commands as interface to data & DBs
@martin_loetzsch
Nothing is faster than a unix pipe
Runnable app
Integrates PyPI project download stats with 

Github repo events
!42
Try it out: Python project stats data warehouse
@martin_loetzsch
https://github.com/mara/mara-example-project
!43
Refer us a data person, earn 400€
@martin_loetzsch
Also analysts, developers, product managers
Thank you
@martin_loetzsch
!44

More Related Content

Recently uploaded

➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
amitlee9823
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
amitlee9823
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
karishmasinghjnh
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
JoseMangaJr1
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
amitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
only4webmaster01
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 

Recently uploaded (20)

➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning Approach
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 

Featured

How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
ThinkNow
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
Kurio // The Social Media Age(ncy)
 

Featured (20)

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPT
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 

Data Integration with PostgreSQL

  • 1. @martin_loetzsch Dr. Martin Loetzsch PostgreSQL Meetup Berlin May 2018 Data Integration with PostgreSQL
  • 2. All the data of the company in one place 
 Data is the single source of truth cleaned up & validated documented embedded into the organisation Integration of different domains
 
 
 
 
 
 
 
 
 
 
 Main challenges Consistency & correctness Changeability Complexity Transparency !2 Data warehouse = integrated data @martin_loetzsch Nowadays required for running a business application databases events csv files apis reporting crm marketing … search pricing DWH orders users products price 
 histories emails clicks … … operation
 events
  • 4. PostgreSQL as a data integration engine @martin_loetzsch !4
  • 5. Target of computation 
 CREATE TABLE m_dim_next.region (
 region_id SMALLINT PRIMARY KEY,
 region_name TEXT NOT NULL UNIQUE,
 country_id SMALLINT NOT NULL,
 country_name TEXT NOT NULL,
 _region_name TEXT NOT NULL
 );
 
 Do computation and store result in table 
 WITH raw_region AS (SELECT DISTINCT country,
 region
 FROM m_data.ga_session
 ORDER BY country, region)
 
 INSERT INTO m_dim_next.region SELECT row_number() OVER (ORDER BY country, region ) AS region_id,
 CASE WHEN (SELECT count(DISTINCT country) FROM raw_region r2 WHERE r2.region = r1.region) > 1 THEN region || ' / ' || country ELSE region END AS region_name, dense_rank() OVER (ORDER BY country) AS country_id, country AS country_name, region AS _region_name FROM raw_region r1;
 INSERT INTO m_dim_next.region VALUES (-1, 'Unknown', -1, 'Unknown', 'Unknown');
 Speedup subsequent transformations 
 SELECT util.add_index( 'm_dim_next', 'region', column_names := ARRAY ['_region_name', ‘country_name', 'region_id']);
 
 SELECT util.add_index( 'm_dim_next', 'region', column_names := ARRAY ['country_id', 'region_id']);
 
 ANALYZE m_dim_next.region; !5 Leave data in Postgres, run SQL queries @martin_loetzsch Tables as (intermediate) results of processing steps
  • 6. At the beginning of a data integration pipeline 
 DROP SCHEMA IF EXISTS m_dim_next CASCADE;
 CREATE SCHEMA m_dim_next;
 
 DROP SCHEMA IF EXISTS m_tmp CASCADE; CREATE SCHEMA m_tmp;
 
 
 Do all transformations 
 CREATE TABLE m_tmp.foo
 AS SELECT n FROM generate_series(0, 10000) n;
 CREATE TABLE m_dim_next.foo AS SELECT n, n + 1 FROM m_tmp.foo;
 -- ..
 At the end of pipeline, after checks 
 DROP SCHEMA IF EXISTS m_dim CASCADE; ALTER SCHEMA m_dim_next RENAME TO m_dim;
 
 
 
 
 
 Build m_dim_next schema while users are seeing m_dim - Data shown in frontend is always consistent - When ETL breaks, data just gets old
 Explicit operations determine structure and content of DWH - Very developer / git friendly - Requirement: Must be possible to rebuild DWH from sources - Add incremental loading / processing when there is a pain !6 Drop & rebuild @martin_loetzsch Don’t bother with migrations
  • 7. Renaming / dropping schemas requires exclusive locks 
 CREATE FUNCTION util.cancel_queries_on_schema(schema TEXT) RETURNS BOOLEAN AS $$ SELECT pg_cancel_backend(pid) FROM (SELECT DISTINCT pid FROM pg_locks JOIN pg_database ON database = pg_database.oid JOIN pg_class ON pg_class.oid = relation JOIN pg_namespace ON relnamespace = pg_namespace.oid WHERE datname = current_database() AND nspname = schema AND pid != pg_backend_pid()) t; $$ LANGUAGE SQL;
 
 
 First rename, then drop 
 CREATE FUNCTION util.replace_schema(schemaname TEXT, replace_with TEXT) RETURNS VOID AS $$ BEGIN PERFORM util.cancel_queries_on_schema(schemaname); IF EXISTS(SELECT * FROM information_schema.schemata s WHERE s.schema_name = schemaname) THEN EXECUTE 'ALTER SCHEMA ' || schemaname || ' RENAME TO ' || schemaname || '_old;'; END IF; EXECUTE 'ALTER SCHEMA ' || replace_with || ' RENAME TO ' || schemaname || ';';
 -- again, for good measure PERFORM util.cancel_queries_on_schema(schemaname);
 EXECUTE 'DROP SCHEMA IF EXISTS ' || schemaname || '_old CASCADE;'; END; $$ LANGUAGE 'plpgsql'
 
 
 
 
 Nice one-liner 
 SELECT util.replace_schema('m_dim', 'm_dim_next'); !7 Atomic & robust schema switches @martin_loetzsch Transactional DDL FTW
  • 8. Each pipeline maintains their own schemas - Never write to schemas of other pipelines - Only read from “public” schemas of other pipelines
 
 Project A schema naming conventions mp_dim: public tables of pipeline my-pipeline mp_dim_next: schema for building the next version of mp_dim mp_tmp: temporary tables that are not considered public mp_data: tables that are not always dropped (for incremental loading)
 
 Optional: copy “public” schema from ETL db to frontend db !8 Schemas & pipelines @martin_loetzsch Sometimes confusing for analysts
  • 9. 128 GB machine, highly overcommitted memory 
 max_connections = 200
 temp_buffers = 2GB work_mem = 2GB
 
 
 
 Postgres gracefully restarts when it can’t allocate more memory 
 LOG: unexpected EOF on client connection with an open transaction LOG: server process (PID 17697) was terminated by signal 9: Killed DETAIL: Failed process was running: SELECT m_tmp.index_tmp_touchpoint(); LOG: terminating any other active server processes WARNING: terminating connection because of crash of another server process DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. HINT: In a moment you should be able to reconnect to the database and repeat your command. DROP SCHEMA x CASCADE; 
 max_locks_per_transaction = 4000
 
 
 
 
 
 Speed up IO by crippling WAL 
 wal_level = minimal fsync = off synchronous_commit = off full_page_writes = off wal_buffers = -1 No protection against hardware failure! !9 Tuning Postgres for drop & rebuild @martin_loetzsch Not recommended for any other use case
  • 10. Running queries 
 cat query.sql 
 | PGOPTIONS=--client-min-messages=warning psql --no-psqlrc 
 --set ON_ERROR_STOP=on etl_db
 
 echo "SELECT s.my_function()" 
 | PGOPTIONS=--client-min-messages=warning psql --no-psqlrc 
 --set ON_ERROR_STOP=on etl_db
 
 
 
 
 Loading files 
 unzip -p data.csv 
 | python mapper_script.py 
 | PGOPTIONS=--client-min-messages=warning psql --no-psqlrc 
 --set ON_ERROR_STOP=on etl_db --command="COPY s.target_table FROM STDIN"
 Loading from other databases 
 cat source-query.sql 
 | mysql --skip-column-names source_db 
 | PGOPTIONS=--client-min-messages=warning psql --no-psqlrc 
 --set ON_ERROR_STOP=on etl_db --command="COPY s.target_table FROM STDIN WITH NULL AS 'NULL'"
 
 
 
 
 
 Don’t use ORMS / client libraries
 Use “/usr/bin/env bash -o pipefail” as shell !10 Main interface to Postgres: psql @martin_loetzsch Hassle-free & fast
  • 12. It’s easy to make mistakes during ETL 
 DROP SCHEMA IF EXISTS s CASCADE; CREATE SCHEMA s;
 
 CREATE TABLE s.city ( city_id SMALLINT, city_name TEXT, country_name TEXT );
 INSERT INTO s.city VALUES (1, 'Berlin', 'Germany'), (2, 'Budapest', 'Hungary');
 
 CREATE TABLE s.customer ( customer_id BIGINT, city_fk SMALLINT );
 INSERT INTO s.customer VALUES (1, 1), (1, 2), (2, 3);
 Customers per country? 
 SELECT country_name, count(*) AS number_of_customers FROM s.customer JOIN s.city 
 ON customer.city_fk = s.city.city_id GROUP BY country_name;
 
 Back up all assumptions about data by constraints 
 ALTER TABLE s.city ADD PRIMARY KEY (city_id); ALTER TABLE s.city ADD UNIQUE (city_name); ALTER TABLE s.city ADD UNIQUE (city_name, country_name); 
 ALTER TABLE s.customer ADD PRIMARY KEY (customer_id); [23505] ERROR: could not create unique index "customer_pkey" Detail: Key (customer_id)=(1) is duplicated.
 ALTER TABLE s.customer ADD FOREIGN KEY (city_fk) REFERENCES s.city (city_id); [23503] ERROR: insert or update on table "customer" violates foreign key constraint "customer_city_fk_fkey" Detail: Key (city_fk)=(3) is not present in table "city" !12 Referential consistency @martin_loetzsch Only very little overhead, will save your ass
  • 13. 10/18/2017 2017-10-18-dwh-schema-pav.svg customer customer_id first_order_fk favourite_product_fk lifetime_revenue product product_id revenue_last_6_months order order_id processed_order_id customer_fk product_fk revenue Never repeat “business logic” 
 SELECT sum(total_price) AS revenue FROM os_data.order WHERE status IN ('pending', 'accepted', 'completed',
 'proposal_for_change'); 
 
 SELECT CASE WHEN (status <> 'started' AND payment_status = 'authorised' AND order_type <> 'backend') THEN o.order_id END AS processed_order_fk FROM os_data.order;
 
 SELECT (last_status = 'pending') :: INTEGER AS is_unprocessed FROM os_data.order;
 
 
 
 
Refactor pipeline: create a separate task that computes everything we know about an order (see the sketch below). Usually difficult in real life
 
 
 
 
 
Load → preprocess → transform → flatten-fact !13 Computational consistency @martin_loetzsch Requires discipline

[Pipeline DAG: load-product, load-order, load-customer → preprocess-product, preprocess-order, preprocess-customer → transform-product, transform-order, transform-customer → flatten-product-fact, flatten-order-fact, flatten-customer-fact]
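A minimal sketch of such a preprocessing task, reusing the os_data.order columns from the snippets above (table and derived column names are illustrative, not the original implementation):

CREATE TABLE os_data.preprocessed_order AS
SELECT
  order_id,
  -- "processed" is defined exactly once, here
  CASE WHEN status <> 'started' AND payment_status = 'authorised' AND order_type <> 'backend'
       THEN order_id END AS processed_order_id,
  status IN ('pending', 'accepted', 'completed', 'proposal_for_change') AS is_revenue_relevant,
  total_price
FROM os_data.order;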
  • 14. CREATE FUNCTION m_tmp.normalize_utm_source(TEXT) RETURNS TEXT AS $$ SELECT CASE WHEN $1 LIKE '%.%' THEN lower($1) WHEN $1 = '(direct)' THEN 'Direct' WHEN $1 LIKE 'Untracked%' OR $1 LIKE '(%)' THEN $1 ELSE initcap($1) END; $$ LANGUAGE SQL IMMUTABLE;
CREATE FUNCTION util.norm_phone_number(phone_number TEXT) RETURNS TEXT AS $$
BEGIN
  phone_number := TRIM(phone_number);
  phone_number := regexp_replace(phone_number, '\(0\)', '');
  phone_number := regexp_replace(phone_number, '[^[:digit:]]', '', 'g');
  phone_number := regexp_replace(phone_number, '^(\+49|0049|49)', '0');
  phone_number := regexp_replace(phone_number, '^(00)', '');
  phone_number := COALESCE(phone_number, '');
  RETURN phone_number;
END;
$$ LANGUAGE PLPGSQL IMMUTABLE;
CREATE FUNCTION m_tmp.compute_ad_id(id BIGINT, api m_tmp.API) RETURNS BIGINT AS $$
-- creates a collision free ad id from an id in a source system
SELECT ((CASE api WHEN 'adwords' THEN 1 WHEN 'bing' THEN 2 WHEN 'criteo' THEN 3 WHEN 'facebook' THEN 4 WHEN 'backend' THEN 5 END) * 10 ^ 18) :: BIGINT + id
$$ LANGUAGE SQL IMMUTABLE;
 
 
CREATE FUNCTION pv.date_to_supplier_period_start(INTEGER) RETURNS INTEGER AS $$
-- this maps all dates to either an integer which is included
-- in lieferantenrabatt.period_start or
-- null (meaning we don't have a lieferantenrabatt for it)
SELECT CASE WHEN $1 >= 20170501 THEN 20170501
            WHEN $1 >= 20151231 THEN 20151231
            ELSE 20151231 END;
$$ LANGUAGE SQL IMMUTABLE; !14 When not possible: use functions @martin_loetzsch Almost no performance overhead
  • 15. Check for “lost” rows 
 SELECT util.assert_equal( 'The order items fact table should contain all order items', 'SELECT count(*) FROM os_dim.order_item', 'SELECT count(*) FROM os_dim.order_items_fact');
 
 
 
 Check consistency across cubes / domains 
 SELECT util.assert_almost_equal( 'The number of first orders should be the same in '
 || 'orders and marketing touchpoints cube', 'SELECT count(net_order_id) FROM os_dim.order WHERE _net_order_rank = 1;',
 'SELECT (SELECT sum(number_of_first_net_orders) FROM m_dim.acquisition_performance) / (SELECT count(*) FROM m_dim.performance_attribution_model)', 1.0 );
 Check completeness of source data 
 SELECT util.assert_not_found( 'Each adwords campaign must have the attribute "Channel"', 'SELECT DISTINCT campaign_name, account_name FROM aw_tmp.ad JOIN aw_dim.ad_performance ON ad_fk = ad_id WHERE attributes->>''Channel'' IS NULL AND impressions > 0 AND _date > now() - INTERVAL ''30 days''');
 
 Check correctness of redistribution transformations 
 SELECT util.assert_almost_equal_relative( 'The cost of non-converting touchpoints must match the' || 'redistributed customer acquisition and reactivation cost', 'SELECT sum(cost) FROM m_tmp.cost_of_non_converting_touchpoints;', 'SELECT (SELECT sum(cost_per_touchpoint * number_of_touchpoints) FROM m_tmp.redistributed_customer_acquisition_cost) + (SELECT sum(cost_per_touchpoint * number_of_touchpoints) FROM m_tmp.redistributed_customer_reactivation_cost);', 0.00001); !15 Data consistency checks @martin_loetzsch Makes changing things easy
  • 16. Execute queries and compare results
 CREATE FUNCTION util.assert(description TEXT, query TEXT) RETURNS BOOLEAN AS $$ DECLARE succeeded BOOLEAN; BEGIN EXECUTE query INTO succeeded; IF NOT succeeded THEN RAISE EXCEPTION 'assertion failed: # % # %', description, query; END IF; RETURN succeeded; END $$ LANGUAGE 'plpgsql';
 
 
 
 CREATE FUNCTION util.assert_almost_equal_relative( description TEXT, query1 TEXT, query2 TEXT, percentage DECIMAL) RETURNS BOOLEAN AS $$ DECLARE result1 NUMERIC; result2 NUMERIC; succeeded BOOLEAN; BEGIN EXECUTE query1 INTO result1; EXECUTE query2 INTO result2; EXECUTE 'SELECT abs(' || result2 || ' - ' || result1 || ') / ' || result1 || ' < ' || percentage INTO succeeded; IF NOT succeeded THEN RAISE WARNING '% assertion failed: abs(% - %) / % < % %: (%) %: (%)', description, result2, result1, result1, percentage, result1, query1, result2, query2; END IF; RETURN succeeded; END $$ LANGUAGE 'plpgsql'; !16 Consistency check functions @martin_loetzsch Also: assert_not_found, assert_equal_table, assert_smaller_than_or_equal
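The remaining helpers follow the same pattern; a possible sketch of assert_not_found (an assumption about its shape, not the original implementation):

CREATE FUNCTION util.assert_not_found(description TEXT, query TEXT) RETURNS BOOLEAN AS $$
DECLARE
  n BIGINT;
BEGIN
  -- the query is expected to return no rows (and must not end with a semicolon)
  EXECUTE 'SELECT count(*) FROM (' || query || ') t' INTO n;
  IF n > 0 THEN
    RAISE EXCEPTION 'assertion failed: # % # % returned % rows', description, query, n;
  END IF;
  RETURN true;
END
$$ LANGUAGE 'plpgsql';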
• 17. Yes, unit tests
SELECT util.assert_value_equal('test_german_number_with_country_prefix', util.norm_phone_number('00491234'), '01234');
SELECT util.assert_value_equal('test_german_number_not_to_be_confused_with_country_prefix', util.norm_phone_number('0491234'), '0491234');
SELECT util.assert_value_equal('test_non_german_number_with_plus', util.norm_phone_number('+44 1234'), '441234');
SELECT util.assert_value_equal('test_german_number_with_prefix_and_additional_zero', util.norm_phone_number('+49 (0)1234'), '01234');
SELECT util.assert_value_equal('test__trim', util.norm_phone_number(' 0491234 '), '0491234');
SELECT util.assert_value_equal('test_number_with_leading_wildcard_symbol', util.norm_phone_number('*+436504834933'), '436504834933');
SELECT util.assert_value_equal('test_NULL', util.norm_phone_number(NULL), '');
SELECT util.assert_value_equal('test_empty', util.norm_phone_number(''), '');
SELECT util.assert_value_equal('test_wildcard_only', util.norm_phone_number('*'), '');
SELECT util.assert_value_equal('test_foreign_number_with_two_leading_zeroes', util.norm_phone_number('*00436769553701'), '436769553701');
SELECT util.assert_value_equal('test_domestic_number_with_trailing_letters', util.norm_phone_number('017678402HORST'), '017678402');
SELECT util.assert_value_equal('test_domestic_number_with_leading_letters', util.norm_phone_number('HORST017678402'), '017678402');
SELECT util.assert_value_equal('test_domestic_number_with_letters_in_between', util.norm_phone_number('0H1O7R6S7T8402'), '017678402');
SELECT util.assert_value_equal('test_german_number_with_country_prefix_and_leading_letters', util.norm_phone_number('HORST00491234'), '01234');
SELECT util.assert_value_equal('test_german_number_not_to_be_confused_with_country_prefix_and_leading_letters', util.norm_phone_number('HORST0491234'), '0491234');
SELECT util.assert_value_equal('test_non_german_number_with_plus_and_leading_letters', util.norm_phone_number('HORST+44 1234'), '441234');
SELECT util.assert_value_equal('test_german_number_with_prefix_and_additional_zero_and_leading_letters', util.norm_phone_number('HORST+49 (0)1234'), '01234');
!17 Unit tests @martin_loetzsch People enter horrible telephone numbers into websites
  • 18. Contribution margin 3a SELECT order_item_id, ((((((COALESCE(item_net_price, 0)::REAL + COALESCE(net_shipping_revenue, 0)::REAL) - ((COALESCE(item_net_purchase_price, 0)::REAL + COALESCE(alcohol_tax, 0)::REAL) + COALESCE(import_tax, 0)::REAL)) - (COALESCE(net_fulfillment_costs, 0)::REAL + COALESCE(net_payment_costs, 0)::REAL)) - COALESCE(net_return_costs, 0)::REAL) - ((COALESCE(item_net_price, 0)::REAL + COALESCE(net_shipping_revenue, 0)::REAL) - ((((COALESCE(item_net_price, 0)::REAL + COALESCE(item_tax_amount, 0)::REAL) + COALESCE(gross_shipping_revenue, 0)::REAL) - COALESCE(voucher_gross_amount, 0)::REAL) * (1 - ((COALESCE(item_tax_amount, 0)::REAL + (COALESCE(gross_shipping_revenue, 0)::REAL - COALESCE(net_shipping_revenue, 0)::REAL)) / NULLIF(((COALESCE(item_net_price, 0)::REAL + COALESCE(item_tax_amount, 0)::REAL) + COALESCE(gross_shipping_revenue, 0)::REAL), 0)))))) - COALESCE(goodie_cost_per_item, 0)::REAL) :: DOUBLE PRECISION AS "Contribution margin 3a" FROM dim.sales_fact; Use schemas between reporting and database Mondrian LookerML your own Or: Pre-compute metrics in database !18 Semantic consistency @martin_loetzsch Changing the meaning of metrics across all dashboards needs to be easy
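One way to pre-compute metrics in the database is a view (or an extra pipeline step), so that every reporting tool reads the same definition; a simplified sketch using the column names above (the real definition would be the full formula shown, kept in exactly one place):

CREATE VIEW dim.sales_fact_metrics AS
SELECT
  order_item_id,
  -- simplified for illustration, not the actual contribution margin formula
  coalesce(item_net_price, 0) + coalesce(net_shipping_revenue, 0)
    - coalesce(item_net_purchase_price, 0) AS contribution_margin_1
FROM dim.sales_fact;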
  • 19. Divide & conquer strategies @martin_loetzsch !19
  • 20. Create two tables, fill with 10M values 
 DROP SCHEMA IF EXISTS s CASCADE; CREATE SCHEMA s; 
 
 CREATE TABLE s.table_1 ( user_id BIGINT, some_number INTEGER); 
 INSERT INTO s.table_1 SELECT n, n FROM generate_series(1, 10000000) n; 
 
 CREATE TABLE s.table_2 AS SELECT * FROM s.table_1;
 Join both tables 
 EXPLAIN ANALYZE SELECT * FROM s.table_1 JOIN s.table_2 USING (user_id);
 
 Let’s assume 18 seconds is slow 
 Merge Join (rows=10000000)
 Merge Cond: (table_1.user_id = table_2.user_id)
 -> Sort (rows=10000000)
 Sort Key: table_1.user_id
 Sort Method: external merge Disk: 215032kB
 -> Seq Scan on table_1 (rows=10000000)
 -> Materialize (rows=10000000)
 -> Sort (rows=10000000)
 Sort Key: table_2.user_id
 Sort Method: external merge Disk: 254144kB
 -> Seq Scan on table_2 (rows=10000000)
 Planning time: 0.700 ms
 Execution time: 18278.520 ms !20 Joining large tables can be slow @martin_loetzsch Think of: join all user touchpoints with all user orders
  • 21. Create table_1 with 5 partitions 
 DROP SCHEMA IF EXISTS s CASCADE; CREATE SCHEMA s; 
 CREATE TABLE s.table_1 ( user_id BIGINT NOT NULL, user_chunk SMALLINT NOT NULL, some_number INTEGER NOT NULL); 
 
 CREATE TABLE s.table_1_0 (CHECK (user_chunk = 0)) INHERITS (s.table_1); 
 CREATE TABLE s.table_1_1 (CHECK (user_chunk = 1)) INHERITS (s.table_1); 
 CREATE TABLE s.table_1_2 (CHECK (user_chunk = 2)) INHERITS (s.table_1); 
 CREATE TABLE s.table_1_3 (CHECK (user_chunk = 3)) INHERITS (s.table_1); 
 CREATE TABLE s.table_1_4 (CHECK (user_chunk = 4)) INHERITS (s.table_1);
 Insert directly into partition (don’t use triggers etc.) 
 INSERT INTO s.table_1_0 SELECT n, n % 5, n from generate_series(0, 10000000, 5) n; 
 Automate insertion 
CREATE FUNCTION s.insert_data(table_name TEXT, chunk SMALLINT) RETURNS VOID AS $$
BEGIN
  EXECUTE 'INSERT INTO ' || table_name || '_' || chunk || ' SELECT n, n % 5, n from generate_series(' || chunk || ', 10000000, 5) n';
  EXECUTE 'ANALYZE ' || table_name || '_' || chunk;
END;
$$ LANGUAGE plpgsql;
 Run in parallel (depends on ETL framework) 
 SELECT s.insert_data('s.table_1', 1 :: SMALLINT); SELECT s.insert_data('s.table_1', 2 :: SMALLINT); SELECT s.insert_data('s.table_1', 3 :: SMALLINT); SELECT s.insert_data('s.table_1', 4 :: SMALLINT); !21 Splitting data in chunks / partitions @martin_loetzsch user chunk = user_id % 5
  • 22. Build DDL statement and execute it 
 CREATE FUNCTION s.create_table_partitions(schemaname TEXT, tablename TEXT, key_column TEXT, keys TEXT[]) RETURNS VOID AS $$ DECLARE key TEXT; BEGIN FOREACH KEY IN ARRAY keys LOOP IF NOT EXISTS(SELECT 1 FROM information_schema.tables t WHERE t.table_schema = schemaname AND t.table_name = tablename || '_' || key) THEN EXECUTE 'CREATE TABLE ' || schemaname || '.' || tablename || '_' || key || ' ( CHECK (' || key_column || ' = ' || key || ') ) INHERITS (' || schemaname || '.' || tablename || ');'; END IF; END LOOP; END $$ LANGUAGE plpgsql;
 Create table_2 
 CREATE TABLE s.table_2 ( user_id BIGINT NOT NULL, user_chunk SMALLINT NOT NULL, some_number INTEGER NOT NULL ); 
 SELECT s.create_table_partitions( 's', 'table_2', 'user_chunk', (SELECT array(SELECT n :: TEXT FROM generate_series(0, 4) n)));
 
 
 
 
 Run in parallel (depends on ETL framework): 
 SELECT s.insert_data('s.table_2', 0 :: SMALLINT); SELECT s.insert_data('s.table_2', 1 :: SMALLINT); SELECT s.insert_data('s.table_2', 2 :: SMALLINT); SELECT s.insert_data('s.table_2', 3 :: SMALLINT); SELECT s.insert_data('s.table_2', 4 :: SMALLINT); !22 Automating partition management @martin_loetzsch Also great: pg_partman, Postgres 10 declarative partitioning
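For comparison, a sketch of what the same layout could look like with Postgres 10 declarative partitioning (mentioned above); the INHERITS-based helper is what the talk uses:

CREATE TABLE s.table_2 (
  user_id     BIGINT   NOT NULL,
  user_chunk  SMALLINT NOT NULL,
  some_number INTEGER  NOT NULL
) PARTITION BY LIST (user_chunk);

CREATE TABLE s.table_2_0 PARTITION OF s.table_2 FOR VALUES IN (0);
CREATE TABLE s.table_2_1 PARTITION OF s.table_2 FOR VALUES IN (1);
CREATE TABLE s.table_2_2 PARTITION OF s.table_2 FOR VALUES IN (2);
CREATE TABLE s.table_2_3 PARTITION OF s.table_2 FOR VALUES IN (3);
CREATE TABLE s.table_2_4 PARTITION OF s.table_2 FOR VALUES IN (4);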
  • 23. Join table_1 and table_2 for chunk 0 
 EXPLAIN ANALYZE SELECT user_id, 0 AS user_chunk, table_1.some_number + table_2.some_number AS sum FROM s.table_1 JOIN s.table_2 USING (user_id) WHERE table_1.user_chunk = 0 AND table_2.user_chunk = 0;
 Similar to original join (18 seconds), but almost 5x faster 
Merge Join
Merge Cond: (table_1.user_id = table_2.user_id)
-> Sort (rows=2000001)
Sort Key: table_1.user_id
Sort Method: external merge Disk: 43008kB
-> Append (rows=2000001)
-> Seq Scan on table_1 (rows=0)
Filter: (user_chunk = 0)
-> Seq Scan on table_1_0 (rows=2000001)
Filter: (user_chunk = 0)
-> Sort (rows=2000001)
Sort Key: table_2.user_id
Sort Method: external sort Disk: 58664kB
-> Append (rows=2000001)
-> Seq Scan on table_2 (rows=0)
Filter: (user_chunk = 0)
-> Seq Scan on table_2_0 (rows=2000001)
Filter: (user_chunk = 0)
Planning time: 5.171 ms
Execution time: 3901.282 ms !23 Querying chunked data @martin_loetzsch
  • 24. No chunk restriction on table_2 
 EXPLAIN ANALYZE SELECT user_id, 0 AS user_chunk, table_1.some_number + table_2.some_number AS sum FROM s.table_1 JOIN s.table_2 USING (user_id) WHERE table_1.user_chunk = 0;
 Append operator quite costly 
Merge Join (rows=2000001)
Merge Cond: (table_1.user_id = table_2.user_id)
-> Sort (rows=2000001)
Sort Key: table_1.user_id
Sort Method: external merge Disk: 43008kB
-> Append (rows=2000001)
-> Seq Scan on table_1 (rows=0)
Filter: (user_chunk = 0)
-> Seq Scan on table_1_0 (rows=2000001)
Filter: (user_chunk = 0)
-> Materialize (rows=10000001)
-> Sort (rows=10000001)
Sort Key: table_2.user_id
Sort Method: external merge Disk: 254144kB
-> Append (rows=10000001)
-> Seq Scan on table_2 (rows=0)
-> Seq Scan on table_2_0 (rows=2000001)
-> Seq Scan on table_2_1 (rows=2000000)
-> Seq Scan on table_2_2 (rows=2000000)
-> Seq Scan on table_2_3 (rows=2000000)
-> Seq Scan on table_2_4 (rows=2000000)
Planning time: 0.371 ms
Execution time: 11654.858 ms !24 Common mistake @martin_loetzsch Check for sequence scans on complete partitioned tables
  • 25. Create partitioned table for target of computation 
 CREATE TABLE s.table_3 ( user_id BIGINT NOT NULL, user_chunk SMALLINT NOT NULL, sum INTEGER NOT NULL);
 SELECT s.create_table_partitions( 's', 'table_3', 'user_chunk', (SELECT array(SELECT n :: TEXT FROM generate_series(0, 4) n)));
 A function for the computation 
CREATE FUNCTION s.join_table_1_and_table_2(chunk SMALLINT) RETURNS SETOF s.table_3 AS $$
BEGIN
  RETURN QUERY
    SELECT user_id, chunk, table_1.some_number + table_2.some_number AS sum
    FROM s.table_1 JOIN s.table_2 USING (user_id)
    WHERE table_1.user_chunk = chunk AND table_2.user_chunk = chunk;
END
$$ LANGUAGE plpgsql;

Insert from a separate function
 CREATE OR REPLACE FUNCTION s.insert_table_3(chunk INTEGER) RETURNS VOID AS $$ BEGIN EXECUTE 'INSERT INTO s.table_3_' || chunk || ' (SELECT * FROM s.join_table_1_and_table_2(' || chunk || '::SMALLINT))'; END $$ LANGUAGE plpgsql;
 
 
 
 
 
 Run in parallel (depends on ETL framework) 
 SELECT s.insert_table_3(0); SELECT s.insert_table_3(1); SELECT s.insert_table_3(2); SELECT s.insert_table_3(3); SELECT s.insert_table_3(4); !25 Using functions for chunked processing @martin_loetzsch Our best practices, depends on ETL framework
• 26. [Pipeline DAG: read ga session, map ga visitor, preprocess adwords ad, preprocess criteo campaign, preprocess facebook ad set, transform device, transform host, transform landing page, transform region, update campaign tree, preprocess touchpoint, preprocess ad, preprocess campaign tree, map manual cost, transform ad attribute, transform ad, transform acquisition performance, transform reactivation performance, count converting touchpoints, transform marketing cost, compute first touchpoint, create performance attribution model, read manual cost, flatten touchpoints fact, flatten marketing cost fact, compute redistributed customer acquisition cost, compute redistributed customer reactivation cost, collect non converting cost, compute touchpoint cost, transform touchpoint]

Choose chunks depending on things that are processed together
Use secondary indexes on tables that are processed in different chunks
Re-chunk on chunk “borders”
 
 Real world example 
CREATE FUNCTION m_tmp.insert_dim_touchpoint(visitor_chunk SMALLINT) RETURNS VOID AS $$
DECLARE day_chunk SMALLINT;
BEGIN
  FOR day_chunk IN (SELECT util.get_all_chunks()) LOOP
    EXECUTE 'INSERT INTO m_dim_next.touchpoint_' || visitor_chunk || ' (SELECT * FROM m_tmp.transform_touchpoint(' || visitor_chunk || '::SMALLINT, ' || day_chunk || '::SMALLINT))';
  END LOOP;
END
$$ LANGUAGE plpgsql;

!26 Chunking all the way down @martin_loetzsch Takes discipline

[Chunking dimensions in the diagram: day, day_chunk, visitor_chunk, attribution_model]
  • 27. Make changing chunk size easy 
 CREATE FUNCTION s.get_number_of_chunks() RETURNS SMALLINT AS $$ SELECT 5 :: SMALLINT; $$ LANGUAGE SQL IMMUTABLE; 
 CREATE FUNCTION s.get_all_chunks() RETURNS SETOF INTEGER AS $$ SELECT generate_series(0, s.get_number_of_chunks() - 1); $$ LANGUAGE SQL IMMUTABLE; 
 CREATE FUNCTION s.compute_chunk(x BIGINT) RETURNS SMALLINT AS $$ SELECT coalesce(abs(x) % s.get_number_of_chunks(), 0) :: SMALLINT; $$ LANGUAGE SQL IMMUTABLE; Inline chunk computation 
 DROP FUNCTION s.compute_chunk( BIGINT ); 
 DO $$ BEGIN EXECUTE ' CREATE FUNCTION s.compute_chunk(x BIGINT) RETURNS SMALLINT AS $_$ SELECT coalesce(abs(x) % ' || s.get_number_of_chunks() || ', 0)::SMALLINT; $_$ LANGUAGE SQL IMMUTABLE;'; END $$; !27 Configurable chunking @martin_loetzsch Integrate with ETL framework
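A usage sketch of compute_chunk when loading data, so that the chunk column always stays in sync with the function (the source table staging.users is hypothetical):

INSERT INTO s.table_1_0
SELECT user_id, s.compute_chunk(user_id), some_number
FROM staging.users
WHERE s.compute_chunk(user_id) = 0;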
  • 28. Becoming friends with the query planner @martin_loetzsch !28
  • 29. Join a table with 10M values with a table with 10K values 
 DROP SCHEMA IF EXISTS s CASCADE; CREATE SCHEMA s; 
 
 CREATE TABLE s.table_1 AS SELECT a FROM generate_series(1, 10000) a; 
 CREATE TABLE s.table_2 AS SELECT a FROM generate_series(1, 10000000) a; 
 CREATE INDEX table_2__a ON s.table_2 (a); 
 
 EXPLAIN ANALYZE SELECT * FROM s.table_2 JOIN s.table_1 USING (a);
 Index on table_2 is not used 
Merge Join (rows=573750000) (actual rows=10000)
Merge Cond: (table_1.a = table_2.a)
-> Sort (rows=11475) (actual rows=10000)
Sort Key: table_1.a
Sort Method: quicksort Memory: 853kB
-> Seq Scan on table_1 (rows=11475) (actual rows=10000)
-> Materialize (rows=10000000) (actual rows=10001)
-> Sort (rows=10000000) (actual rows=10001)
Sort Key: table_2.a
Sort Method: external merge Disk: 136816kB
-> Seq Scan on table_2 (rows=10000000) (actual rows=10000000)
Planning time: 2.173 ms
Execution time: 3071.127 ms
 
The query planner didn’t know that table_2 has many more distinct values than table_1.
Statistics about the cardinality and distribution of data are collected by the asynchronous autovacuum daemon. !29 Nondeterministic behavior? @martin_loetzsch Queries sometimes don’t terminate when run in ETL, but look fine when run separately
  • 30. Manually collect data statistics before using new tables 
 DROP SCHEMA IF EXISTS s CASCADE; CREATE SCHEMA s; 
 
 CREATE TABLE s.table_1 AS SELECT a FROM generate_series(1, 10000) a; 
 CREATE TABLE s.table_2 AS SELECT a FROM generate_series(1, 10000000) a; 
 CREATE INDEX table_2__a ON s.table_2 (a); 
 
 ANALYZE s.table_1; ANALYZE s.table_2; 
 
 EXPLAIN ANALYZE SELECT * FROM s.table_2 JOIN s.table_1 USING (a);
 > 300 x faster 
Merge Join (rows=10000) (actual rows=10000)
Merge Cond: (table_2.a = table_1.a)
-> Index Only Scan using table_2__a on table_2 (rows=9999985) (actual rows=10001)
Heap Fetches: 10001
-> Sort (rows=10000) (actual rows=10000)
Sort Key: table_1.a
Sort Method: quicksort Memory: 853kB
-> Seq Scan on table_1 (rows=10000) (rows=10000)
Planning time: 1.759 ms
Execution time: 9.287 ms !30 Analyze all the things @martin_loetzsch Always analyze newly created tables
Add to postgresql.conf

session_preload_libraries = 'auto_explain'
auto_explain.log_min_duration = 10
auto_explain.log_nested_statements = true
auto_explain.log_verbose = true
auto_explain.log_analyze = true
debug_pretty_print = on
log_lock_waits = on

Carefully manage log file size !31 Auto-explain everything @martin_loetzsch tail -f /path/to/postgres.log
  • 32. Columnar storage in PostgreSQL @martin_loetzsch !32
  • 34. Initialize extension 
 CREATE EXTENSION IF NOT EXISTS cstore_fdw; 
 DO $$ BEGIN IF NOT (SELECT exists(SELECT 1 FROM pg_foreign_server WHERE srvname = 'cstore_server')) THEN CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw; END IF; END $$;
 
 Create fact table 
 DROP SCHEMA IF EXISTS s CASCADE; CREATE SCHEMA s; 
 CREATE FOREIGN TABLE s.fact ( fact_id BIGSERIAL, a TEXT, b TEXT ) SERVER cstore_server OPTIONS (compression 'pglz'); Insert 10M rows 
 INSERT INTO s.fact SELECT n, ('a ' || 1 + (4 * random()) :: INTEGER) AS a, ('b ' || 1 + (4 * random()) :: INTEGER) AS b FROM generate_series(0, 10000000) n ORDER BY a, b;
 
 
 
 8% of original size 
SELECT pg_size_pretty(pg_table_size('s.fact'));
 37 MB !34 Save disk space with cstore tables @martin_loetzsch Optimized Row Columnar (ORC) format
  • 35. Typical OLAP query 
 EXPLAIN ANALYZE SELECT a, count(*) AS count FROM s.fact WHERE b = 'b 1' GROUP BY a;
 
5x faster than without cstore
HashAggregate (rows=5)
Group Key: a
-> Foreign Scan on fact (rows=1248701)
Filter: (b = 'b 1'::text)
Rows Removed by Filter: 41299
CStore File: /postgresql/9.6/cstore_fdw/836136/849220
CStore File Size: 39166772
Planning time: 10.057 ms
Execution time: 257.107 ms
 Highly selective query 
EXPLAIN ANALYZE SELECT count(*) AS count FROM s.fact WHERE a = 'a 1' AND b = 'b 1';
 
 
 Really fast 
Aggregate (rows=1)
-> Foreign Scan on fact (rows=156676)
Filter: ((a = 'a 1'::text) AND (b = 'b 1'::text))
Rows Removed by Filter: 13324
CStore File: /postgresql/9.6/cstore_fdw/836136/849220
CStore File Size: 39166772
Planning time: 5.492 ms
Execution time: 31.689 ms !35 Main operation: foreign scan @martin_loetzsch Aggregation pushdown coming soon
  • 36. Consider PostgreSQL for your next ETL project @martin_loetzsch !36
  • 37. Focus on the complexity of data rather than the complexity of technology @martin_loetzsch !37
  • 38. !38 We open sourced our BI infrastructure @martin_loetzsch MIT licensed, heavily relies on PostgreSQL
 • 39. Avoid click-tools: hard to debug, hard to change, hard to scale with team size / data complexity / data volume

Data pipelines as code: SQL files, python & shell scripts; structure & content of the data warehouse are the result of running code

Easy to debug & inspect: develop locally, test on staging system, then deploy to production

!39 Make changing and testing things easy @martin_loetzsch Apply standard software engineering best practices

[Tool choice by data volume: megabytes → plain scripts, petabytes → Apache Airflow, in between → Mara]
  • 40. Example pipeline 
pipeline = Pipeline(id='demo', description='A small pipeline ..')

pipeline.add(
    Task(id='ping_localhost', description='Pings localhost',
         commands=[RunBash('ping -c 3 localhost')]))

sub_pipeline = Pipeline(id='sub_pipeline', description='Pings ..')

for host in ['google', 'amazon', 'facebook']:
    sub_pipeline.add(
        Task(id=f'ping_{host}', description=f'Pings {host}',
             commands=[RunBash(f'ping -c 3 {host}.com')]))

sub_pipeline.add_dependency('ping_amazon', 'ping_facebook')

sub_pipeline.add(Task(id='ping_foo', description='Pings foo',
                      commands=[RunBash('ping foo')]),
                 upstreams=['ping_amazon'])

pipeline.add(sub_pipeline, upstreams=['ping_localhost'])

pipeline.add(Task(id='sleep', description='Sleeps for 2 seconds',
                  commands=[RunBash('sleep 2')]),
             upstreams=['sub_pipeline'])

!40 ETL pipelines as code @martin_loetzsch Pipeline = list of tasks with dependencies between them. Task = list of commands
  • 41. Execute query 
ExecuteSQL(sql_file_name="preprocess-ad.sql")

cat app/data_integration/pipelines/facebook/preprocess-ad.sql
| PGTZ=Europe/Berlin PGOPTIONS=--client-min-messages=warning psql --username=mloetzsch --host=localhost --echo-all --no-psqlrc --set ON_ERROR_STOP=on kfz_dwh_etl
 Read file 
ReadFile(file_name="country_iso_code.csv", compression=Compression.NONE,
         target_table="os_data.country_iso_code",
         mapper_script_file_name="read-country-iso-codes.py", delimiter_char=";")

cat "dwh-data/country_iso_code.csv"
| .venv/bin/python3.6 "app/data_integration/pipelines/load_data/read-country-iso-codes.py"
| PGTZ=Europe/Berlin PGOPTIONS=--client-min-messages=warning psql --username=mloetzsch --host=localhost --echo-all --no-psqlrc --set ON_ERROR_STOP=on kfz_dwh_etl --command="COPY os_data.country_iso_code FROM STDIN WITH CSV DELIMITER AS ';'"
 Copy from other databases 
Copy(sql_file_name="pdm/load-product.sql", source_db_alias="pdm", target_table="os_data.product", replace={"@@db@@": "K24Pdm", "@@dbschema@@": "ps", "@@client@@": "kfzteile24 GmbH"})
 cat app/data_integration/pipelines/load_data/pdm/load-product.sql | sed "s/@@db@@/K24Pdm/g;s/@@dbschema@@/ps/g;s/@@client@@/ kfzteile24 GmbH/g" | sed 's/$/$/g;s/$/$/g' | (cat && echo ';') | (cat && echo '; go') | sqsh -U ***** -P ******* -S ******* -D K24Pdm -m csv | PGTZ=Europe/Berlin PGOPTIONS=--client-min-messages=warning psql --username=mloetzsch --host=localhost --echo-all --no-psqlrc --set ON_ERROR_STOP=on kfz_dwh_etl --command="COPY os_data.product FROM STDIN WITH CSV HEADER" !41 Shell commands as interface to data & DBs @martin_loetzsch Nothing is faster than a unix pipe
  • 42. Runnable app Integrates PyPI project download stats with 
 Github repo events !42 Try it out: Python project stats data warehouse @martin_loetzsch https://github.com/mara/mara-example-project
  • 43. !43 Refer us a data person, earn 400€ @martin_loetzsch Also analysts, developers, product managers