2. About Me
Database Programmer For 20 Years
1st Benchmark: TPC-C for AS/400
PostgreSQL, Oracle, DB2, SQLServer,
Sybase, Informix, SQLite, DB/400... and
mysql.
Dabbled in Improv and Roller Derby
3. About Moat
Advertising Analytics
Viewability, Reach, Fraud Detection
Tens of Billions of ad events per day
Summary data imports ~500M rows/day
Hiring for offices in NYC, London, SF, LA, Singapore, Sydney, Austin,
Cincinnati, Miami
4. Database Workloads - OLTP
● On-Line Transaction Processing
● Also known as Short Request
● Very high concurrency
● Most queries fetch small result sets and update few rows
● New data is essentially random
● Locking contention and latency are paramount
5. Database Workloads - Data Warehouse
● Usually derivative data
○ Either from in-house OLTP databases
○ or from other data sources
● Latency tolerance much higher
● Data imported in batches with something in common
○ data from specific time-frame
○ data from specific OLTP shards
○ externally sourced data
● Query throughput more important than any one query's individual
performance
● Data is stored in ways optimal for reading, not updating.
6. Extract, Transform, Load
● Extract
○ Get the data from the external sources
■ Other databases under our control
■ Data pulls from websites
■ Publicly available data
○ Maybe you control the format, maybe you don't
● Transform
○ Scan the data for correct formatting, valid values, etc
■ Discard? Fix?
○ Reformat the data to the shape of the table(s) you wish to load it into.
● Load
○ Get it in the Data Warehouse as fast as possible
7. Toolsets - External Programs
● Homegrown data file reading, one INSERT at a time.
○ Read data from external data source
○ organize that data into rows
○ insert, one row at a time
○ Didn't I write this once in high school?
○ Client libraries offer conveniences like execute_many()
■ Client libraries lie
■ They lie, and they loop
8. Toolsets - Third Party Loader Programs
● Command line
○ pgloader, pg_bulkload (PostgreSQL)
○ SQL*Loader (Oracle)
○ bcp (Sybase/SQLServer)
● GUI tools
○ Kettle (Pentaho)
○ SSIS (Microsoft)
○ DataStage (IBM)
○ Informatica (Sauron)
9. Third Party Loader Programs - Pros
● Purpose-built for many common, already-solved problems
○ Never parse a CSV again
● Usually aware of bulk loading facilities in the target database
○ The fewer target databases, the more likely they're optimized (beware
vendor bias)
● Simple things are simple
10. Third Party Loader Programs - Cons
● If you want all your app logic in one place, you're probably out of luck
● Non-simple things can be impossible
○ Custom code often back to single row inserts
○ plugins usually involve writing in the language that the tool was written
in (Visual C++, Java) rather than your own core competencies
● Graphical interfaces conceal application logic
○ un-grep-able
● "Code" often stored in XML/binary, resistant to source control
● Business logic that requires existing database state needs a local copy of
that state
○ Re-inventing the LEFT OUTER JOIN
○ Fun with race conditions
11. Extract, Load, Transform
● Extract - Same as ETL
● Load
○ Make local work tables shaped to suit the data as-is
● Transform
○ Filter - remove rows/columns you know you don't want
○ Validate - test values for correctness
○ Classify - data might be destined for multiple tables, have key
linkages, etc
○ Encode - Date strings become dates, etc.
○ Dereference - Enumerations and "dimensional" data become
associated with existing primary key values or new ones are inserted
○ Insert transformed data into user-facing tables
○ Isn't this just ETL with the 'T' inside the DB? Yes.
12. ELT - Pros
● Transformation logic is usually written in SQL statements, a skill you
already have in-house.
● Easily worked into existing source control.
● Referencing existing data is trivial.
● Referencing existing data is transactional.
● You can reuse a lot of your existing data integrity logic.
13. ELT - Cons
● You're writing the data to the database twice.
○ Some of that data you don't even want
○ Additional disk space needed for transformation work...temporarily
○ Additional CPU burden on the database during ELT
■ Not a big deal if you use read replicas and take the master out of
UI rotation.
● Some data validation may still depend on external factors
● If your data is in very large data files, you may have to split them up
● If your data is in a large number of small files, you may have to use
multiple workers and manage the workload yourself.
○ That workload management is itself an OLTP-type task.
14. ELT Case 1: COPY a Big File
● The file:
○ Disk size: 40MB compressed, 1.1GB uncompressed
○ Number of rows: 5160108, including header
○ Number of columns:
■ taxonomy columns (text): 8
■ metrics (bigint): ¯\_(ツ)_/¯
● Format changes over time
● Producers were not necessarily consistent
● Updated data files would follow the latest format, which might
conflict with the original format
15. Case 1: Confession 1: Format Discovery
● Dates and taxonomy (text) column names are known and generally fixed,
but don't count on it.
● Any columns with unfamiliar names are assumed to be metrics (bigint)
● COPY just the header row into a one-column table and split the row
afterward.
CREATE TEMP TABLE header(header_string text);
COPY header FROM PROGRAM 'zcat filename.csv.gz | head -n 1';
● regexp_split_to_table() on the one-row table
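A minimal sketch of that split, using the header table from the COPY above (aliases are mine):
SELECT t.colname, t.col_order
FROM header
CROSS JOIN LATERAL regexp_split_to_table(header_string, ',')
     WITH ORDINALITY AS t(colname, col_order);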
16. Case 1: Confession 2: Format Discovery
Same as before, but use plpythonu
CREATE OR REPLACE FUNCTION get_header(file_name text)
RETURNS text[]
LANGUAGE plpythonu STRICT SET search_path FROM CURRENT
AS $PYTHON$
import gzip
with gzip.GzipFile(file_name, 'rb') as g:
    header = g.readline()
if not header:
    return None
return header.rstrip().split(',')
$PYTHON$;
17. Case 1: Confession 2: Format Discovery
Another function uses that one to create a temp table:
l_columns := get_header(l_temp_file);
SELECT string_agg(CASE
         WHEN colname = 'load_date' THEN 'load_date date'
         WHEN colname IN ('taxo1', ..., 'taxo8')
           THEN format('%I text', colname)
         ELSE format('%I bigint', colname)
       END,
       ', ' ORDER BY col_order)
FROM unnest(l_columns) WITH ORDINALITY AS c(colname, col_order)
INTO l_col_defs;
EXECUTE format('create temporary table work_table (%s)', l_col_defs);
Now the temp table is created in a single transaction with a format to match its
own header and existing business rules.
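From there the load itself can run in the same transaction. A sketch, reusing the l_temp_file variable from above (and assuming plpgsql context):
EXECUTE format(
  $$COPY work_table FROM PROGRAM 'zcat %s' WITH (FORMAT CSV, HEADER)$$,
  l_temp_file);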
18. ELT Case 2: COPY a Big File
● The test machine:
○ AWS EC2 m4.xlarge (4 cores, 16GB RAM)
○ PostgreSQL 10 pre-alpha!
● The production machine:
○ AWS EC2 i3.8xlarge (32 cores, 245GB RAM)
■ Approx 5000 jobs in an ETL cycle
■ Jobs need to play nice parallel-wise.
○ PostgreSQL 9.6
● AWS machines have a reputation for being I/O starved
19. Case 2: Load Times By Various Methods
● Simplify the test by knowing the format of the data file ahead of time.
○ Datafile: csv, 8 text columns, 123 bigints
○ 5.6M rows
20. Case 2: Load Method: pgloader
Create a control file to load the data into a regular table.
pgloader: an external program, controlled via command line switches or a
configuration file
load csv from '../d2.csv' ( col1, col2, ..., colN )
into postgresql://test_user:test@localhost/pgloader_test?destination
with truncate, skip header = 1, fields optionally enclosed by '"';
21. Case 2: Load Method: pgloader
Timing
Loading uncompressed .csv:
real 3m33.102s
pigz -d --stdout compressed.csv.gz | pgloader test_stdin.load
real 4m37.150s
22. Case 2: Load Times By pg_bulkload
Direct load of the uncompressed .csv
real 0m25.601s
pigz piped to pg_bulkload
real 0m36.924s
23. Case 2: COPY .CSV To Regular Table
COPY destination FROM '../d2.csv' WITH (FORMAT CSV, HEADER)
Time: 24124.291 ms (00:24.124)
pigz --stdout -d ../d2.csv.gz | time psql db_name -c "COPY destination FROM
STDIN WITH (FORMAT CSV, HEADER)"
Time: 24849.039 ms (00:24.849)
COPY destination FROM PROGRAM 'pigz --stdout -d ../d2.csv.gz' WITH (FORMAT
CSV, HEADER)
Time: 25357.763 ms (00:25.358)
24. Case 2: COPY .CSV To Unlogged Table
COPY dest_unlogged FROM '../d2.csv' WITH (FORMAT CSV, HEADER)
Time: 18714.313 ms (00:18.714)
pigz --stdout -d ../d2.csv.gz | time psql db_name -c "COPY dest_unlogged FROM
STDIN WITH (FORMAT CSV, HEADER)"
Time: 19509.254 ms (00:19.509)
COPY dest_unlogged FROM PROGRAM 'pigz --stdout -d ../d2.csv.gz' WITH
(FORMAT CSV, HEADER)
Time: 19264.937 ms (00:19.265)
26. Case 2: Conclusion
● COPY FROM PROGRAM performs the same as external unix pipes
● pigz is only slightly faster than gzip -d
● Decompressing .csv.gz adds only slight overhead versus reading uncompressed
● pg_bulkload offers no advantage in situations where a naive COPY will do
● pgloader consumes all available CPU and is still slower than all the other
methods
27. Case 3: Avoid the Bounce Table
CREATE TEMP TABLE bounce_table (...);
COPY bounce_table FROM '...';
ANALYZE bounce_table;
INSERT INTO table_that_actually_matters (...)
SELECT SUM(something), ... FROM bounce_table;
● Reading the file
● Write to temp table
● Read it right back out
● Discard the temp table
● OID churn with the temp table
● It would be nice if we could just read the file as a table
28. Case 3: Temporary Foreign Table
CREATE SERVER filesystem FOREIGN DATA WRAPPER file_fdw;
CREATE FOREIGN TABLE pg_temp.straight_from_file (...)
SERVER filesystem;
INSERT INTO table_that_actually_matters (...)
SELECT SUM(something), ...
FROM pg_temp.straight_from_file;
● Should eliminate unnecessary writes
● Should halve the number of disk reads
● Allows for filtration using WHERE clause
● Still need to look out for invalid data.
● Still burns an OID
● No such thing as CREATE TEMPORARY FOREIGN TABLE, yet.
29. Case 3: Temporary Foreign Table
● SELECT SUM(c1), ..., SUM(c5) FROM pg_temp.straight_from_file;
Time: 9858.383 ms (00:09.858)
● SELECT SUM(c1), ..., SUM(c20) FROM pg_temp.straight_from_file;
Time: 11701.863 ms (00:11.702)
● COPY TO UNLOGGED TABLE
Time: 17477.338 ms (00:17.477)
● ANALYZE OF UNLOGGED TABLE
Time: 383.597 ms
● SUM 5 COLUMNS
Time: 1265.488 ms (00:01.265)
● SUM 20 COLUMNS
Time: 1992.116 ms (00:01.992)
31. Case 3: Conclusions
● file_fdw is fastest when most of the columns are ignored.
● Timings get closer to even as the number of columns referenced approaches
all of the columns in the table.
● Explicitly analyzing the foreign table has no discernible effect on SELECT
performance.
● Row estimates are derived from file size.
● COPY + ANALYZE wins if you need to read from that table at least twice.
● Reading only once saves ~50% of the time, and the saved resources are
available to other programs.
● file_fdw can use the PROGRAM option in v10 (sketch below)
● Be careful of bad row estimates when using PROGRAM
● COPY_SRF() (failed v10 patch)
○ COPY protocol in SRF form
○ Does not burn an OID
○ Materializes the entire file
○ Also subject to bad row estimates, but a wrapper function could
encapsulate and alter the row estimate
○ Cannot copy from STDIN without a change to the wire protocol
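A minimal sketch of file_fdw with the v10 PROGRAM option (server name and columns are hypothetical):
CREATE SERVER filesystem FOREIGN DATA WRAPPER file_fdw;
CREATE FOREIGN TABLE pg_temp.straight_from_pipe (taxo1 text, m1 bigint)
SERVER filesystem
OPTIONS (program 'zcat /path/to/filename.csv.gz', format 'csv', header 'true');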
32. Case 4: Rollup Tables
● One large table with very specific 8-column grain
● Most queries will aggregate to their own specific grain
● Pre-aggregate tables so that specific queries can choose smaller tables
that more directly suit their needs.
● Oracle does this transparently with materialized views.
The Rollups
(a,b,c,d,e,f,g,h)    (a,b,c,d)
(a,b,c,  e,f,g,h)    (a,b,c)
(a,b,    e,f,g,h)    (a,b)
(a,      e,f,g,h)    (a)
(        e,f,g,h)    ()
33. Case 4: Dumb Inserts
INSERT INTO d9 SELECT a,b,c,d,e,f,g,h, SUM(m1), ..., SUM(mN)
FROM copied_table GROUP BY a,b,c,d,e,f,g,h;
INSERT INTO d8 SELECT a,b,c, e,f,g,h, SUM(m1), ..., SUM(mN)
FROM copied_table GROUP BY a,b,c, e,f,g,h;
INSERT INTO d7 SELECT a,b, e,f,g,h, SUM(m1), ..., SUM(mN)
FROM copied_table GROUP BY a,b, e,f,g,h;
...
INSERT INTO d1 SELECT a, SUM(m1), ..., SUM(mN)
FROM copied_table GROUP BY a;
INSERT INTO d0 SELECT SUM(m1), ..., SUM(mN)
FROM copied_table;
34. Case 4: Chained Inserts
INSERT INTO d9 SELECT a,b,c,d,e,f,g,h, SUM(m1), ..., SUM(mN)
FROM copied_table GROUP BY a,b,c,d,e,f,g,h;
INSERT INTO d8 SELECT a,b,c, e,f,g,h, SUM(m1), ..., SUM(mN)
FROM d9 GROUP BY a,b,c, e,f,g,h;
INSERT INTO d7 SELECT a,b, e,f,g,h, SUM(m1), ..., SUM(mN)
FROM d8 GROUP BY a,b, e,f,g,h;
...
INSERT INTO d1 SELECT a, SUM(m1), ..., SUM(mN)
FROM d2 GROUP BY a;
INSERT INTO d0 SELECT SUM(m1), ..., SUM(mN)
FROM d1;
This chaining works because SUM() re-aggregates cleanly: summing the d9 sums
yields the same totals as summing the base table, and each step reads a much
smaller input than copied_table.
35. Case 4: Chained Inserts in a CTE
WITH temp_d9 as (
INSERT INTO d9 SELECT a,b,c,d,e,f,g,h, SUM(m1), ..., SUM(mN)
FROM copied_table GROUP BY a,b,c,d,e,f,g,h RETURNING *),
temp_d8 as (
INSERT INTO d8 SELECT a,b,c, e,f,g,h, SUM(m1), ..., SUM(mN)
FROM temp_d9 GROUP BY a,b,c, e,f,g,h RETURNING *),
temp_d7 as (
INSERT INTO d7 SELECT a,b, e,f,g,h, SUM(m1), ..., SUM(mN)
FROM temp_d8 GROUP BY a,b, e,f,g,h RETURNING *),
...
temp_d0 as (
INSERT INTO d0 SELECT SUM(m1), ..., SUM(mN)
FROM temp_d1 RETURNING *)
SELECT (SELECT COUNT(*) FROM temp_d9) as rows_d9,
(SELECT COUNT(*) FROM temp_d8) as rows_d8,
...
(SELECT COUNT(*) FROM temp_d0) as rows_d0;
36. Case 4: Dumb Rollup Table (1/2)
CREATE TEMP TABLE dumb_rollup (a,b,c,d,e,f,g,h,m1,...,mN,grouping_level) AS
SELECT a,b,c,d,e,f,g,h, SUM(m1), ..., SUM(mN),
CASE
WHEN grouping(d) = 0 and grouping(h) = 0 THEN 9
WHEN grouping(c) = 0 and grouping(h) = 0 THEN 8
...
WHEN grouping(c) = 0 THEN 3
WHEN grouping(b) = 0 THEN 2
WHEN grouping(a) = 0 THEN 1
ELSE 0
END
FROM copied_table
GROUP BY GROUPING SETS( (a, b, c, d, e, f, g, h),
(a, b, c, e, f, g, h),
...
(a, b),
(a),
() );
37. Case 4: Dumb Rollup Table (2/2)
INSERT INTO d9
SELECT a,b,c,d,e,f,g,h,m1,...,mN FROM dumb_rollup WHERE grouping_level = 9;
INSERT INTO d8
SELECT a,b,c, e,f,g,h,m1,...,mN FROM dumb_rollup WHERE grouping_level = 8;
...
INSERT INTO d1
SELECT a, m1,...,mN FROM dumb_rollup WHERE grouping_level = 1;
INSERT INTO d0
SELECT m1,...,mN FROM dumb_rollup WHERE grouping_level = 0;
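A tiny self-contained illustration of the GROUPING()-to-level mapping used above (values invented):
SELECT a, b, c, SUM(m) AS m,
       CASE WHEN GROUPING(c) = 0 THEN 3
            WHEN GROUPING(b) = 0 THEN 2
            WHEN GROUPING(a) = 0 THEN 1
            ELSE 0
       END AS grouping_level
FROM (VALUES ('x','y','z',1), ('x','y','w',2)) AS t(a,b,c,m)
GROUP BY GROUPING SETS ((a,b,c), (a,b), (a), ());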
38. Case 4: Chained CTE Rollup
WITH rollups (a,b,c,d,e,f,g,h,m1,...,mN,grouping_level) AS (
SELECT a,b,c,d,e,f,g,h, SUM(m1), ..., SUM(mN),
CASE ... END
FROM copied_table
GROUP BY GROUPING SETS( ... ) ),
temp_d9 as (
INSERT INTO d9 SELECT ... FROM rollups WHERE grouping_level = 9 RETURNING NULL),
temp_d8 as (
INSERT INTO d8 SELECT ... FROM rollups WHERE grouping_level = 8 RETURNING NULL),
...
temp_d1 as (
INSERT INTO d1 SELECT ... FROM rollups WHERE grouping_level = 1 RETURNING NULL),
temp_d0 as (
INSERT INTO d0 SELECT ... FROM rollups WHERE grouping_level = 0 RETURNING NULL)
SELECT (SELECT COUNT(*) FROM temp_d9) as rows_d9,
(SELECT COUNT(*) FROM temp_d8) as rows_d8,
...
(SELECT COUNT(*) FROM temp_d0) as rows_d0;
39. Case 4: Rollup to v10 Native Partition ATR
Instead of progressively removing columns from the tables, keep them all the
same shape and enforce NULL-ness with constraints (sketch after this slide).
CREATE TABLE dest_all ( a ..., m1 ..., grouping_level integer )
PARTITION BY LIST (grouping_level);
CREATE TABLE dest_9 PARTITION OF dest_all FOR VALUES IN (9);
CREATE TABLE dest_8 PARTITION OF dest_all FOR VALUES IN (8);
...
CREATE TABLE dest_0 PARTITION OF dest_all FOR VALUES IN (0);
INSERT INTO dest_all
SELECT a,b,c,d,e,f,g,h, SUM(m1), ..., SUM(mN),
CASE ... END AS grouping_level
FROM copied_table
GROUP BY GROUPING SETS( ... );
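A sketch of the NULL-enforcement constraints implied above, following the grain pattern from slide 32 (constraint bodies are assumptions):
-- dest_9 holds the full grain, so no grain column may be NULL there
ALTER TABLE dest_9 ADD CHECK (d IS NOT NULL AND h IS NOT NULL);
-- dest_8 drops d, dest_7 drops c and d, and so on down to dest_0
ALTER TABLE dest_8 ADD CHECK (d IS NULL);
ALTER TABLE dest_7 ADD CHECK (c IS NULL AND d IS NULL);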