ETL Confessions
PgConf US 2017
Corey Huinker
About Me
Database Programmer For 20 Years
1st Benchmark: TPC-C for AS/400
PostgreSQL, Oracle, DB2, SQLServer,
Sybase, Informix, SQLite, DB/400... and
mysql.
Dabbled in Improv and Roller Derby
About Moat
Advertising Analytics
Viewability, Reach, Fraud Detection
Tens of billions of ad events per day
Summary data imports ~500M rows/day
Hiring for offices in NYC, London, SF, LA,
Singapore, Sydney, Austin, Cincinnati,
Miami
Database Workloads - OLTP
● On-Line Transaction Processing
● also known as Short Request
● Very high concurrency
● Most queries are very small in results fetched and data updated
● New data is essentially random
● Locking issues and latency are paramount
Database Workloads - Data Warehouse
● Usually derivative data
○ From in-house OLTP databases
○ From other data sources
● Latency tolerance much higher
● Data imported in batches with something in common
○ data from specific time-frame
○ data from specific OLTP shards
○ externally sourced data
● Query throughput more important than any one query's individual
performance
● Data is stored in ways optimal for reading, not updating.
Extract, Transform, Load
● Extract
○ Get the data from the external sources
■ Other databases under our control
■ Data pulls from websites
■ Publicly available data
○ Maybe you control the format, maybe you don't
● Transform
○ Scan the data for correct formatting, valid values, etc
■ Discard? Fix?
○ Reformat the data to the shape of the table(s) you wish to load it into.
● Load
○ Get it in the Data Warehouse as fast as possible
Toolsets - External Programs
● Homegrown data file reading, one INSERT at a time.
○ Read data from external data source
○ organize that data into rows
○ insert, one row at a time
○ Didn't I write this once in high school?
○ Client libraries offer conveniences like execute_many()
■ Client libraries lie
■ They lie, and they loop
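A hedged illustration of that point: many client libraries' execute_many() just iterates, sending one statement per row, so the server sees something like the first block below rather than one bulk operation (table t and its rows are hypothetical).
-- what a looping execute_many() effectively sends: one INSERT per row
INSERT INTO t (id, val) VALUES (1, 'a');
INSERT INTO t (id, val) VALUES (2, 'b');
INSERT INTO t (id, val) VALUES (3, 'c');
-- versus a single bulk statement for the same rows
COPY t (id, val) FROM STDIN WITH (FORMAT CSV);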
Toolsets - Third Party Loader Programs
● Command line
○ pgloader, pg_bulkload on PgSQL
○ SQL*Loader (Oracle)
○ bcp (Sybase/SQLServer)
● GUI tools
○ Kettle (Pentaho)
○ SSIS (Microsoft)
○ DataStage (IBM)
○ Informatica (Sauron)
Third Party Loader Programs - Pros
● Custom-designed for many common, already-solved problems
○ Never parse a CSV again
● Usually aware of bulk loading facilities in the target database
○ The fewer target databases, the more likely they're optimized (beware
vendor bias)
● Simple things are simple
Third Party Loader Programs - Cons
● If you want all your app logic in one place, you're probably out of luck
● Non-simple things can be impossible
○ Custom code often falls back to single-row inserts
○ Plugins usually mean writing in the language the tool was written
in (Visual C++, Java) rather than in your own core competencies
● Graphical interfaces conceal application logic
○ un-grep-able
● "Code" often stored in XML/binary, resistant to source control
● Biz logic that requires existing database state requires a local copy of that
state
○ Re-inventing the LEFT OUTER JOIN
○ Fun with race conditions
Extract, Load, Transform
● Extract - Same as ETL
● Load
○ Make local work tables shaped to suit the data as-is
● Transform
○ Filter - remove rows/columns you know you don't want
○ Validate - test values for correctness
○ Classify - data might be destined for multiple tables, have key
linkages, etc
○ Encode - Date strings become dates, etc.
○ Dereference - Enumerations and "dimensional" data become
associated with existing primary key values or new ones are inserted
○ Insert transformed data into user-facing tables
○ Isn't this just ETL with the 'T' inside the DB? Yes (see the sketch below).
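A minimal sketch of that in-database Transform, under assumed names: a raw_events work table loaded as-is, a campaigns dimension table, and a user-facing events table are all hypothetical.
-- Filter/Validate, Encode, and Dereference in one pass over the work table
INSERT INTO events (event_date, campaign_id, views)
SELECT r.event_date::date,        -- Encode: date string becomes a date
       c.campaign_id,             -- Dereference: name resolves to an existing key
       r.views::bigint
FROM raw_events r
JOIN campaigns c ON c.campaign_name = r.campaign_name
WHERE r.views ~ '^[0-9]+$';       -- Validate: discard rows with malformed metrics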
ELT - Pros
● Transformation logic is usually written in SQL statements, a skill you
already have in-house.
● Easily worked into existing source control.
● Referencing existing data is trivial.
● Referencing existing data is transactional.
● You can reuse a lot of your existing data integrity logic.
ELT - Cons
● You're writing the data to the database twice.
○ Some of that data you don't even want
○ Additional disk space needed for transformation work...temporarily
○ Additional CPU burden during ETL
■ Not a big deal if you use read replicas and take the master out of
UI rotation.
● Some data validation may still depend on external factors
● If your data is in very large data files, you may have to split them up
● If your data is in a large number of small files, you may have to use
multiple workers and manage the workload yourself (one approach is sketched below)
○ That workload management is itself an OLTP-type task.
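A hedged sketch of one way to run that worker queue inside the database, assuming a hypothetical pending_files table and PostgreSQL 9.5+ for SKIP LOCKED:
-- each worker claims the next unclaimed file; SKIP LOCKED keeps
-- parallel workers from blocking on each other's claims
UPDATE pending_files
SET claimed_at = now()
WHERE file_id = (SELECT file_id
                 FROM pending_files
                 WHERE claimed_at IS NULL
                 ORDER BY file_id
                 LIMIT 1
                 FOR UPDATE SKIP LOCKED)
RETURNING file_id, file_path;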
ELT Case 1: COPY a Big File
● The file:
○ Disk size: 40MB compressed, 1.1GB uncompressed
○ Number of rows: 5,160,108, including header
○ Number of columns:
■ taxonomy columns (text): 8
■ metrics (bigint): ¯\_(ツ)_/¯
● Format changes over time
● Producers were not necessarily consistent
● Updated data files would follow the latest format, which might
conflict with the original format
Case 1: Confession 1: Format Discovery
● Dates and taxonomy (text) column names are known and generally fixed,
but don't count on it.
● Any columns with unfamiliar names are assumed to be metrics (bigint)
● COPY just the header row into a one column table and split the row
afterward.
CREATE TEMP TABLE header(header_string text);
COPY header FROM PROGRAM 'zcat filename.csv.gz | head -n 1';
● regexp_split_to_table() on the one-row table
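A hedged sketch of that split, using the one-row header table from above; WITH ORDINALITY (9.4+) preserves the column order:
SELECT t.colname, t.col_order
FROM header,
     regexp_split_to_table(header.header_string, ',')
       WITH ORDINALITY AS t(colname, col_order);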
Case 1: Confession 2: Format Discovery
Same as before, but use plpythonu
CREATE OR REPLACE FUNCTION get_header(file_name text)
RETURNS text[]
LANGUAGE PLPYTHONU STRICT SET SEARCH_PATH FROM CURRENT
AS $PYTHON$
import gzip
with gzip.GzipFile(file_name, 'rb') as g:
    header = g.readline()
if not header:
    return None
return header.rstrip().split(',')
$PYTHON$;
Case 1: Confession 2: Format Discovery
Another function uses that one to create a temp table
l_columns := get_header(l_temp_file);
SELECT string_agg(CASE
WHEN colname = 'load_date' THEN 'load_date date'
WHEN colname in ('taxo1', ..., 'taxo8')
THEN format('%s text', colname)
ELSE format('%I bigint',colname)
END,
', ' ORDER BY col_order)
FROM unnest(l_columns) WITH ORDINALITY AS c(colname,col_order)
INTO l_col_defs;
EXECUTE format('create temporary table work_table (%s)', l_col_defs);
Now the temp table is created in a single transaction with a format to match its
own header and existing business rules.
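A hedged continuation sketch: with work_table created, the same plpgsql block can load the file with a dynamic server-side COPY (zcat and l_temp_file follow the earlier slides; %L safely quotes the program string):
EXECUTE format(
    'COPY work_table FROM PROGRAM %L WITH (FORMAT CSV, HEADER)',
    'zcat ' || l_temp_file);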
ELT Case 2: COPY a Big File
● The test machine:
○ AWS EC2 m4.xlarge (4 cores, 16GB RAM)
○ PostgreSQL 10 pre-alpha!
● The production machine:
○ AWS EC2 i3.8xlarge (32 cores, 245GB RAM)
■ Approx 5000 jobs in an ETL cycle
■ Jobs need to play nice parallel-wise.
○ PostgreSQL 9.6
● AWS machines have a reputation for being I/O starved
Case 2: Load Times By Various Methods
● Simplify the test by knowing the format of the data file ahead of time.
○ Datafile: csv, 8 text columns, 123 bigints
○ 5.6M rows
Case 2: Load Method: pgloader
Create a control file to load the data into a regular table.
pgloader: an external program, controlled via command line switches or a
configuration file
load csv from '../d2.csv' ( col1, col2, ..., colN )
into postgresql://test_user:test@localhost/pgloader_test?destination
with truncate, skip header = 1, fields optionally enclosed by '"';
Case 2: Load Method: pgloader
Timing
Loading uncompressed .csv:
real 3m33.102s
pigz -d --stdout compressed.csv.gz | pgloader test_stdin.load
real 4m37.150s
Case 2: Load Times By pg_bulkload
Direct load of the uncompressed .csv
real 0m25.601s
pigz piped to pg_bulkload
real 0m36.924s
Case 2: COPY .CSV To Regular Table
COPY destination FROM '../d2.csv' WITH (FORMAT CSV, HEADER)
Time: 24124.291 ms (00:24.124)
pigz --stdout -d ../d2.csv.gz | time psql db_name -c "COPY destination FROM
STDIN WITH (FORMAT CSV, HEADER)"
Time: 24849.039 ms (00:24.849)
COPY destination FROM PROGRAM 'pigz --stdout -d ../d2.csv.gz' WITH (FORMAT
CSV, HEADER)
Time: 25357.763 ms (00:25.358)
Case 2: COPY .CSV To Unlogged Table
COPY dest_unlogged FROM '../d2.csv' WITH (FORMAT CSV, HEADER)
Time: 18714.313 ms (00:18.714)
pigz --stdout -d ../d2.csv.gz | time psql db_name -c "COPY dest_unlogged FROM
STDIN WITH (FORMAT CSV, HEADER)"
Time: 19509.254 ms (00:19.509)
COPY dest_unlogged FROM PROGRAM 'pigz --stdout -d ../d2.csv.gz' WITH
(FORMAT CSV, HEADER)
Time: 19264.937 ms (00:19.265)
Case 2: COPY .CSV To Unlogged Table (timing chart)
Case 2: Conclusion
● COPY FROM PROGRAM performs the same as external unix pipes
● pigz is only slightly better than gzip -d
● Decompressing .csv.gz adds only slight overhead over uncompressed
● pg_bulkload is ineffective in situations where a naive COPY will do
● pgloader consumes all available CPU and is still slower than all other
methods
Case 3: Avoid the Bounce Table
CREATE TEMP TABLE bounce_table (...);
COPY bounce_table FROM '...';
ANALYZE bounce_table;
INSERT INTO table_that_actually_matters (...)
SELECT SUM(something), ... FROM bounce_table;
● Read the file
● Write it to a temp table
● Read it right back out
● Discard the temp table
● Oid churn from the temp table
● It would be nice if we could just read the file as a table
Case 3: Temporary Foreign Table
CREATE SERVER filesystem FOREIGN DATA WRAPPER file_fdw;
CREATE FOREIGN TABLE pg_temp.straight_from_file (...)
SERVER filesystem;
INSERT INTO table_that_actually_matters(...)
SELECT SUM(something), ...
FROM pg_temp.straight_from_file;
● Should eliminate unnecessary writes
● Should halve the number of disk reads
● Allows filtering with a WHERE clause
● Still need to look out for invalid data.
● Still burns an oid
● No such thing as CREATE TEMPORARY FOREIGN TABLE, yet.
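For reference, a minimal sketch of a complete definition; the column list and path are hypothetical, while filename, format, and header are real file_fdw options:
CREATE FOREIGN TABLE pg_temp.straight_from_file (
    taxo1 text,     -- illustrative columns
    c1    bigint,
    c2    bigint
) SERVER filesystem
OPTIONS (filename '/path/to/d2.csv', format 'csv', header 'true');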
Case 3: Temporary Foreign Table
● SELECT SUM(c1), ..., SUM(c5) FROM pg_temp.straight_from_file;
Time: 9858.383 ms (00:09.858)
● SELECT SUM(c1), ..., SUM(c20) FROM pg_temp.straight_from_file;
Time: 11701.863 ms (00:11.702)
● COPY TO UNLOGGED TABLE
Time: 17477.338 ms (00:17.477)
● ANALYZE OF UNLOGGED TABLE
Time: 383.597 ms
● SUM 5 COLUMNS
Time: 1265.488 ms (00:01.265)
● SUM 20 COLUMNS
Time: 1992.116 ms (00:01.992)
Case 3: Temporary Foreign Table (timing chart)
Case 3: Conclusions
● file_fdw is fastest when most of the columns are ignored
● Timings get closer to even as the number of columns referenced approaches all of
the columns in the table
● Explicitly analyzing the foreign table has no discernible effect on SELECT performance
● Row estimate is done via file size
● COPY + ANALYZE wins if you need to read from that table at least twice
● If you read only once, you save ~50% of the time, plus the saved resources are
available for other programs
● file_fdw can use the PROGRAM option in v10 (sketch below)
● Be careful of bad row estimates when using PROGRAM
● COPY_SRF() (failed v10 patch)
○ COPY protocol in SRF form
○ does not burn an oid
○ materializes the entire file
○ Also subject to bad row estimates, but you could encapsulate it in a wrapper
function and alter the wrapper function's row estimate
○ Cannot copy from STDIN without a change to the wire protocol
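A hedged sketch of the v10 PROGRAM option mentioned above, with a hypothetical path; program replaces filename and runs a shell command whose stdout is read as the table:
CREATE FOREIGN TABLE pg_temp.straight_from_pipe (
    taxo1 text,     -- illustrative columns, as before
    c1    bigint
) SERVER filesystem
OPTIONS (program 'pigz --stdout -d /path/to/d2.csv.gz',
         format 'csv', header 'true');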
Case 4: Rollup Tables
● One large table with very specific 8-column grain
● Most queries will aggregate to their own specific grain
● Pre-aggregate tables so that specific queries can choose smaller tables
that more directly suit their needs.
● Oracle does this transparently with materialized views.
The Rollups
(a,b,c,d,e,f,g,h) (a,b,c,d)
(a,b,c, e,f,g,h) (a,b,c )
(a,b, e,f,g,h) (a,b )
(a, e,f,g,h) (a )
( e,f,g,h) ( )
Case 4: Dumb Inserts
INSERT INTO d9 SELECT a,b,c,d,e,f,g,h, SUM(m1), ..., SUM(mN)
FROM copied_table GROUP BY a,b,c,d,e,f,g,h;
INSERT INTO d8 SELECT a,b,c, e,f,g,h, SUM(m1), ..., SUM(mN)
FROM copied_table GROUP BY a,b,c, e,f,g,h;
INSERT INTO d7 SELECT a,b, e,f,g,h, SUM(m1), ..., SUM(mN)
FROM copied_table GROUP BY a,b, e,f,g,h;
...
INSERT INTO d1 SELECT a, SUM(m1), ..., SUM(mN)
FROM copied_table GROUP BY a;
INSERT INTO d0 SELECT SUM(m1), ..., SUM(mN)
FROM copied_table;
Case 4: Chained Inserts
INSERT INTO d9 SELECT a,b,c,d,e,f,g,h, SUM(m1), ..., SUM(mN)
FROM copied_table GROUP BY a,b,c,d,e,f,g,h;
INSERT INTO d8 SELECT a,b,c, e,f,g,h, SUM(m1), ..., SUM(mN)
FROM d9 GROUP BY a,b,c, e,f,g,h;
INSERT INTO d7 SELECT a,b, e,f,g,h, SUM(m1), ..., SUM(mN)
FROM d8 GROUP BY a,b, e,f,g,h;
...
INSERT INTO d1 SELECT a, SUM(m1), ..., SUM(mN)
FROM d2 GROUP BY a;
INSERT INTO d0 SELECT SUM(m1), ..., SUM(mN)
FROM d1;
Case 4: Chained Inserts in a CTE
WITH temp_d9 as (
INSERT INTO d9 SELECT a,b,c,d,e,f,g,h, SUM(m1), ..., SUM(mN)
FROM copied_table GROUP BY a,b,c,d,e,f,g,h RETURNING *),
temp_d8 as (
INSERT INTO d8 SELECT a,b,c, e,f,g,h, SUM(m1), ..., SUM(mN)
FROM temp_d9 GROUP BY a,b,c, e,f,g,h RETURNING *),
temp_d7 as (
INSERT INTO d7 SELECT a,b, e,f,g,h, SUM(m1), ..., SUM(mN)
FROM temp_d8 GROUP BY a,b, e,f,g,h RETURNING *),
...
temp_d1 as (
INSERT INTO d1 SELECT a, SUM(m1), ..., SUM(mN)
FROM temp_d2 GROUP BY a RETURNING *),
temp_d0 as (
INSERT INTO d0 SELECT SUM(m1), ..., SUM(mN)
FROM temp_d1 RETURNING *)
SELECT (SELECT COUNT(*) FROM temp_d9) as rows_d9,
(SELECT COUNT(*) FROM temp_d8) as rows_d8,
...
(SELECT COUNT(*) FROM temp_d0) as rows_d0;
Case 4: Dumb Rollup Table (1/2)
CREATE TEMP TABLE dumb_rollup (a,b,c,d,e,f,g,h,m1,...,mN,grouping_level) AS
SELECT a,b,c,d,e,f,g,h, SUM(m1), ..., SUM(mN),
CASE
WHEN grouping(d) = 0 and grouping(h) = 0 THEN 9
WHEN grouping(c) = 0 and grouping(h) = 0 THEN 8
...
WHEN grouping(c) = 0 THEN 3
WHEN grouping(b) = 0 THEN 2
WHEN grouping(a) = 0 THEN 1
ELSE 0
END
FROM copied_table
GROUP BY GROUPING SETS( (a, b, c, d, e, f, g, h),
(a, b, c, e, f, g, h),
...
(a, b),
(a),
() );
Case 4: Dumb Rollup Table (2/2)
INSERT INTO d9
SELECT a,b,c,d,e,f,g,h,m1,...,mN FROM dumb_rollup WHERE grouping_level = 9;
INSERT INTO d8
SELECT a,b,c, e,f,g,h,m1,...,mN FROM dumb_rollup WHERE grouping_level = 8;
...
INSERT INTO d1
SELECT a, m1,...,mN FROM dumb_rollup WHERE grouping_level = 1;
INSERT INTO d0
SELECT m1,...,mN FROM dumb_rollup WHERE grouping_level = 0;
Case 4: Chained CTE Rollup
WITH rollups (a,b,c,d,e,f,g,h,m1,...,mN,grouping_level) AS (
SELECT a,b,c,d,e,f,g,h, SUM(m1), ..., SUM(mN),
CASE ... END
FROM copied_table
GROUP BY GROUPING SETS( ... ) ),
temp_d9 as (
INSERT INTO d9 SELECT ... FROM rollups WHERE grouping_level = 9 RETURNING NULL),
temp_d8 as (
INSERT INTO d8 SELECT ... FROM rollups WHERE grouping_level = 8 RETURNING NULL),
...
temp_d1 as (
INSERT INTO d1 SELECT ... FROM rollups WHERE grouping_level = 1 RETURNING NULL),
temp_d0 as (
INSERT INTO d0 SELECT ... FROM rollups WHERE grouping_level = 0 RETURNING NULL)
SELECT (SELECT COUNT(*) FROM temp_d9) as rows_d9,
(SELECT COUNT(*) FROM temp_d8) as rows_d8,
...
(SELECT COUNT(*) FROM temp_d0) as rows_d0;
Case 4: Rollup to v10 Native Partition ATR
Instead of progressively removing columns from tables, leave them all the same shape but enforce NULL constraints
CREATE TABLE dest_all ( a ..., m1 ..., grouping_level integer )
PARTITION BY LIST (grouping_level);
CREATE TABLE dest_9 PARTITION OF dest_all FOR VALUES IN (9);
CREATE TABLE dest_8 PARTITION OF dest_all FOR VALUES IN (8);
...
CREATE TABLE dest_0 PARTITION OF dest_all FOR VALUES IN (0);
INSERT INTO dest_all
SELECT a,b,c,d,e,f,g,h, SUM(m1), ..., SUM(mN),
CASE ... END AS grouping_level
FROM copied_table
GROUP BY GROUPING SETS( ... );
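A hedged sketch of the NULL-shape enforcement the slide describes, using the assumed partition and column names; each narrower grain pins its dropped taxonomy columns to NULL:
-- partitions are plain tables, so ordinary CHECK constraints apply
ALTER TABLE dest_8 ADD CHECK (d IS NULL);                    -- (a,b,c,e,f,g,h) grain
ALTER TABLE dest_4 ADD CHECK (e IS NULL AND f IS NULL
                          AND g IS NULL AND h IS NULL);      -- (a,b,c,d) grain
ALTER TABLE dest_0 ADD CHECK (a IS NULL AND b IS NULL AND c IS NULL AND d IS NULL
                          AND e IS NULL AND f IS NULL AND g IS NULL AND h IS NULL);
-- grand-total grain: every taxonomy column is NULL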
Case 4: What will be fastest?
Case 4: What will be fastest? (results chart)
QUESTIONS