Why PostgreSQL for Analytics Infrastructure (DW)?

Talk given at Grokking TechTalk 14 - Database Systems

Why start out with Postgres for a reporting/analytics database.

Huy Nguyen
CTO of Holistics.io - BI & Infrastructure SaaS.

Published in: Data & Analytics

  1. 1. Why PostgreSQL for Analytics Infrastructure (DW)?
     Huy Nguyen, CTO and Cofounder, Holistics.io
     Grokking TechTalk - Database Systems, Ho Chi Minh City, Aug 2016
  2. 2. About Me
     ● Cofounder
       ○ Data Reporting (BI) and Infrastructure SaaS
     ● Cofounder of Grokking Vietnam
       ○ Building a community of world-class engineers in Vietnam
     ● Previously
       ○ Growth Team at Facebook (US)
       ○ Built the data pipeline at Viki (Singapore)
  3. 3. Background: What is Analytics/DW?
  4. 4. A Typical Web Application
     Data-related business problems:
     • Daily/weekly registered users by platform and country?
     • How many video uploads do we have every day?
  6. 6. A Typical Data Pipeline
  7. 7. [Diagram: a typical data pipeline]
     Sources: CSVs / Excels / Google Sheets; operational data; event logs (behavioural data); production DBs (live databases, daily snapshot)
     Analytics Database (Data Warehouse): Import, Pre-aggregate, Modify / Transform
     Consumers: Reporting / BI, Reporting / Analysis, Data Science / ML
  8. 8. [Diagram: the same data pipeline as slide 7]
     What database should we pick?
  9. 9. Transactional Applications vs Analytics Applications
     Transactional applications
     ● Data: many single-row writes; current data, single source
     ● Queries: generated by user activities (10 to 1000 users); < 1s response time; short queries
     Analytics applications
     ● Data: few large batch imports; years of data, many sources
     ● Queries: generated by large reports (1 to 10 users); queries run for hours; long queries
     Ref: http://www.slideshare.net/PGExperts/really-big-elephants-postgresql-dw-15833438 (slide 5)
  10. 10. Complex Query...
      Ref: http://www.slideshare.net/PGExperts/really-big-elephants-postgresql-dw-15833438 (slide 8)
  11. 11. Why start with Postgres?
      1. Simple to Get Started
      2. Rich Features for Analytics
         – Data Pipeline (ETL)
         – Data Analysis
      3. Scale Up
      Data growth: (1) Start → (2) Grow → (3) Scale
  13. 13. 1. Simple to Get Started
      ● Data requests grow gradually as your company grows
      ● Business users care about results (not the backend)
      Postgres:
      ● Free (open-source)
      ● Easy to set up
      → Need something quick to start with and easy to fine-tune along the way
      1. Simple start 2. Rich features 3. Scale up
  15. 15. [Diagram: the same data pipeline as slide 7, labelled with two parts: Data Pipeline (ETL) and Data Analysis]
  16. 16. [Diagram: CSVs / Excels / Google Sheets, event logs (behavioural data) and production DBs (live databases) feed the Analytics Database (Data Warehouse), which holds many tables]
  17. 17. 2a - Data Pipeline (ETL) & Performance
      ● Managing Table Data: table partitioning
      ● Managing Disk Space: tablespace
      ● Write Performance: unlogged table
      ● Others: foreign data wrapper, point-in-time recovery
  19. 19. Managing Data Tables
      Analytics tables hold lots of data:
      pageviews (+100k records a day): date_d | country | user_id | browser | page_name | views
      ⇒ The table grows big quickly and becomes difficult to manage!
      Solution: split (partition) it into multiple tables: pageviews_2015_06, pageviews_2015_07, pageviews_2015_08, pageviews_2015_09
      Problem: difficult to query data across multiple months
  20. 20. Managing Data Tables: parent table
      pageviews_parent (parent table): pageviews_2015_06, pageviews_2015_07, pageviews_2015_08, pageviews_2015_09, …
      ALTER TABLE pageviews_2015_09 INHERIT pageviews_parent;
      ALTER TABLE pageviews_2015_09 ADD CONSTRAINT pageviews_2015_09_date_check
        CHECK (date_d >= '2015-09-01' AND date_d < '2015-10-01');
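      The parent-table setup on this slide can be sketched end to end as follows. This is a minimal sketch, not the speaker's exact schema: the column types are assumptions, and only the table and column names come from the slides. PostgreSQL 10 and later also offer declarative partitioning (`PARTITION BY RANGE`) as an alternative to inheritance.

      ```sql
      -- Parent table: defines the shared columns, normally holds no rows itself.
      -- Column types are assumed for illustration.
      CREATE TABLE pageviews_parent (
          date_d    DATE,
          country   TEXT,
          user_id   INT,
          browser   TEXT,
          page_name TEXT,
          views     INT
      );

      -- One child per month; the CHECK constraint tells the planner which dates it holds.
      CREATE TABLE pageviews_2015_09 (
          CHECK (date_d >= DATE '2015-09-01' AND date_d < DATE '2015-10-01')
      ) INHERITS (pageviews_parent);

      -- Loads go straight into the monthly child table.
      INSERT INTO pageviews_2015_09 VALUES
          (DATE '2015-09-03', 'VN', 1, 'chrome', 'home', 12);

      -- Reports query the parent. With constraint_exclusion = partition (the default),
      -- children whose CHECK constraint rules out the date range are skipped.
      SELECT date_d, SUM(views) AS views
      FROM pageviews_parent
      WHERE date_d >= DATE '2015-09-01' AND date_d < DATE '2015-09-08'
      GROUP BY date_d
      ORDER BY date_d;
      ```

      Querying the parent is what removes the "difficult to query across multiple months" problem: the monthly tables stay separate on disk but read as one table.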
  22. 22. Managing Disk Space
      The analytics DB holds lots of data, and disk space is limited:
      ● SSD: fast, expensive
      ● SATA: cheap, slow
      Data is accessed at different frequencies:
      ● Hot data
      ● Warm data
      ● Cold data
  23. 23. Managing Disk Space: tablespace
      Tablespace: defines where your tables are stored on disk
      CREATE TABLESPACE hot_data LOCATION '/disk0/ssd/';
      CREATE TABLESPACE warm_data LOCATION '/disk1/sata2/';
      -- beginning of the month
      CREATE TABLE pageviews_2016_08 (...) TABLESPACE hot_data;
      ALTER TABLE pageviews_2016_07 SET TABLESPACE warm_data;
  24. 24. Combining TABLESPACE and PARENT TABLE
      pageviews_parent (parent table): pageviews_2015_06, pageviews_2015_07, pageviews_2015_08, pageviews_2015_09, …
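      One way to combine the two features is a monthly rotation: the newest partition is created on the fast disk, and the previous month is moved to the cheaper one. The sketch below assumes the `hot_data` and `warm_data` tablespaces and the `pageviews_parent` table from the earlier slides already exist; the October partition is a hypothetical next month.

      ```sql
      -- New month: create the partition on SSD and attach it to the parent.
      CREATE TABLE pageviews_2015_10 (
          CHECK (date_d >= DATE '2015-10-01' AND date_d < DATE '2015-11-01')
      ) INHERITS (pageviews_parent) TABLESPACE hot_data;

      -- The previous month is read less often: move it to the cheaper disk.
      -- This rewrites the table and holds an exclusive lock while it runs,
      -- and it does not move the table's indexes (use ALTER INDEX for those).
      ALTER TABLE pageviews_2015_09 SET TABLESPACE warm_data;
      ```

      Because queries go through the parent table, moving a child between tablespaces does not change any report SQL.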
  26. 26. [Diagram: the same sources and tables as slide 16]
      Analytics tables can be rebuilt from source
  27. 27. Write Performance: unlogged table
      ● Transactional safety: every update is 2 writes:
        ○ Update the data inside the table
        ○ Write to the WAL (Write-Ahead Log)
      ● UNLOGGED TABLE
        ○ Skips the WAL
        ○ Improved write performance
      CREATE UNLOGGED TABLE daily_summary (...);
      INSERT INTO daily_summary …;
      http://pgsnaga.blogspot.com/2011/10/data-loading-into-unlogged-tables-and.html
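      A fuller version of the unlogged-table load might look like the sketch below. The columns of `daily_summary` are assumptions (the slide elides them), and the `users` table is the one defined later in the talk. The trade-off is that an unlogged table is emptied after a crash and is not replicated to standbys, which is acceptable only because these tables can be rebuilt from source.

      ```sql
      -- Rebuildable summary table: skip the WAL while loading.
      -- Columns are assumed for illustration.
      CREATE UNLOGGED TABLE daily_summary (
          date_d  DATE,
          signups BIGINT
      );

      -- Bulk load from the source table.
      INSERT INTO daily_summary
      SELECT created_at::date, COUNT(*)
      FROM users
      GROUP BY 1;

      -- Optional (PostgreSQL 9.5+): make the table crash-safe once the load is done.
      ALTER TABLE daily_summary SET LOGGED;
      ```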
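      The foreign data wrapper item has no example in the deck. As a rough sketch of how it could serve the daily-snapshot import, the block below uses the `postgres_fdw` extension; the host, database, credentials, schema and snapshot table names are all hypothetical, and `IMPORT FOREIGN SCHEMA` needs PostgreSQL 9.5 or later.

      ```sql
      CREATE EXTENSION IF NOT EXISTS postgres_fdw;

      -- Hypothetical connection details for the production database.
      CREATE SERVER production_db
          FOREIGN DATA WRAPPER postgres_fdw
          OPTIONS (host 'prod-db.example.internal', port '5432', dbname 'app');

      CREATE USER MAPPING FOR CURRENT_USER
          SERVER production_db
          OPTIONS (user 'readonly_user', password 'change-me');

      -- Expose the remote users table locally as production.users.
      CREATE SCHEMA production;
      IMPORT FOREIGN SCHEMA public LIMIT TO (users)
          FROM SERVER production_db INTO production;

      -- Daily snapshot: copy the remote rows into a local table.
      CREATE TABLE users_snapshot AS
      SELECT * FROM production.users;
      ```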
  29. 29. 2b - Data Analysis (writing SQL)
      ● Extract / transform
      ● Aggregate / summarize
      ● Statistical analysis
      [Diagram: Analytics Database (Data Warehouse) → Reporting / BI, Reporting / Analysis, Data Science / ML]
  30. 30. 2b - Data Analysis with Postgres
      ● SQL features
        ○ WITH clause
        ○ Window functions
        ○ Aggregation functions
        ○ Statistical functions
      ● Data structures
        ○ JSON / JSONB
        ○ Arrays
        ○ PostGIS (geo data)
        ○ Geometry (point, line, etc)
        ○ HyperLogLog (extension)
      ● PL/pgSQL
      ● Full-text search (n-gram)
      ● Performance:
        ○ Parallel queries (pg9.6)
        ○ Materialized views
        ○ BRIN index
      ● Others:
        ○ DISTINCT ON
        ○ VALUES
        ○ generate_series()
        ○ FULL OUTER JOIN support
        ○ Better EXPLAIN
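      Two of the smaller items in this list, `generate_series()` and `DISTINCT ON`, are easiest to appreciate with a query. The sketch below assumes the `users` table (`id`, `gender`, `created_at`) defined later in the talk; the date range is arbitrary.

      ```sql
      -- generate_series(): one row per day, so days without signups still appear with 0.
      SELECT d::date AS date_d, COUNT(u.id) AS daily_signups
      FROM generate_series(TIMESTAMP '2016-08-01',
                           TIMESTAMP '2016-08-07',
                           INTERVAL '1 day') AS d
      LEFT JOIN users u ON u.created_at::date = d::date
      GROUP BY 1
      ORDER BY 1;

      -- DISTINCT ON: keep only the first row per gender, here the earliest signup.
      SELECT DISTINCT ON (gender) gender, id, created_at
      FROM users
      ORDER BY gender, created_at;
      ```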
  31. 31. CTE - Problem with Nested Queries
      SELECT ...
      FROM (
        SELECT ...
        FROM t1
        JOIN (SELECT ... FROM ...) a ON (...)
      ) b
      JOIN (SELECT ... FROM ...) c ON (...)
      Nested queries a) are hard to read and b) cannot be reused
  32. 32. CTE - Common Table Expressions (WITH clause)
      WITH a AS (
        SELECT ... FROM ...
      ), b AS (
        SELECT ... FROM t1 JOIN a ON (...)
      ), c AS (
        SELECT ... FROM ...
      )
      SELECT ... FROM b JOIN c ON ...
      ● SQL's "private methods"
      ● A WITH query can be referenced multiple times
      ● Allows chaining instead of nesting
  33. 33. CTE (cont.)
      ● Recursive CTE
      ● Writeable CTE
      -- move data from a to b
      WITH deleted_rows AS (
        DELETE FROM a WHERE ... RETURNING *
      )
      INSERT INTO b SELECT * FROM deleted_rows;
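      The slide shows a writeable CTE but not a recursive one. As a small illustration (an example chosen here, not taken from the talk), the recursive CTE below walks month by month and builds the partition names used in the earlier slides, from `pageviews_2015_06` through `pageviews_2015_09`.

      ```sql
      WITH RECURSIVE months(month_start) AS (
          -- Anchor row: the first month.
          SELECT DATE '2015-06-01'
          UNION ALL
          -- Recursive step: add one month until September is reached.
          SELECT (month_start + INTERVAL '1 month')::date
          FROM months
          WHERE month_start < DATE '2015-09-01'
      )
      SELECT 'pageviews_' || to_char(month_start, 'YYYY_MM') AS partition_name
      FROM months;
      ```

      The same pattern (anchor row, `UNION ALL`, self-reference with a stopping condition) applies to walking hierarchies such as category trees.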
  34. 34. Limitation of GROUP BY aggregate
      SELECT gender, COUNT(1) AS signups FROM users GROUP BY 1
      ● GROUP BY aggregate: reduces a partition of data to 1 value
      What if we want to work through each row of each partition?
  35. 35. Window functions
      ● Window functions: compute over a moving frame within one partition of data
      ● Examples:
        ○ Calculate moving average
        ○ Cumulative sum
        ○ Ranking by partition
        ○ …
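      The talk gives examples for the cumulative sum and ranking; the moving average can be sketched in the same style. This assumes the `users` table with a `created_at` timestamp as defined in the following slides, and the column name `signups_7d_avg` is made up for illustration.

      ```sql
      SELECT
          created_at::date AS date_d,
          COUNT(1) AS daily_signups,
          -- Average over the current row and the 6 rows before it.
          AVG(COUNT(1)) OVER (
              ORDER BY created_at::date
              ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
          ) AS signups_7d_avg
      FROM users
      GROUP BY 1
      ORDER BY 1;
      ```

      Note that the frame counts rows, not calendar days: if some days have no signups at all, the window spans more than seven days unless the gaps are filled first.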
  36. 36. Example: Cumulative Sum
      CREATE TABLE users (
        id INT,
        gender VARCHAR(10),
        created_at TIMESTAMP
      );
      SELECT
        created_at::date AS date_d,
        COUNT(1) AS daily_signups,
        SUM(COUNT(1)) OVER (ORDER BY created_at::date) AS cumulative_signups
      FROM users U
      GROUP BY 1
      ORDER BY 1;
      | date_d     | daily_signups | cumulative_signups |
      | 2016-08-01 | 100           | 100                |
      | 2016-08-02 | 50            | 150                |
      | 2016-08-03 | 80            | 230                |
  37. 37. Example: Group by Gender and rank by signup time
      CREATE TABLE users (
        id INT,
        name VARCHAR,
        gender VARCHAR(10),
        created_at TIMESTAMP
      );
      SELECT
        gender,
        name,
        RANK() OVER (PARTITION BY gender ORDER BY created_at) AS signup_rnk
      FROM users U
      ORDER BY 1, 3;
      | gender | name  | signup_rnk |
      | male   | Hung  | 1          |
      | male   | Son   | 2          |
      | ...    |       |            |
      | female | Lan   | 1          |
      | female | Tuyet | 2          |
  38. 38. 2b - Data Analysis with Postgres (feature list repeated from slide 30)
      PostgreSQL is well suited for data analysis!
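      Two of the performance features named in the talk's feature list, materialized views and BRIN indexes, can be sketched as follows. The view and index names are hypothetical, and the tables are the `users` and `pageviews_2015_09` tables from earlier slides. Materialized views need PostgreSQL 9.3 or later, BRIN indexes 9.5 or later.

      ```sql
      -- Materialized view: store the result of an expensive aggregate query.
      CREATE MATERIALIZED VIEW daily_signups_mv AS
      SELECT created_at::date AS date_d, COUNT(*) AS signups
      FROM users
      GROUP BY 1;

      -- Re-run the stored query, for example after the nightly import.
      REFRESH MATERIALIZED VIEW daily_signups_mv;

      -- BRIN index: a very small index that suits large tables whose rows are
      -- stored roughly in date order, as append-only analytics tables usually are.
      CREATE INDEX pageviews_2015_09_date_brin
          ON pageviews_2015_09 USING brin (date_d);
      ```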
  41. 41. 3 - Scaling Up
      ● PostgreSQL downsides:
        ○ Optimized for transactional applications
        ○ Single-core query execution (before parallel queries in 9.6); row-based storage
      ● CitusDB Extension
        ○ Automated data sharding and parallelization
        ○ Columnar storage format (better storage and performance)
      ● Vertica (HP)
        ○ Columnar storage, parallel execution
        ○ Started by Michael Stonebraker (original author of Postgres)
      ● Amazon Redshift
        ○ Based on PostgreSQL 8.0.2, via ParAccel DB
        ○ Columnar storage and parallel execution
  42. 42. Other Proprietary DW Databases (Relational)
      ● Greenplum
      ● Teradata
      ● Infobright
      ● Google BigQuery
      ● Aster Data
      Related to Postgres:
      ● Paraccel (Postgres fork)
      ● Vertica (from Postgres author)
      ● CitusDB (Postgres extension)
      ● Amazon Redshift (from Paraccel)
  43. 43. Compare: Popular SQL Databases
      |                | PostgreSQL         | MySQL              | Oracle    | SQL Server |
      | License / Cost | Free / Open-source | Free / Open-source | Expensive | Expensive  |
      | DW features    | Strong             | Weak               | Strong    | Strong     |
  46. 46. Summary
      1. Simple to Get Started
      2. Rich Features for Analytics
         – Data Pipeline (ETL)
         – Data Analysis
      3. Easy to Scale Up
      Data growth: (1) Start → (2) Grow → (3) Scale
  47. 47. Summary (cont.)
      ● Why start with Postgres
      ● Scaling up to DW databases
      ● Comparison with other transactional DBs
      ● Not covered:
        ○ How to set up PostgreSQL for DW
        ○ Performance optimizations
        ○ Behavioural data: Hadoop, Spark, HDFS
  48. 48. Huy Nguyen huy@holistics.io
