Huy Nguyen
CTO, Cofounder - Holistics.io
Why PostgreSQL for Analytics Infrastructure (DW)?
Grokking TechTalk - Database Systems
Ho Chi Minh City - Aug 2016
About Me
● Cofounder of Holistics.io
○ Data Reporting (BI) and Infrastructure SaaS
● Cofounder of Grokking Vietnam
○ Building a community of world-class engineers in Vietnam
● Previously
○ Growth Team at Facebook (US)
○ Built the data pipeline at Viki (Singapore)
Background: What is Analytics/DW?
- A Typical Web Application
Data-related Business Problems:
• Daily/weekly registered users by platform and country?
• How many video uploads do we have every day?
A Typical Data Pipeline
[Diagram: Operational Data (production DBs / live databases, event logs with behavioural data, CSVs / Excels / Google Sheets) → Import (daily snapshot, pre-aggregate, modify / transform) → Data Warehouse (analytics database) → Reporting / BI, Analysis, Data Science / ML]
What database should we pick?
Transactional Applications vs Analytics Applications
Ref: http://www.slideshare.net/PGExperts/really-big-elephants-postgresql-dw-15833438 (slide 5)
Transactional applications
Data:
● Many single-row writes
● Current, single data
Queries:
● Generated by user activities; 10 to 1000 users
● < 1s response time
● Short queries
Analytics applications
Data:
● Few large batch imports
● Years of data, many sources
Queries:
● Generated by large reports; 1 to 10 users
● Queries run for hours
● Long queries
Ref: http://www.slideshare.net/PGExperts/really-big-elephants-postgresql-dw-15833438 (slide 8)
Complex Query...
Why start with Postgres?
1. Simple to Get Started
2. Rich Features for Analytics
– Data Pipeline (ETL)
– Data Analysis
3. Scale Up
[Diagram: data growth in three stages: (1) Start, (2) Grow, (3) Scale]
1. Simple to Get Started
● Data requests grow gradually as your company grows
● Business users care about results (not the backend)
→ Need something quick to start and easy to fine-tune along the way
Postgres:
● Free (open-source)
● Easy to set up
1. Simple start 2. Rich features 3. Scale up
[Diagram: the data pipeline shown earlier, annotated with two stages: Data Pipeline (ETL) and Data Analysis]
[Diagram: production DBs, event logs (behavioural data) and CSVs / Excels / Google Sheets loaded into many tables in the data warehouse (analytics database)]
2a. Data Pipeline (ETL) & Performance
● Managing Table Data: table partitioning
● Managing Disk Space: tablespace
● Write Performance: unlogged table
● Others: foreign data wrapper, point-in-time recovery
Managing Data Tables
Analytics tables hold lots of data
pageviews (+100k records a day)
date_d | country | user_id | browser | page_name | views
⇒ Table grows big quickly and becomes difficult to manage!
Solution: Split (partition) into multiple tables
pageviews_2015_06
pageviews_2015_07
pageviews_2015_08
pageviews_2015_09
Problem:
Difficult to query data across multiple months
Managing Data Tables: parent table
pageviews_parent (parent table)
pageviews_2015_06
pageviews_2015_07
pageviews_2015_08
pageviews_2015_09
…
ALTER TABLE pageviews_2015_09 INHERIT pageviews_parent;
ALTER TABLE pageviews_2015_09 ADD CHECK
  (date_d >= '2015-09-01'
   AND date_d < '2015-10-01');
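Because the same statements repeat for every month, they are usually generated rather than typed. A minimal Python sketch, assuming the `pageviews_parent` table and `date_d` column from the slide; the helper name `monthly_partition_ddl` and the `LIKE ... INCLUDING ALL` clause are illustrative assumptions, not something the talk prescribes:

```python
from datetime import date
from typing import List


def monthly_partition_ddl(parent: str, prefix: str, year: int, month: int) -> List[str]:
    """Format the statements that add one monthly child table to `parent`.

    Hypothetical helper: it only builds SQL text and never touches a database.
    """
    start = date(year, month, 1)
    # First day of the following month, used as the exclusive upper bound.
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    child = f"{prefix}_{year:04d}_{month:02d}"
    return [
        # Copy the parent's column definitions into the new child table.
        f"CREATE TABLE {child} (LIKE {parent} INCLUDING ALL);",
        # The CHECK constraint tells the planner which dates the child can hold.
        f"ALTER TABLE {child} ADD CHECK "
        f"(date_d >= '{start.isoformat()}' AND date_d < '{end.isoformat()}');",
        f"ALTER TABLE {child} INHERIT {parent};",
    ]


for statement in monthly_partition_ddl("pageviews_parent", "pageviews", 2015, 9):
    print(statement)
```

With PostgreSQL's default `constraint_exclusion = partition` setting, a query on the parent that filters on `date_d` can skip child tables whose CHECK constraint rules them out.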
Managing Disk Space
Analytics DB holds lots of data; disk space is limited
● SSD: fast, expensive
● SATA: cheap, slow
Data have different access frequencies
● Hot Data
● Warm Data
● Cold Data
Managing Disk Space: tablespace
Tablespace: defines where your tables are stored on disk
CREATE TABLESPACE hot_data LOCATION '/disk0/ssd/';
CREATE TABLESPACE warm_data LOCATION '/disk1/sata2/';
-- beginning of the month
CREATE TABLE pageviews_2016_08 (...) TABLESPACE hot_data;
ALTER TABLE pageviews_2016_07 SET TABLESPACE warm_data;
Combining TABLESPACE and PARENT TABLE
pageviews_parent (parent table)
pageviews_2015_06
pageviews_2015_07
pageviews_2015_08
pageviews_2015_09
…
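Combining the two features suggests a simple monthly routine: create the new month's child table on the fast tablespace, then move the previous month to the slower one. A sketch under the slides' assumptions (tablespaces `hot_data` and `warm_data`, tables named `pageviews_YYYY_MM`); the function `rotation_statements` is hypothetical and only formats SQL text:

```python
from datetime import date
from typing import List


def rotation_statements(today: date, parent: str = "pageviews_parent",
                        prefix: str = "pageviews") -> List[str]:
    """SQL to run at the beginning of the month that contains `today`."""
    # Previous month, handling the January -> December rollover.
    if today.month == 1:
        prev_year, prev_month = today.year - 1, 12
    else:
        prev_year, prev_month = today.year, today.month - 1
    current = f"{prefix}_{today.year:04d}_{today.month:02d}"
    previous = f"{prefix}_{prev_year:04d}_{prev_month:02d}"
    return [
        # The new partition starts on the fast (SSD) tablespace.
        f"CREATE TABLE {current} (LIKE {parent} INCLUDING ALL) TABLESPACE hot_data;",
        f"ALTER TABLE {current} INHERIT {parent};",
        # Last month's data is now "warm": move it to the cheaper disks.
        f"ALTER TABLE {previous} SET TABLESPACE warm_data;",
    ]


for statement in rotation_statements(date(2016, 8, 1)):
    print(statement)
```

Note that `ALTER TABLE ... SET TABLESPACE` physically moves the table's data files and locks the table while it runs, and that indexes stay where they are unless moved separately, so this kind of job is best scheduled for a quiet period.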
Analytics tables can be rebuilt from source
Write Performance: unlogged table
● Transactional Safety: Every update is 2 writes:
○ Update data inside table
○ Write WAL (Write Ahead Log)
● UNLOGGED TABLE
○ Skips the WAL
○ Improved write performance
○ Not crash-safe: the table is emptied after a crash, so use it only for data that can be rebuilt
CREATE UNLOGGED TABLE daily_summary (...);
INSERT INTO daily_summary …;
http://pgsnaga.blogspot.com/2011/10/data-loading-into-unlogged-tables-and.html
2b. Data Analysis (writing SQL)
● Extract / transform
● Aggregate / summarize
● Statistical analysis
[Diagram: Data Warehouse (analytics database) → Reporting / BI, Analysis, Data Science / ML]
2b. Data Analysis with Postgres
● SQL features
○ WITH clause
○ Window functions
○ Aggregation functions
○ Statistical functions
● Data structures
○ JSON / JSONB
○ Arrays
○ PostGIS (geo data)
○ Geometry (point, line, etc)
○ HyperLogLog (extension)
● PL/pgSQL
● Full-text search (n-gram)
● Performance:
○ Parallel queries (PostgreSQL 9.6)
○ Materialized views
○ BRIN index
● Others:
○ DISTINCT ON
○ VALUES
○ generate_series()
○ Support FULL OUTER JOIN
○ Better EXPLAIN
CTE - Problem with Nested Queries
SELECT ...
FROM (SELECT ...
      FROM t1
      JOIN (SELECT ... FROM ...) a
        ON (...)
     ) b
JOIN (SELECT ... FROM ...) c ON (...)
Nested queries:
a) are hard to read
b) cannot be reused
CTE - Common Table Expressions (WITH clause)
WITH a AS (
  SELECT ... FROM ...
), b AS (
  SELECT ...
  FROM t1 JOIN a ON (...)
), c AS (
  SELECT ... FROM ...
)
SELECT ... FROM b JOIN c ON ...
● SQL’s “private methods”
● A WITH view can be referenced multiple times
● Allows chaining instead of nesting
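To make the chaining concrete, here is a small runnable sketch. It uses SQLite (bundled with Python) purely as a stand-in for PostgreSQL, since this form of the WITH clause is the same in both; the `users` and `uploads` tables and their rows are made up for illustration:

```python
import sqlite3

# SQLite stands in for PostgreSQL; table names and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER, country TEXT);
    CREATE TABLE uploads (user_id INTEGER, title TEXT);
    INSERT INTO users VALUES (1, 'VN'), (2, 'VN'), (3, 'SG');
    INSERT INTO uploads VALUES (1, 'a'), (1, 'b'), (3, 'c');
""")

query = """
WITH uploads_per_user AS (          -- step 1: aggregate uploads once
    SELECT user_id, COUNT(*) AS n_uploads
    FROM uploads
    GROUP BY user_id
), per_country AS (                 -- step 2: builds on step 1 by name
    SELECT u.country,
           COUNT(*) AS n_users,
           COALESCE(SUM(p.n_uploads), 0) AS total_uploads
    FROM users u
    LEFT JOIN uploads_per_user p ON p.user_id = u.id
    GROUP BY u.country
)
SELECT country, n_users, total_uploads
FROM per_country
ORDER BY country
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Each named step reads top to bottom and can be inspected on its own by temporarily changing the final SELECT to read from it, which is harder to do with nested subqueries.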
CTE (cont.)
● Recursive CTE
● Writeable CTE
-- move data from A to B
WITH deleted_rows AS (
  DELETE FROM a WHERE ...
  RETURNING *
)
INSERT INTO b
SELECT * FROM deleted_rows;
Limitation of GROUP BY aggregate
SELECT
  gender,
  COUNT(1) AS signups
FROM users
GROUP BY 1
● GROUP BY aggregate: reduce a partition of data into 1 value
What if we want to work through each row of each partition?
Window functions
● Window functions: compute over a moving frame of rows within one partition
● Examples:
○ Calculate moving average
○ Cumulative sum
○ Ranking by partition
○ …
Example: Cumulative Sum
CREATE TABLE users (
  id INT,
  gender VARCHAR(10),
  created_at TIMESTAMP
);
SELECT
  created_at::date AS date_d,
  COUNT(1) AS daily_signups,
  SUM(COUNT(1)) OVER
    (ORDER BY created_at::date) AS cumulative_signups
FROM users U
GROUP BY 1
ORDER BY 1;
| date_d | daily_signups | cumulative_signups |
| 2016-08-01 | 100 | 100 |
| 2016-08-02 | 50 | 150 |
| 2016-08-03 | 80 | 230 |
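The cumulative sum can be tried without a PostgreSQL server. The sketch below is an approximation: it uses SQLite (3.25 or newer, for window functions) as a stand-in, replaces the `::date` cast with SQLite's `date()` function, and splits the slide's single query into a CTE for the daily counts followed by the window function. The rows are generated to match the three days in the slide's output:

```python
import sqlite3

# SQLite stands in for PostgreSQL; data reproduces the slide's three days.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, gender TEXT, created_at TEXT)")
signups = {"2016-08-01": 100, "2016-08-02": 50, "2016-08-03": 80}
records = [
    (None, None, f"{day} 12:00:00")
    for day, count in signups.items()
    for _ in range(count)
]
conn.executemany("INSERT INTO users VALUES (?, ?, ?)", records)

query = """
WITH daily AS (                       -- one row per signup day
    SELECT date(created_at) AS date_d, COUNT(*) AS daily_signups
    FROM users
    GROUP BY date(created_at)
)
SELECT date_d,
       daily_signups,
       -- running total: sums every row up to and including the current day
       SUM(daily_signups) OVER (ORDER BY date_d) AS cumulative_signups
FROM daily
ORDER BY date_d
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Unlike GROUP BY, the window function keeps one output row per day while still looking at the earlier rows.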
Example: Partition by gender and rank by signup time
CREATE TABLE users (
  id INT,
  name VARCHAR,
  gender VARCHAR(10),
  created_at TIMESTAMP
);
SELECT
  gender,
  name,
  RANK() OVER (PARTITION BY gender
    ORDER BY created_at) AS signup_rnk
FROM users U ORDER BY 1, 3;
| gender | name | signup_rnk |
| female | Lan | 1 |
| female | Tuyet | 2 |
| ... |
| male | Hung | 1 |
| male | Son | 2 |
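A runnable version of the ranking example, again using SQLite (3.25+) only as a stand-in for PostgreSQL. The names come from the slide; the signup timestamps are invented so that the ranks match:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER, name TEXT, gender TEXT, created_at TEXT)"
)
# Hypothetical signup times, chosen so each person gets the slide's rank.
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?, ?)",
    [
        (1, "Hung", "male", "2016-08-01 09:00:00"),
        (2, "Lan", "female", "2016-08-01 10:00:00"),
        (3, "Son", "male", "2016-08-02 08:00:00"),
        (4, "Tuyet", "female", "2016-08-03 11:00:00"),
    ],
)

query = """
SELECT gender,
       name,
       -- ranking restarts at 1 inside each gender partition
       RANK() OVER (PARTITION BY gender ORDER BY created_at) AS signup_rnk
FROM users
ORDER BY gender, signup_rnk
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Wrapping this query in an outer SELECT with `WHERE signup_rnk = 1` is a common way to pick the first row of each partition.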
PostgreSQL is well suited for data analysis!
3. Scaling Up
● PostgreSQL downsides:
○ Optimized for transactional applications
○ Single-core execution per query (before parallel queries in 9.6); row-based storage
● CitusDB Extension
○ Automated data sharding and parallelization
○ Columnar storage format (better storage and performance)
● Vertica (HP)
○ Columnar storage, parallel execution
○ Started by Michael Stonebraker (original author of Postgres)
● Amazon Redshift
○ Built on ParAccel DB; based on PostgreSQL 8.0.2
○ Columnar storage and parallel execution
Other Proprietary DW Databases (Relational)
● Greenplum
● Teradata
● Infobright
● Google BigQuery
● Aster Data
Related to Postgres:
● ParAccel (Postgres fork)
● Vertica (from Postgres author)
● CitusDB (Postgres extension)
● Amazon Redshift (from ParAccel)
Compare: Popular SQL Databases
| | PostgreSQL | MySQL | Oracle | SQL Server |
| License / Cost | Free / Open-source | Free / Open-source | Expensive | Expensive |
| DW features | Strong | Weak | Strong | Strong |
Summary
1. Simple to Get Started
2. Rich Features for Analytics
– Data Pipeline (ETL)
– Data Analysis
3. Easy to Scale Up
Summary (cont)
● Why start with Postgres
● Scaling up to DW databases
● Comparison with other transactional DBs
● Not covered:
○ How to set up PostgreSQL for DW
○ Performance optimizations
○ Behavioural data: Hadoop, Spark, HDFS
Huy Nguyen
huy@holistics.io

More Related Content

What's hot

Delta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDelta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDatabricks
 
Streaming SQL with Apache Calcite
Streaming SQL with Apache CalciteStreaming SQL with Apache Calcite
Streaming SQL with Apache CalciteJulian Hyde
 
Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Julian Hyde
 
Redshift performance tuning
Redshift performance tuningRedshift performance tuning
Redshift performance tuningCarlos del Cacho
 
SQL Server In-Memory OLTP: What Every SQL Professional Should Know
SQL Server In-Memory OLTP: What Every SQL Professional Should KnowSQL Server In-Memory OLTP: What Every SQL Professional Should Know
SQL Server In-Memory OLTP: What Every SQL Professional Should KnowBob Ward
 
Common Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta LakehouseCommon Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta LakehouseDatabricks
 
Major features postgres 11
Major features postgres 11Major features postgres 11
Major features postgres 11EDB
 
Inside SQL Server In-Memory OLTP
Inside SQL Server In-Memory OLTPInside SQL Server In-Memory OLTP
Inside SQL Server In-Memory OLTPBob Ward
 
Big data should be simple
Big data should be simpleBig data should be simple
Big data should be simpleDori Waldman
 
PostgreSQL Write-Ahead Log (Heikki Linnakangas)
PostgreSQL Write-Ahead Log (Heikki Linnakangas) PostgreSQL Write-Ahead Log (Heikki Linnakangas)
PostgreSQL Write-Ahead Log (Heikki Linnakangas) Ontico
 
PostgreSQL WAL for DBAs
PostgreSQL WAL for DBAs PostgreSQL WAL for DBAs
PostgreSQL WAL for DBAs PGConf APAC
 
The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemDatabricks
 
Hadoop databases for oracle DBAs
Hadoop databases for oracle DBAsHadoop databases for oracle DBAs
Hadoop databases for oracle DBAsMaxym Kharchenko
 
Json in Postgres - the Roadmap
 Json in Postgres - the Roadmap Json in Postgres - the Roadmap
Json in Postgres - the RoadmapEDB
 
InfluxDB IOx Tech Talks: Query Processing in InfluxDB IOx
InfluxDB IOx Tech Talks: Query Processing in InfluxDB IOxInfluxDB IOx Tech Talks: Query Processing in InfluxDB IOx
InfluxDB IOx Tech Talks: Query Processing in InfluxDB IOxInfluxData
 
Migration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander ZaitsevMigration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander ZaitsevAltinity Ltd
 
Ten tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache HiveTen tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache HiveWill Du
 
PGConf.ASIA 2019 Bali - AppOS: PostgreSQL Extension for Scalable File I/O - K...
PGConf.ASIA 2019 Bali - AppOS: PostgreSQL Extension for Scalable File I/O - K...PGConf.ASIA 2019 Bali - AppOS: PostgreSQL Extension for Scalable File I/O - K...
PGConf.ASIA 2019 Bali - AppOS: PostgreSQL Extension for Scalable File I/O - K...Equnix Business Solutions
 
Faceted search with Oracle InMemory option
Faceted search with Oracle InMemory optionFaceted search with Oracle InMemory option
Faceted search with Oracle InMemory optionAlexander Tokarev
 
Hive Evolution: ApacheCon NA 2010
Hive Evolution:  ApacheCon NA 2010Hive Evolution:  ApacheCon NA 2010
Hive Evolution: ApacheCon NA 2010John Sichi
 

What's hot (20)

Delta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDelta Lake: Optimizing Merge
Delta Lake: Optimizing Merge
 
Streaming SQL with Apache Calcite
Streaming SQL with Apache CalciteStreaming SQL with Apache Calcite
Streaming SQL with Apache Calcite
 
Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14
 
Redshift performance tuning
Redshift performance tuningRedshift performance tuning
Redshift performance tuning
 
SQL Server In-Memory OLTP: What Every SQL Professional Should Know
SQL Server In-Memory OLTP: What Every SQL Professional Should KnowSQL Server In-Memory OLTP: What Every SQL Professional Should Know
SQL Server In-Memory OLTP: What Every SQL Professional Should Know
 
Common Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta LakehouseCommon Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta Lakehouse
 
Major features postgres 11
Major features postgres 11Major features postgres 11
Major features postgres 11
 
Inside SQL Server In-Memory OLTP
Inside SQL Server In-Memory OLTPInside SQL Server In-Memory OLTP
Inside SQL Server In-Memory OLTP
 
Big data should be simple
Big data should be simpleBig data should be simple
Big data should be simple
 
PostgreSQL Write-Ahead Log (Heikki Linnakangas)
PostgreSQL Write-Ahead Log (Heikki Linnakangas) PostgreSQL Write-Ahead Log (Heikki Linnakangas)
PostgreSQL Write-Ahead Log (Heikki Linnakangas)
 
PostgreSQL WAL for DBAs
PostgreSQL WAL for DBAs PostgreSQL WAL for DBAs
PostgreSQL WAL for DBAs
 
The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format Ecosystem
 
Hadoop databases for oracle DBAs
Hadoop databases for oracle DBAsHadoop databases for oracle DBAs
Hadoop databases for oracle DBAs
 
Json in Postgres - the Roadmap
 Json in Postgres - the Roadmap Json in Postgres - the Roadmap
Json in Postgres - the Roadmap
 
InfluxDB IOx Tech Talks: Query Processing in InfluxDB IOx
InfluxDB IOx Tech Talks: Query Processing in InfluxDB IOxInfluxDB IOx Tech Talks: Query Processing in InfluxDB IOx
InfluxDB IOx Tech Talks: Query Processing in InfluxDB IOx
 
Migration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander ZaitsevMigration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander Zaitsev
 
Ten tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache HiveTen tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache Hive
 
PGConf.ASIA 2019 Bali - AppOS: PostgreSQL Extension for Scalable File I/O - K...
PGConf.ASIA 2019 Bali - AppOS: PostgreSQL Extension for Scalable File I/O - K...PGConf.ASIA 2019 Bali - AppOS: PostgreSQL Extension for Scalable File I/O - K...
PGConf.ASIA 2019 Bali - AppOS: PostgreSQL Extension for Scalable File I/O - K...
 
Faceted search with Oracle InMemory option
Faceted search with Oracle InMemory optionFaceted search with Oracle InMemory option
Faceted search with Oracle InMemory option
 
Hive Evolution: ApacheCon NA 2010
Hive Evolution:  ApacheCon NA 2010Hive Evolution:  ApacheCon NA 2010
Hive Evolution: ApacheCon NA 2010
 

Viewers also liked

PostgreSQL 9.6 Performance-Scalability Improvements
PostgreSQL 9.6 Performance-Scalability ImprovementsPostgreSQL 9.6 Performance-Scalability Improvements
PostgreSQL 9.6 Performance-Scalability ImprovementsPGConf APAC
 
5 Postgres DBA Tips
5 Postgres DBA Tips5 Postgres DBA Tips
5 Postgres DBA TipsEDB
 
Making Postgres Central in Your Data Center
Making Postgres Central in Your Data CenterMaking Postgres Central in Your Data Center
Making Postgres Central in Your Data CenterEDB
 
Bn 1016 demo postgre sql-online-training
Bn 1016 demo  postgre sql-online-trainingBn 1016 demo  postgre sql-online-training
Bn 1016 demo postgre sql-online-trainingconline training
 
Supersized PostgreSQL: Postgres-XL for Scale-Out OLTP and Big Data Analytics
Supersized PostgreSQL: Postgres-XL for Scale-Out OLTP and Big Data AnalyticsSupersized PostgreSQL: Postgres-XL for Scale-Out OLTP and Big Data Analytics
Supersized PostgreSQL: Postgres-XL for Scale-Out OLTP and Big Data Analyticsmason_s
 
One Coin One Brick project
One Coin One Brick projectOne Coin One Brick project
One Coin One Brick projectHuy Nguyen
 
What's New in PostgreSQL 9.6
What's New in PostgreSQL 9.6What's New in PostgreSQL 9.6
What's New in PostgreSQL 9.6EDB
 

Viewers also liked (7)

PostgreSQL 9.6 Performance-Scalability Improvements
PostgreSQL 9.6 Performance-Scalability ImprovementsPostgreSQL 9.6 Performance-Scalability Improvements
PostgreSQL 9.6 Performance-Scalability Improvements
 
5 Postgres DBA Tips
5 Postgres DBA Tips5 Postgres DBA Tips
5 Postgres DBA Tips
 
Making Postgres Central in Your Data Center
Making Postgres Central in Your Data CenterMaking Postgres Central in Your Data Center
Making Postgres Central in Your Data Center
 
Bn 1016 demo postgre sql-online-training
Bn 1016 demo  postgre sql-online-trainingBn 1016 demo  postgre sql-online-training
Bn 1016 demo postgre sql-online-training
 
Supersized PostgreSQL: Postgres-XL for Scale-Out OLTP and Big Data Analytics
Supersized PostgreSQL: Postgres-XL for Scale-Out OLTP and Big Data AnalyticsSupersized PostgreSQL: Postgres-XL for Scale-Out OLTP and Big Data Analytics
Supersized PostgreSQL: Postgres-XL for Scale-Out OLTP and Big Data Analytics
 
One Coin One Brick project
One Coin One Brick projectOne Coin One Brick project
One Coin One Brick project
 
What's New in PostgreSQL 9.6
What's New in PostgreSQL 9.6What's New in PostgreSQL 9.6
What's New in PostgreSQL 9.6
 

Similar to Why PostgreSQL for Analytics Infrastructure (DW)?

Building Analytics Infrastructure for Growing Tech Companies
Building Analytics Infrastructure for Growing Tech CompaniesBuilding Analytics Infrastructure for Growing Tech Companies
Building Analytics Infrastructure for Growing Tech CompaniesHolistics Software
 
Run your queries 14X faster without any investment!
Run your queries 14X faster without any investment!Run your queries 14X faster without any investment!
Run your queries 14X faster without any investment!Knoldus Inc.
 
Ledingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartLedingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartMukesh Singh
 
Google BigQuery 101 & What’s New
Google BigQuery 101 & What’s NewGoogle BigQuery 101 & What’s New
Google BigQuery 101 & What’s NewDoiT International
 
NoSQL Solutions - a comparative study
NoSQL Solutions - a comparative studyNoSQL Solutions - a comparative study
NoSQL Solutions - a comparative studyGuillaume Lefranc
 
[JSS2015] In memory and operational analytics
[JSS2015] In memory and operational analytics[JSS2015] In memory and operational analytics
[JSS2015] In memory and operational analyticsGUSS
 
Jss 2015 in memory and operational analytics
Jss 2015   in memory and operational analyticsJss 2015   in memory and operational analytics
Jss 2015 in memory and operational analyticsDavid Barbarin
 
High Performance and Scalability Database Design
High Performance and Scalability Database DesignHigh Performance and Scalability Database Design
High Performance and Scalability Database DesignTung Ns
 
Data Analytics with DBMS
Data Analytics with DBMSData Analytics with DBMS
Data Analytics with DBMSGLC Networks
 
FinTech Data Challenges @ Nerdwallet
FinTech Data Challenges @ Nerdwallet FinTech Data Challenges @ Nerdwallet
FinTech Data Challenges @ Nerdwallet Vaibhav Jajoo
 
How to Develop and Operate Cloud First Data Platforms
How to Develop and Operate Cloud First Data PlatformsHow to Develop and Operate Cloud First Data Platforms
How to Develop and Operate Cloud First Data PlatformsAlluxio, Inc.
 
(DAT308) Yahoo! Analyzes Billions of Events a Day on Amazon Redshift
(DAT308) Yahoo! Analyzes Billions of Events a Day on Amazon Redshift(DAT308) Yahoo! Analyzes Billions of Events a Day on Amazon Redshift
(DAT308) Yahoo! Analyzes Billions of Events a Day on Amazon RedshiftAmazon Web Services
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift Amazon Web Services
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixC4Media
 
Query generation across multiple data stores [SBTB 2016]
Query generation across multiple data stores [SBTB 2016]Query generation across multiple data stores [SBTB 2016]
Query generation across multiple data stores [SBTB 2016]Hiral Patel
 
Gobblin @ NerdWallet (Nov 2015)
Gobblin @ NerdWallet (Nov 2015)Gobblin @ NerdWallet (Nov 2015)
Gobblin @ NerdWallet (Nov 2015)NerdWalletHQ
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dan Lynn
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftBest Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftAmazon Web Services
 
GeoKettle: A powerful open source spatial ETL tool
GeoKettle: A powerful open source spatial ETL toolGeoKettle: A powerful open source spatial ETL tool
GeoKettle: A powerful open source spatial ETL toolThierry Badard
 
Geek Sync I Polybase and Time Travel (Temporal Tables)
Geek Sync I Polybase and Time Travel (Temporal Tables)Geek Sync I Polybase and Time Travel (Temporal Tables)
Geek Sync I Polybase and Time Travel (Temporal Tables)IDERA Software
 

Similar to Why PostgreSQL for Analytics Infrastructure (DW)? (20)

Building Analytics Infrastructure for Growing Tech Companies
Building Analytics Infrastructure for Growing Tech CompaniesBuilding Analytics Infrastructure for Growing Tech Companies
Building Analytics Infrastructure for Growing Tech Companies
 
Run your queries 14X faster without any investment!
Run your queries 14X faster without any investment!Run your queries 14X faster without any investment!
Run your queries 14X faster without any investment!
 
Ledingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartLedingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @Lendingkart
 
Google BigQuery 101 & What’s New
Google BigQuery 101 & What’s NewGoogle BigQuery 101 & What’s New
Google BigQuery 101 & What’s New
 
NoSQL Solutions - a comparative study
NoSQL Solutions - a comparative studyNoSQL Solutions - a comparative study
NoSQL Solutions - a comparative study
 
[JSS2015] In memory and operational analytics
[JSS2015] In memory and operational analytics[JSS2015] In memory and operational analytics
[JSS2015] In memory and operational analytics
 
Jss 2015 in memory and operational analytics
Jss 2015   in memory and operational analyticsJss 2015   in memory and operational analytics
Jss 2015 in memory and operational analytics
 
High Performance and Scalability Database Design
High Performance and Scalability Database DesignHigh Performance and Scalability Database Design
High Performance and Scalability Database Design
 
Data Analytics with DBMS
Data Analytics with DBMSData Analytics with DBMS
Data Analytics with DBMS
 
FinTech Data Challenges @ Nerdwallet
FinTech Data Challenges @ Nerdwallet FinTech Data Challenges @ Nerdwallet
FinTech Data Challenges @ Nerdwallet
 
How to Develop and Operate Cloud First Data Platforms
How to Develop and Operate Cloud First Data PlatformsHow to Develop and Operate Cloud First Data Platforms
How to Develop and Operate Cloud First Data Platforms
 
(DAT308) Yahoo! Analyzes Billions of Events a Day on Amazon Redshift
(DAT308) Yahoo! Analyzes Billions of Events a Day on Amazon Redshift(DAT308) Yahoo! Analyzes Billions of Events a Day on Amazon Redshift
(DAT308) Yahoo! Analyzes Billions of Events a Day on Amazon Redshift
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFix
 
Query generation across multiple data stores [SBTB 2016]
Query generation across multiple data stores [SBTB 2016]Query generation across multiple data stores [SBTB 2016]
Query generation across multiple data stores [SBTB 2016]
 
Gobblin @ NerdWallet (Nov 2015)
Gobblin @ NerdWallet (Nov 2015)Gobblin @ NerdWallet (Nov 2015)
Gobblin @ NerdWallet (Nov 2015)
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftBest Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift
 
GeoKettle: A powerful open source spatial ETL tool
GeoKettle: A powerful open source spatial ETL toolGeoKettle: A powerful open source spatial ETL tool
GeoKettle: A powerful open source spatial ETL tool
 
Geek Sync I Polybase and Time Travel (Temporal Tables)
Geek Sync I Polybase and Time Travel (Temporal Tables)Geek Sync I Polybase and Time Travel (Temporal Tables)
Geek Sync I Polybase and Time Travel (Temporal Tables)
 

Why PostgreSQL for Analytics Infrastructure (DW)?

  • 1. Huy Nguyen CTO, Cofounder - Holistics.io Why PostgreSQL for Analytics Infrastructure (DW)? Grokking TechTalk - Database Systems Ho Chi Minh City - Aug 2016
  • 2. About Me ● Cofounder of Holistics.io ○ Data Reporting (BI) and Infrastructure SaaS ● Cofounder of Grokking Vietnam ○ Building a community of world-class engineers in Vietnam ● Previously ○ Growth Team at Facebook (US) ○ Built the Data Pipeline at Viki (Singapore)
  • 3. Background: What is Analytics/DW?
  • 4. - A Typical Web Application Data-related Business Problems: • Daily/weekly registered users by platform and country? • How many video uploads do we have every day?
  • 5. - A Typical Web Application • Daily/weekly registered users by platform and country? • How many video uploads do we have every day?
  • 6.
  • 7. A Typical Data Pipeline
  • 8. Analytics Database CSVs / Excels / Google Sheets Operational Data Data Warehouse Reporting / Analysis Data Science / ML Reporting / BI Event Logs (behavioural data) Live Databases Live Databases Production DBs Daily Snapshot Import Pre-aggregate Modify / Transform
  • 9. Analytics Database CSVs / Excels / Google Sheets Operational Data Data Warehouse Reporting / Analysis Data Science / ML Reporting / BI Event Logs (behavioural data) Live Databases Live Databases Production DBs Daily Snapshot Import Pre-aggregate Modify / Transform What database should we pick?
  • 10. Transactional Applications vs Analytics Applications. Transactional applications: Data: ● Many single-row writes ● Current, single data Queries: ● Generated by user activities; 10 to 1000 users ● < 1s response time ● Short queries. Analytics applications: Data: ● Few large batch imports ● Years of data, many sources Queries: ● Generated by large reports; 1 to 10 users ● Queries run for hours ● Long queries. Ref: http://www.slideshare.net/PGExperts/really-big-elephants-postgresql-dw-15833438 (slide 5)
  • 12. Why start with Postgres? 1. Simple to Get Started 2. Rich Features for Analytics – Data Pipeline (ETL) – Data Analysis 3. Scale Up. Data Growth: (1) Start → (2) Grow → (3) Scale
  • 13. Why start with Postgres? 1. Simple to Get Started 2. Rich Features for Analytics – Data Pipeline (ETL) – Data Analysis 3. Scale Up. Data Growth: (1) Start → (2) Grow → (3) Scale
  • 14. 1. Simple to Get Started ● Data requests grow gradually as your company grows ● Business users care about results (not the backend) → Need something quick to start and easy to fine-tune along the way. Postgres: ● Free (open-source) ● Easy to set up 1. Simple start 2. Rich features 3. Scale up
  • 15. Why start with Postgres? 1. Simple to Get Started 2. Rich Features for Analytics – Data Pipeline (ETL) – Data Analysis 3. Scale Up. Data Growth: (1) Start → (2) Grow → (3) Scale
  • 16. Analytics Database CSVs / Excels / Google Sheets Operational Data Data Warehouse Reporting / Analysis Data Science / ML Reporting / BI Event Logs (behavioural data) Live Databases Live Databases Production DBs Daily Snapshot Import Pre-aggregate Modify / Transform Data Pipeline (ETL) Data Analysis 1. Simple start 2. Rich features 3. Scale up
  • 17. Analytics Database CSVs / Excels / Google Sheets Data Warehouse Event Logs (behavioural data) Live Databases Live Databases Production DBs Table Table Table Table Table Table Table Table Table Table Table Table Table Table Table Table 1. Simple start 2. Rich features 3. Scale up
  • 18. ● Managing Table Data: table partitioning ● Managing Disk Space: tablespace ● Write Performance: unlogged table ● Others: foreign data wrapper, point-in-time recovery 2 a- Data Pipeline (ETL) & Performance 1. Simple start 2. Rich features 3. Scale up
  • 19. ● Managing Table Data: table partitioning ● Managing Disk Space: tablespace ● Write Performance: unlogged table ● Others: foreign data wrapper, point-in-time recovery 2 a- Data Pipeline (ETL) & Performance 1. Simple start 2. Rich features 3. Scale up
  • 20. Managing Data Tables. Analytics tables hold lots of data: pageviews (+100k records a day) with columns date_d | country | user_id | browser | page_name | views ⇒ The table grows big quickly and becomes difficult to manage! Solution: Split (partition) it into multiple tables: pageviews_2015_06 pageviews_2015_07 pageviews_2015_08 pageviews_2015_09 Problem: Difficult to query data across multiple months 1. Simple start 2. Rich features 3. Scale up
  • 21. Managing Data Tables: parent table. pageviews_parent (parent table): pageviews_2015_06 pageviews_2015_07 pageviews_2015_08 pageviews_2015_09 … ALTER TABLE pageviews_2015_09 INHERIT pageviews_parent; ALTER TABLE pageviews_2015_09 ADD CHECK (date_d >= '2015-09-01' AND date_d < '2015-10-01'); 1. Simple start 2. Rich features 3. Scale up
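A minimal sketch of the inheritance-based partitioning this slide describes, built from scratch rather than with ALTER TABLE. The table names and column list come from the slides; the column types and sample rows are assumptions made for illustration.

```sql
-- Parent table: holds no rows itself, only defines the shared columns.
-- Column types are assumed; the slides list only the column names.
CREATE TABLE pageviews_parent (
    date_d    date NOT NULL,
    country   text,
    user_id   integer,
    browser   text,
    page_name text,
    views     integer
);

-- One child table per month, each with a CHECK constraint on its date range.
CREATE TABLE pageviews_2015_08 (
    CHECK (date_d >= DATE '2015-08-01' AND date_d < DATE '2015-09-01')
) INHERITS (pageviews_parent);

CREATE TABLE pageviews_2015_09 (
    CHECK (date_d >= DATE '2015-09-01' AND date_d < DATE '2015-10-01')
) INHERITS (pageviews_parent);

-- Loads target the child table for the relevant month directly.
INSERT INTO pageviews_2015_08 VALUES ('2015-08-31', 'VN', 1, 'chrome', 'home', 3);
INSERT INTO pageviews_2015_09 VALUES ('2015-09-01', 'VN', 1, 'chrome', 'home', 5);

-- Queries go through the parent and see the rows of every child.
SELECT date_d, SUM(views) AS views
FROM pageviews_parent
WHERE date_d >= DATE '2015-09-01'
GROUP BY date_d;
```

With `constraint_exclusion` left at its default value (`partition`), the planner can use the CHECK constraints to skip child tables that cannot match the `date_d` filter. PostgreSQL 10 and later also provide declarative partitioning (`PARTITION BY RANGE`), which postdates this talk.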
  • 22. ● Managing Table Data: table partitioning ● Managing Disk Space: tablespace ● Write Performance: unlogged table ● Others: foreign data wrapper, point-in-time recovery 2 a- Data Pipeline (ETL) & Performance 1. Simple start 2. Rich features 3. Scale up
  • 23. Managing Disk Space. The analytics DB holds lots of data, and disk space is limited ● SSD: fast, expensive ● SATA: cheap, slow. Data is accessed at different frequencies ● Hot Data ● Warm Data ● Cold Data 1. Simple start 2. Rich features 3. Scale up
  • 24. Managing Disk Space: tablespace. Tablespace: Define where your tables are stored on disk. CREATE TABLESPACE hot_data LOCATION '/disk0/ssd/'; CREATE TABLESPACE warm_data LOCATION '/disk1/sata2/'; /* beginning of the month */ CREATE TABLE pageviews_2016_08 (...) TABLESPACE hot_data; ALTER TABLE pageviews_2016_07 SET TABLESPACE warm_data; 1. Simple start 2. Rich features 3. Scale up
  • 25. Combining TABLESPACE and PARENT TABLE pageviews_2015_06 pageviews_2015_07 pageviews_2015_08 pageviews_2015_09 … pageviews_parent (parent table) 1. Simple start 2. Rich features 3. Scale up
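One way the two features might be combined at the start of a month is sketched below. It assumes the tablespaces `hot_data` and `warm_data` and the parent table `pageviews_parent` from the earlier slides already exist, and that `pageviews_2016_07` is last month's child table.

```sql
-- Start of the month: create the new partition on fast (SSD) storage
-- and attach it to the parent in the same statement.
CREATE TABLE pageviews_2016_08 (
    CHECK (date_d >= DATE '2016-08-01' AND date_d < DATE '2016-09-01')
) INHERITS (pageviews_parent)
TABLESPACE hot_data;

-- Demote last month's partition to cheaper storage.
-- This rewrites the table's files and holds an exclusive lock while it runs.
ALTER TABLE pageviews_2016_07 SET TABLESPACE warm_data;
```

Queries against `pageviews_parent` are unaffected by the move, since the tablespace only changes where a child table's files live. Indexes do not follow their table automatically; each one needs its own `ALTER INDEX ... SET TABLESPACE`.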
  • 26. ● Managing Table Data: table partitioning ● Managing Disk Space: tablespace ● Write Performance: unlogged table ● Others: foreign data wrapper, point-in-time recovery 2 a- Data Pipeline (ETL) & Performance 1. Simple start 2. Rich features 3. Scale up
  • 27. Analytics Database CSVs / Excels / Google Sheets Data Warehouse Event Logs (behavioural data) Live Databases Live Databases Production DBs Table Table Table Table Table Table Table Table Table Table Table Table Table Table Table Table Analytics tables can be rebuilt from source 1. Simple start 2. Rich features 3. Scale up
  • 28. Write Performance: unlogged table ● Transactional Safety: Every update is 2 writes: ○ Update data inside the table ○ Write to the WAL (Write-Ahead Log) ● UNLOGGED TABLE ○ Skips the WAL ○ Improved write performance CREATE UNLOGGED TABLE daily_summary (...); INSERT INTO daily_summary …; http://pgsnaga.blogspot.com/2011/10/data-loading-into-unlogged-tables-and.html 1. Simple start 2. Rich features 3. Scale up
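A small sketch of rebuilding a summary table as an unlogged table. The `users` table follows the definition used on the later slides; the columns of `daily_summary` and the sample rows are assumptions, since the slide elides them.

```sql
-- Source table with a few sample rows (illustrative data).
CREATE TABLE users (id int, gender varchar(10), created_at timestamp);
INSERT INTO users VALUES
    (1, 'male',   '2016-08-01 09:00'),
    (2, 'female', '2016-08-01 10:00'),
    (3, 'female', '2016-08-02 11:00');

-- Hypothetical summary table; writes to it bypass the WAL.
CREATE UNLOGGED TABLE daily_summary (
    date_d  date PRIMARY KEY,
    signups bigint NOT NULL
);

-- Rebuild step, safe to re-run: empty the table and reload it from source.
TRUNCATE daily_summary;
INSERT INTO daily_summary (date_d, signups)
SELECT created_at::date, COUNT(*)
FROM users
GROUP BY 1;

-- PostgreSQL 9.5+: convert to a regular logged table if durability is needed later.
ALTER TABLE daily_summary SET LOGGED;
```

The trade-off is durability: an unlogged table is emptied after a crash and is not replicated to standby servers, which is acceptable only because, as the previous slide notes, analytics tables can be rebuilt from source.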
  • 29. ● Managing Table Data: table partitioning ● Managing Disk Space: tablespace ● Write Performance: unlogged table ● Others: foreign data wrapper, point-in-time recovery 2 a- Data Pipeline (ETL) & Performance 1. Simple start 2. Rich features 3. Scale up
  • 30. ● Extract / transform ● Aggregate / summarize ● Statistical analysis 2- b- Data Analysis (writing SQLs) Analytics Database Data Warehouse Reporting / Analysis Data Science / ML Reporting / BI 1. Simple start 2. Rich features 3. Scale up
  • 31. 2 b- Data Analysis with Postgres ● SQL features ○ WITH clause ○ Window functions ○ Aggregation functions ○ Statistical functions ● Data structures ○ JSON / JSONB ○ Arrays ○ PostGIS (geo data) ○ Geometry (point, line, etc) ○ HyperLogLog (extension) ● PL/pgSQL ● Full-text search (n-gram) ● Performance: ○ Parallel queries (pg9.6) ○ Materialized views ○ BRIN index ● Others: ○ DISTINCT ON ○ VALUES ○ generate_series() ○ Support FULL OUTER JOIN ○ Better EXPLAIN 1. Simple start 2. Rich features 3. Scale up
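Two of the items under "Others" are easy to show in a few lines. The sketch below uses the `users` table from the later slides with made-up sample rows; it is an illustration, not a query from the talk.

```sql
CREATE TABLE users (id int, name varchar, gender varchar(10), created_at timestamp);
INSERT INTO users VALUES
    (1, 'Hung',  'male',   '2016-08-01 09:00'),
    (2, 'Lan',   'female', '2016-08-01 10:00'),
    (3, 'Tuyet', 'female', '2016-08-03 11:00');

-- generate_series(): one row per calendar day, so days without signups show up as 0.
SELECT g.d::date AS date_d, COUNT(u.id) AS daily_signups
FROM generate_series(TIMESTAMP '2016-08-01',
                     TIMESTAMP '2016-08-03',
                     INTERVAL '1 day') AS g(d)
LEFT JOIN users u ON u.created_at::date = g.d::date
GROUP BY 1
ORDER BY 1;

-- DISTINCT ON: keep only the first row per gender, here the earliest signup.
SELECT DISTINCT ON (gender) gender, name, created_at
FROM users
ORDER BY gender, created_at;
```

Without `generate_series()`, a plain GROUP BY on `users` would simply omit 2016-08-02, which tends to break charts and day-over-day comparisons.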
  • 32. CTE - Problem with Nested Queries SELECT ... FROM (SELECT ... FROM t1 JOIN (SELECT ... FROM ...) a ON (...) ) b JOIN (SELECT ... FROM ...) c ON (...) Nested queries a) are hard to read b) cannot be reused 1. Simple start 2. Rich features 3. Scale up
  • 33. CTE - Common Table Expressions (WITH clause) WITH a AS ( SELECT ... FROM ... ), b AS ( SELECT ... FROM t1 JOIN a ON (...) ), c AS ( SELECT ... FROM ... ) SELECT ... FROM b JOIN c ON ... ● SQL’s “private methods” ● A WITH query can be referenced multiple times ● Allows chaining instead of nesting 1. Simple start 2. Rich features 3. Scale up
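A concrete instance of the WITH pattern, answering the two business questions from the start of the talk (daily signups and daily video uploads). The `video_uploads` table and all sample rows are hypothetical.

```sql
CREATE TABLE users (id int, gender varchar(10), created_at timestamp);
CREATE TABLE video_uploads (user_id int, uploaded_at timestamp);  -- hypothetical table

INSERT INTO users VALUES
    (1, 'male',   '2016-08-01 09:00'),
    (2, 'female', '2016-08-01 10:00'),
    (3, 'female', '2016-08-02 11:00');
INSERT INTO video_uploads VALUES (1, '2016-08-01 12:00');

-- Each named subquery reads top to bottom instead of inside out.
WITH daily_signups AS (
    SELECT created_at::date AS date_d, COUNT(*) AS signups
    FROM users
    GROUP BY 1
),
daily_uploads AS (
    SELECT uploaded_at::date AS date_d, COUNT(*) AS uploads
    FROM video_uploads
    GROUP BY 1
)
SELECT date_d, s.signups, COALESCE(u.uploads, 0) AS uploads
FROM daily_signups s
LEFT JOIN daily_uploads u USING (date_d)
ORDER BY date_d;
```

One caveat when chaining many CTEs: before PostgreSQL 12 every CTE is materialized and acts as an optimization fence, so filters in the outer query are not pushed down into it.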
  • 34. CTE (cont.) ● Recursive CTE ● Writeable CTE /* move data from a to b */ WITH deleted_rows AS ( DELETE FROM a WHERE ... RETURNING * ) INSERT INTO b SELECT * FROM deleted_rows; 1. Simple start 2. Rich features 3. Scale up
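The slide shows a writeable CTE but no recursive one. A minimal sketch of a recursive CTE follows, walking a parent/child hierarchy; the `categories` table and its rows are hypothetical.

```sql
-- Hypothetical category tree: each row points at its parent.
CREATE TABLE categories (id int PRIMARY KEY, parent_id int, name text);
INSERT INTO categories VALUES
    (1, NULL, 'Video'),
    (2, 1,    'Drama'),
    (3, 2,    'Korean Drama'),
    (4, 1,    'Movies');

WITH RECURSIVE tree AS (
    -- Anchor: start from the root categories.
    SELECT id, name, 1 AS depth, name AS path
    FROM categories
    WHERE parent_id IS NULL
  UNION ALL
    -- Recursive step: attach the children of the rows found so far.
    SELECT c.id, c.name, t.depth + 1, t.path || ' > ' || c.name
    FROM categories c
    JOIN tree t ON c.parent_id = t.id
)
SELECT id, depth, path
FROM tree
ORDER BY path;
```

The recursion stops on its own once a step returns no new rows, which here happens after the deepest category has been reached.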
  • 35. Limitation of GROUP BY aggregate SELECT gender, COUNT(1) AS signups FROM users GROUP BY 1 ● GROUP BY aggregate: reduces a partition of data into 1 value ● What if we want to work through each row of each partition? 1. Simple start 2. Rich features 3. Scale up
  • 36. Window functions ● Window functions: compute over a moving frame of rows within 1 partition, without collapsing the rows ● Examples: ○ Calculate moving average ○ Cumulative sum ○ Ranking by partition ○ … 1. Simple start 2. Rich features 3. Scale up
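The cumulative sum and ranking cases are worked through on the next slides; the moving average is not, so here is a small sketch of it. The `daily_signups` table and its values are made up for illustration.

```sql
-- Hypothetical pre-aggregated table: one row per day.
CREATE TABLE daily_signups (date_d date PRIMARY KEY, signups int);
INSERT INTO daily_signups VALUES
    ('2016-08-01', 100),
    ('2016-08-02',  50),
    ('2016-08-03',  80),
    ('2016-08-04',  70);

-- The frame covers the current row and the two rows before it.
SELECT date_d,
       signups,
       ROUND(AVG(signups) OVER (ORDER BY date_d
                                ROWS BETWEEN 2 PRECEDING AND CURRENT ROW), 1)
           AS moving_avg_3
FROM daily_signups
ORDER BY date_d;
```

Note that a ROWS frame counts rows, not calendar days. If some dates are missing from the table, the average spans the last three rows present rather than the last three days, so gaps should be filled first when a true 3-day window is required.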
  • 37. Example: Cumulative Sum CREATE TABLE users ( id INT, gender VARCHAR(10), created_at TIMESTAMP ); SELECT created_at::date AS date_d, COUNT(1) AS daily_signups, SUM(COUNT(1)) OVER (ORDER BY created_at::date) AS cumulative_signups FROM users U GROUP BY 1 ORDER BY 1 | date_d | daily_signups | cumulative_signups | | 2016-08-01 | 100 | 100 | | 2016-08-02 | 50 | 150 | | 2016-08-03 | 80 | 230 | 1. Simple start 2. Rich features 3. Scale up
  • 38. SELECT gender, name, RANK() OVER (PARTITION BY gender ORDER BY created_at) AS signup_rnk FROM users U ORDER BY 1, 3; | gender | name | signup_rnk | | male | Hung | 1 | | male | Son | 2 | | ... | | female | Lan | 1 | | female | Tuyet | 2 | Example: Group by Gender and rank by signup time CREATE TABLE users ( id INT, name VARCHAR, gender VARCHAR(10), created_at TIMESTAMP ); 1. Simple start 2. Rich features 3. Scale up
  • 39. 2 b- Data Analysis with Postgres ● SQL features ○ WITH clause ○ Window functions ○ Aggregation functions ○ Statistical functions ● Data structures ○ JSON / JSONB ○ Arrays ○ PostGIS (geo data) ○ Geometry (point, line, etc) ○ HyperLogLog (extension) ● PL/pgSQL ● Full-text search (n-gram) ● Performance: ○ Parallel queries (pg9.6) ○ Materialized views ○ BRIN index ● Others: ○ DISTINCT ON ○ VALUES ○ generate_series() ○ Support FULL OUTER JOIN ○ Better EXPLAIN PostgreSQL is well suited for data analysis!
  • 40. Analytics Database CSVs / Excels / Google Sheets Operational Data Data Warehouse Reporting / Analysis Data Science / ML Reporting / BI Event Logs (behavioural data) Live Databases Live Databases Production DBs Daily Snapshot Import Pre-aggregate Modify / Transform Data Pipeline (ETL) Data Analysis 1. Simple start 2. Rich features 3. Scale up
  • 41. Why start with Postgres? 1. Simple to Get Started 2. Rich Features for Analytics – Data Pipeline (ETL) – Data Analysis 3. Scale Up. Data Growth: (1) Start → (2) Grow → (3) Scale
  • 42. 3- Scaling Up ● PostgreSQL downsides: ○ Optimized for transactional applications ○ Single-core execution; row-based storage ● CitusDB Extension ○ Automated data sharding and parallelization ○ Columnar Storage Format (better storage and performance) ● Vertica (HP) ○ Columnar Storage, Parallel Execution ○ Started by Michael Stonebraker (Postgres original author) ● Amazon Redshift ○ Built on ParAccel DB, itself derived from PostgreSQL 8.0.2 ○ Columnar Storage & Parallel Execution
  • 43. Other Proprietary DW Databases (Relational) ● Greenplum ● Teradata ● Infobright ● Google BigQuery ● Aster Data. Related to Postgres: ● Paraccel (Postgres fork) ● Vertica (from Postgres author) ● CitusDB (Postgres extension) ● Amazon Redshift (from Paraccel) 1. Simple start 2. Rich features 3. Scale up
  • 44. Compare: Popular SQL Databases
      | | PostgreSQL | MySQL | Oracle | SQL Server |
      | License / Cost | Free / Open-source | Free / Open-source | Expensive | Expensive |
      | DW features | Strong | Weak | Strong | Strong |
  • 45. ● SQL features ○ WITH clause ○ Window functions ○ Aggregation functions ○ Statistical functions ● Data structures ○ JSON / JSONB ○ Arrays ○ PostGIS (geo data) ○ Geometry (point, line, etc) ○ HyperLogLog (extension) ● PL/pgSQL ● Full-text search (n-gram) ● Performance: ○ Parallel queries (pg9.6) ○ Materialized views ○ BRIN index ● Others: ○ DISTINCT ON ○ VALUES ○ generate_series() ○ Support FULL OUTER JOIN ○ Better EXPLAIN
  • 46. ● SQL features ○ WITH clause ○ Window functions ○ Aggregation functions ○ Statistical functions ● Data structures ○ JSON / JSONB ○ Arrays ○ PostGIS (geo data) ○ Geometry (point, line, etc) ○ HyperLogLog (extension) ● PL/pgSQL ● Full-text search ● Performance: ○ Parallel queries (pg9.6) ○ Materialized views ○ BRIN index ● Others: ○ DISTINCT ON ○ VALUES ○ generate_series() ○ Support FULL OUTER JOIN ○ Better EXPLAIN
  • 47.
  • 48. Summary 1. Simple to Get Started 2. Rich Features for Analytics – Data Pipeline (ETL) – Data Analysis 3. Easy to Scale Up. Data Growth: (1) Start → (2) Grow → (3) Scale
  • 49. Summary (cont.) ● Why start with Postgres ● Scaling up to DW databases ● Comparing with other transactional DBs ● Not covered: ○ How to set up PostgreSQL for DW ○ Performance Optimizations ○ Behavioural Data: Hadoop, Spark, HDFS