Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

10 Reasons to Start Your Analytics Project with PostgreSQL

3,311 views

Published on

My talk slides @ HKOSCon 2016 香港開源年會
https://2016.opensource.hk/

Published in: Technology
  • Be the first to comment

10 Reasons to Start Your Analytics Project with PostgreSQL

  1. 1. 10 Reasons To Start Your Analytics Project with PostgreSQL Satoshi Nagayasu @snaga HKOSCon 2016
  2. 2. Agenda • Collecting Data / Database Federation • Building Data Warehouse and Data Mart • Writing Queries / SQL Features • Performance • In-Database Analytics
  3. 3. Collecting Data / Database Federation Foreign Data Wrapper Unlogged Table
  4. 4. Foreign Data Wrapper • Connects external data sources (RDBMS, NoSQL, files, etc) to the PostgreSQL executor. • Allows SELECT/INSERT/UPDATE/DELETE operations for external tables. PostgreSQL Oracle MySQL HDFS https://wiki.postgresql.org/wiki/Foreign_data_wrappers
  5. 5. Unlogged Table • Does not record XLOG. • Has better performance compared to regular table. • Will be truncated after crash recovery. http://pgsnaga.blogspot.jp/2011/10/data-loading-into-unlogged-tables-and.html
  6. 6. Building Data Warehouse and Data Mart Materialized Views Transactional DDLs
  7. 7. Materialized View • Defines a view with caching records. • Allows to avoid running complicated queries and aggregations every time. • Requires updating cache by the users. Table View Table Table Materialized View Table Query Query Cache
  8. 8. Transactional DDLs • Most of DDLs can be performed in transaction in PostgreSQL. • Schema can be modified with keeping atomicity even online. (commit or rollback) • Transactional DDLs would help DBAs manage their schema easier.
  9. 9. Writing Queries / SQL Features Rich SQL features Compatibility with SQL standard
  10. 10. Writing Queries / SQL Features • Rich SQL features – Subqueries – WITH clauses (Common Table Expressions, CTEs) – Many aggregation functions – Window functions • JSON support • Compatibility with the SQL standard
  11. 11. WITH clause • Defines a temporary table for a query. • May make a better performance compared to using the same subquery more than once. WITH foo AS ( SELECT ... FROM ... GROUP BY ... ) SELECT ... FROM foo WHERE ... UNION ALL SELECT ... FROM foo WHERE ...; https://www.postgresql.org/docs/9.5/static/queries-with.html
  12. 12. Many Aggregations • New in 9.4 – percentile_cont() – percentile_disc() – mode() – rank() – dense_rank() – percent_rank() – cume_dist() • New in 9.5 – ROLLUP() – CUBE() – GROUPING SETS() https://www.postgresql.org/docs/9.5/static/functions-aggregate.html
  13. 13. ROLLUP • Calculates total/subtotal values
  14. 14. CUBE • Calculates for all combinations of the specified columns
  15. 15. GROUPING SETS • Runs multiple GROUP BY queries at once Two GROUP BYs at once.
  16. 16. JSON data type testdb=# create table t1 ( j jsonb ); CREATE TABLE testdb=# insert into t1 values ('{ "key1": "value1", "key2": "value2" }'); INSERT 0 1 testdb=# select * from t1; j -------------------------------------- {"key1": "value1", "key2": "value2"} (1 row) testdb=# select j->>'key2' key2 from t1; key2 -------- value2 (1 row)
  17. 17. JSON data type testdb=# select n_nationkey,n_name from nation where n_nationkey = 12; n_nationkey | n_name -------------+--------------------------- 12 | JAPAN (1 row) testdb=# select jsonb_build_object('n_nationkey', n_nationkey, 'n_name', n_name) from nation where n_nationkey = 12; jsonb_build_object ------------------------------------------------------------ {"n_name": "JAPAN ", "n_nationkey": 12} (1 row)
  18. 18. JSON data type Operator Description 9.4 -> Get an element by key as a JSON object ->> Get an element by key as a text object #> Get an element by path as a JSON object #>> Get an element by path as a text object <@, @> Evaluate whether a JSON object contains a key/value pair ? Evaluate whether a JSON object contains a key or a value ?| Evaluate whether a JSON object contains ANY of keys or values ?& Evaluate whether a JSON object contains ALL of keys or values 9.5 || Insert or Update an element to a JSON object - Delete an element by key from a JSON object #- Delete an element by path from a JSON object http://www.postgresql.org/docs/9.5/static/functions-json.html
  19. 19. JSON data type • Allows to collect data without defining schema. • “Schema-less”, “Schema on Read” or “Schema- later”. • Still accessible with SQL. JSON Data Type Fluentd pg-Json plugin View (Schema) App App Fluentd
  20. 20. Performance 3 types of Join Full text search (n-gram) Table Partition BRIN Index Table Sample Parallel Queries
  21. 21. 3 types of Join • Nested Loop (NL) Join – Works good when joining small number of records between tables with indexes. • Merge Join • Hash Join – Works better than NL when joining large number of records between large tables.
  22. 22. Full-text search (n-gram) • Splits a text into N-char tokens and build an index. – Pg_trgm: Tri-gram (3-char) – Pg_bigm: Bi-gram (2-char) • CJK has lots of 2-char words, so Bi-gram may be useful rather than Tri-gram. – CJK: Chinese, Japanese and Korean. Pg_trgm: https://www.postgresql.org/docs/9.5/static/pgtrgm.html Pg_bigm: http://pgbigm.osdn.jp/index_en.html
  23. 23. Pg_bigm performance • Wikipedia title data (2,789,266 records) – https://dumps.wikimedia.org/zhwiki/20160601/ – zhwiki-20160601-pages-articles-multistream-index.txt.bz2 zhwikidb=> select * from zhwiki_index where title like '%香港%'; id1 | id2 | title ----------+-------+---------------------------------------- 5693863 | 2087 | 香港特別行政區基本法第二十三條 11393231 | 4323 | 香港特别行政区 12830042 | 5085 | 香港大学列表 14349335 | 6088 | 香港行政区划 14349335 | 6090 | 香港行政區劃 14349335 | 6091 | 香港十八区 14349335 | 6092 | 香港十八區 16084672 | 7168 | 香港兒童文學作家 18110426 | 8206 | 北區 (香港) 18110426 | 8236 | 東區 (香港) 19537078 | 9528 | 香港專業教育學院 19537078 | 9567 | 香港中文大學
  24. 24. Pg_bigm performance Aggregate (actual time=481.512..481.541 rows=1 loops=1) -> Seq Scan on zhwiki_index (actual time=1.458..478.326 rows=317 loops=1) Filter: (title ~~ '%香港電影%'::text) Rows Removed by Filter: 2788949 Planning time: 0.125 ms Execution time: 481.654 ms (6 rows) select count(*) from zhwiki_index where title like '%香港電影%';
  25. 25. Pg_bigm performance Aggregate (actual time=1.790..1.792 rows=1 loops=1) -> Bitmap Heap Scan on zhwiki_index (actual time=0.299..1.225 rows=317 loops=1) Recheck Cond: (title ~~ '%香港電影%'::text) Rows Removed by Index Recheck: 1 Heap Blocks: exact=191 -> Bitmap Index Scan on zhwiki_index_title_idx (actual time=0.258..0.258 rows=318 loops=1) Index Cond: (title ~~ '%香港電影%'::text) Planning time: 0.103 ms Execution time: 1.833 ms (9 rows) select count(*) from zhwiki_index where title like '%香港電影%'; 481.6ms → 1.8ms. 200x faster than a regular LIKE.
  26. 26. Table Partition • Table Partitioning by Range or List – Called “Constraint Exclusion” • Does not scan unnecessary partitions – Determined by the “constraints”. • Is able to eliminate “full table scan” for large tables entirely. https://www.postgresql.org/docs/9.5/static/ddl-partitioning.html
  27. 27. BRIN Index • Block Range INdex (New in 9.5) – Holds "summary“ data, instead of raw data. – Reduces index size tremendously. – Also reduces creation/maintenance cost. – Needs extra tuple fetch to get the exact record. 0 50,000 100,000 150,000 200,000 250,000 300,000 Btree BRIN Elapsedtime(ms) Index Creation 0 50,000 100,000 150,000 200,000 250,000 300,000 Btree BRIN NumberofBlocks Index Size 0 2 4 6 8 10 12 14 16 18 Btree BRIN Elapsedtime(ms) Select 1 record https://gist.github.com/snaga/82173bd49749ccf0fa6c
  28. 28. BRIN Index • Structure of BRIN Index Table File Block Range 1 (128 Blocks) Block Range 2 Block Range 3 Block Range Min. Value Max. Value 1 1992-01-02 1992-01-28 2 1992-01-27 1992-02-08 3 1992-02-08 1992-02-16 … … … Holds only min/max values for “Block Ranges”, 128 blocks each. (in case a date column)
  29. 29. TABLESAMPLE • Allows to get approximate results for aggregations by sampling. • BERNOULLI – Accurate – Sample by Tuple • SYSTEM – Performance – Sample by Block http://blog.2ndquadrant.com/tablesample-in-postgresql-9-5-2/
  30. 30. TABLESAMPLE • Calculating the average of total price. – The actual value and the approximate ones
  31. 31. TABLESAMPLE Without TABLESAMPLE 1787ms SYSTEM Sampl. 22ms BERNOULLI Sampl. 405ms
  32. 32. Parallel Queries • The leader process cooperates with those worker processes for: – Sequential scan – Joins (Nested Loop & Hash) – Aggregations • Will be shipped with 9.6 – 9.6 is beta2 as of today Leader Worker Worker Client Data Read & Examine Query Result Launch & Gather
  33. 33. Parallel Aggregation Performance & Scalability • count(*) on 30M rows – Shows a good parallel scalability
  34. 34. In-Database Analytics User Defined Functions Apache MADlib
  35. 35. In-Database Analytics • In-Database Analytics? – Performs analytics workload in the database without pulling the data out of the server. • Advantages of In-Database Analytics – No need to move “BigData” between server and client for analytics. – Higher performance hardware resources (CPU, memory, storage) compared to client PCs.
  36. 36. In-Database Analytics • User defined functions – PL/Python, PL/R, PL/v8, ... or C lang. – Allow you to run (almost) any logics within the database. • Apache MADlib – Machine Learning Library for PostgreSQL
  37. 37. UDF by Python CREATE OR REPLACE FUNCTION dumpenv(OUT text, OUT text) RETURNS SETOF record AS $$ import os for e in os.environ: plpy.notice(str(e) + ": " + os.environ[e]) yield(e, os.environ[e]) $$ LANGUAGE plpythonu;
  38. 38. UDF by Python CREATE OR REPLACE FUNCTION dumpenv(OUT text, OUT text) RETURNS SETOF record AS $$ import os for e in os.environ: plpy.notice(str(e) + ": " + os.environ[e]) yield(e, os.environ[e]) $$ LANGUAGE plpythonu; testdb=# select * from dumpenv() order by 1 limit 10; column1 | column2 --------------------+----------------------- G_BROKEN_FILENAMES | 1 HISTCONTROL | ignoredups HISTSIZE | 1000 HOME | /home/snaga HOSTNAME | localhost.localdomain LANG | ja_JP.UTF-8 LC_COLLATE | C LC_CTYPE | C LC_MESSAGES | C LC_MONETARY | C (10 rows)
  39. 39. Apache MADlib • An Open Source Machine Learning Library – Can run in PostgreSQL, Greenplum Database and Apache HAWQ. – Supports many ML algorithms. http://madlib.incubator.apache.org/
  40. 40. Others Strict type checking and constraints. Industry Standard Interface (for BI tools)
  41. 41. Others • Strict type checking and constraints. – Avoid “Garbage in, garbage out.” • Industry Standard Interface (for BI tools) – ODBC, JDBC
  42. 42. Summary • PostgreSQL has already had lots of features that help your analytics project – In terms of productivity and performance. • And more “BigData” features are coming in the future release. – Parallel query must be a big-shot. • Let’s start your analytic project with PostgreSQL and join our community.  – PostgreSQL 9.6 beta2 is available now!
  43. 43. Resources • http://www.postgresql.org • http://wiki.postgresql.org • http://planet.postgresql.org • http://pgcon.org
  44. 44. pgDay Asia 2016 • pgDay Asia 2016 / FOSSASIA 2016 – March 17-19 in Singapore • Speakers: – 19+ speakers from 9 countries • Sessions: – 19 Regular Sessions. – Plus, lightning talks • Attendees: – Around 100 attendees
  45. 45. pgDay Asia 2017 • FOSSASIA 2017 (March, 2017) – Probably, the same format, in the same season, in the same region. • Do not miss the next one! – Will be better and bigger.  • Join us at: – http://pgday.asia – https://www.facebook.com/pgdayasia
  46. 46. Q&A

×