© 2ndQuadrant 2016
Big Data & PostgreSQL
Using TABLESAMPLE to Analyze
Very Large Datasets
By Umair Shahid
Who am I?
● Got “pushed” into PostgreSQL in 2004, ended
up falling in love with it
● Not a hardcore techie, yet passionate about
open source software
● Heading the productization efforts at
2ndQuadrant
● Interested in Big Data, specifically the newer
PostgreSQL features supporting it
What is the problem?
Number of Rows   Size on Disk (MB)   Time Taken (ms)
1k                            0.23           219.706
100k                            24         1,302.135
1M                             195         7,696.386
5M                             951        40,691.603
10M                          1,923        60,012.457
100M                        19,456       801,493.319
Why is this significant?
● Data mining has typically been a painful process
● Major contributor to the pain has been the time it
takes for queries to return
● Many false steps before the required data is
identified
● Waiting time is wasted time
● Sampling, count-based or time-based, reduces
the wasted time significantly
What is TABLESAMPLE?
● Ability to read a random sample of
data in a table
● Defined in SQL:2003 (5th revision of
SQL)
● Implemented in PostgreSQL 9.5
Syntax
SELECT select_expression
FROM table_name
TABLESAMPLE sampling_method ( argument [, ...] )
[ REPEATABLE ( seed ) ]
...
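As a minimal sketch of the syntax above, assuming a hypothetical table named measurements with a numeric column reading (neither name comes from the original slides), a query that reads roughly 1% of the table might look like this:

```sql
-- Hypothetical table and column names, for illustration only.
-- Sample roughly 1% of the table and average a column over that sample.
SELECT avg(reading) AS approx_avg_reading
FROM measurements TABLESAMPLE SYSTEM (1);
```

The TABLESAMPLE clause attaches to the table in the FROM list, and the sample is taken before any WHERE conditions are applied, so a filter only sees the rows that were sampled.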
sampling_method
● argument is the percentage of the table to sample (0 to 100)
● SYSTEM
○ Block level sampling
○ Very fast
○ Non-independent rows
● BERNOULLI
○ Row level sampling
○ Slower than SYSTEM
○ Independent rows (uniformly random)
Demo sampling methods
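The live demo is not part of the slide text. The following is only a rough sketch of how the two built-in methods could be compared, assuming a hypothetical table named measurements and an arbitrary 10% sample size:

```sql
-- Hypothetical table name; the percentage (10) is an arbitrary choice.

-- SYSTEM: selects whole blocks at random and returns every row in them.
SELECT count(*) AS rows_in_sample
FROM measurements TABLESAMPLE SYSTEM (10);

-- BERNOULLI: scans the whole table and keeps each row independently
-- with a 10% probability.
SELECT count(*) AS rows_in_sample
FROM measurements TABLESAMPLE BERNOULLI (10);

-- Rough estimate of the total row count, scaled up from a 10% sample.
SELECT count(*) * 10 AS estimated_total_rows
FROM measurements TABLESAMPLE BERNOULLI (10);
```

Running each statement under \timing in psql, or with EXPLAIN ANALYZE, is one way to see the speed difference between the two methods. The row counts vary from run to run because neither query fixes a seed.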
REPEATABLE results
● (Reminder: [ REPEATABLE ( seed ) ])
● Optional argument
● Used if random, yet repeatable results are
required
● seed and argument need to be the same to
produce repeatable results
● Changes made to the table between runs can
result in a different data set
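A short sketch of a repeatable sample, assuming a hypothetical table measurements with an id column and an arbitrary seed of 42:

```sql
-- Hypothetical table and column names; the seed (42) is arbitrary.
-- Running this statement twice returns the same rows, provided the
-- table has not been modified in between.
SELECT id
FROM measurements TABLESAMPLE BERNOULLI (5) REPEATABLE (42);

-- Changing either the seed or the percentage usually selects a
-- different sample.
SELECT id
FROM measurements TABLESAMPLE BERNOULLI (5) REPEATABLE (43);
```

This is useful when a query is being refined step by step and each revision should be judged against the same sampled rows.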
Now it gets interesting …
● TABLESAMPLE allows for additional sampling methods
via extensions
● tsm_system_time specifies max number of
milliseconds to spend reading a table
● Implements the syntax:
SELECT select_expression
FROM table_name
TABLESAMPLE SYSTEM_TIME (argument)
Demo tsm_system_time
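The demo itself is not captured in the slides. The following is a minimal sketch of how the extension could be tried, assuming the contrib modules are installed and a hypothetical table named measurements exists:

```sql
-- Install the sampling method into the current database
-- (requires the contrib modules to be available on the server).
CREATE EXTENSION IF NOT EXISTS tsm_system_time;

-- Hypothetical table name. Spend at most about 1000 ms reading the
-- table, then count however many rows were read in that time.
SELECT count(*) AS rows_read
FROM measurements TABLESAMPLE SYSTEM_TIME (1000);
```

The number of rows returned depends on how much of the table can be read within the time limit, so it varies with hardware, caching, and load. Like SYSTEM, this method samples at the block level, and it does not accept a REPEATABLE clause.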
Enter Orange ...
● Funded by AXLE (http://axleproject.eu)
● Same project funded TABLESAMPLE
● Available integrated with PostgreSQL in 2UDA (http://2ndquadrant.com/2uda)
● Uses TABLESAMPLE to very quickly create visualizations for data
● Can quickly create predictive models
Demo Orange
You can find a very helpful tutorial at
http://2ndquadrant.com/2uda
Other Big Data features in PostgreSQL
● JSON & JSONB
● HSTORE
● XML
● Scale-out by partitioning
○ Check out Postgres-XL (http://www.postgres-xl.org/)
● etc ...
Umair Shahid
Email: umair.shahid@2ndQuadrant.com
Twitter: @pg_umair
2ndQuadrant is hiring - All geographies!
Thank you for your time!
