Islamabad PUG - Big Data & PostgreSQL; Using TABLESAMPLE to Analyze Very Large Datasets
© 2ndQuadrant 2016
Who am I?
Director, Products @ 2ndQuadrant
Got “pushed” into PostgreSQL in 2004, ended up falling in love with it
Not a hardcore techie, yet passionate about open source software
Interested in Big Data, especially the newer PostgreSQL features supporting it
What is Big Data?
Volume
Size: Text files to HD videos
Sources: Spreadsheets to sensors
From lakes to oceans
Velocity
More sources imply more speed
Faster connectivity implies more speed
High-paced world requires faster turnaround
Variety
What is the problem?
Number of Rows    Size on Disk (MB)    Time Taken (ms)
1k                0.23                 219.706
100k              24                   1,302.135
1M                195                  7,696.386
5M                951                  40,691.603
10M               1,923                60,012.457
100M              19,456               801,493.319
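The slide does not say which query produced these timings. As a rough sketch only, figures of this kind could be gathered as follows; the table name sample_test, its columns, and the choice of a full-table read are all assumptions made for illustration.

```sql
-- Hypothetical test table; the name and columns are illustrative only.
CREATE TABLE sample_test AS
SELECT g AS id, md5(g::text) AS payload
FROM generate_series(1, 1000000) AS g;

-- Number of rows and size on disk in MB.
SELECT count(*) AS row_count,
       round(pg_relation_size('sample_test') / (1024.0 * 1024.0), 2) AS size_mb
FROM sample_test;

-- EXPLAIN ANALYZE runs the query and reports its execution time in ms.
EXPLAIN ANALYZE SELECT * FROM sample_test;
```

Changing the upper bound passed to generate_series gives the other table sizes.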
Why is this significant?
Data mining has typically been a painful process
A major contributor to the pain has been the time it takes for queries to return
There are many false steps before the required data is identified
Waiting time is wasted time
Sampling, whether count based or time based, reduces the wasted time significantly
What is TABLESAMPLE?
Ability to read a random sample of data in a table
Defined in SQL:2003 (5th revision of SQL)
Implemented in PostgreSQL 9.5
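A minimal sketch of what this looks like in practice, using the two sampling methods built into PostgreSQL 9.5 (SYSTEM and BERNOULLI). The table name sample_test is hypothetical; the argument is the percentage of the table to sample.

```sql
-- sample_test is a hypothetical table name; substitute a real table.

-- SYSTEM picks whole pages (blocks) at random: fast, but rows arrive in clusters.
SELECT * FROM sample_test TABLESAMPLE SYSTEM (1);      -- roughly 1% of the table

-- BERNOULLI considers every row individually: slower, but more evenly spread.
SELECT * FROM sample_test TABLESAMPLE BERNOULLI (1);   -- roughly 1% of the rows

-- A rough row-count estimate scaled up from a 1% sample.
SELECT count(*) * 100 AS estimated_rows
FROM sample_test TABLESAMPLE SYSTEM (1);
```

The sample size is approximate, so the scaled-up estimate varies from run to run.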
REPEATABLE results
(Reminder: [ REPEATABLE ( seed ) ])
Optional argument
Used if random, yet repeatable results are required
Both the seed and the sampling argument need to be the same to produce repeatable results
Any changes made to the table in the meantime may result in a different data set
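As a small illustration, again assuming a hypothetical table named sample_test and an arbitrary seed of 42:

```sql
-- sample_test is a hypothetical table name; 42 is an arbitrary seed.

-- These two queries use the same method, argument, and seed, so they
-- return the same rows as long as the table is not modified in between.
SELECT * FROM sample_test TABLESAMPLE BERNOULLI (0.1) REPEATABLE (42);
SELECT * FROM sample_test TABLESAMPLE BERNOULLI (0.1) REPEATABLE (42);

-- A different seed (or a different argument) will usually give a different sample.
SELECT * FROM sample_test TABLESAMPLE BERNOULLI (0.1) REPEATABLE (43);
```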
Now it gets interesting …
TABLESAMPLE allows for additional sampling methods via extensions
tsm_system_time specifies max number of milliseconds to spend reading a table
Implements the syntax:
SELECT select_expression
FROM table_name
TABLESAMPLE SYSTEM_TIME (argument)
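A hedged usage sketch, assuming the contrib modules are installed and that sample_test is a hypothetical table. The count-based tsm_system_rows extension is shown alongside as the companion module that ships with PostgreSQL 9.5.

```sql
-- sample_test is a hypothetical table name.

-- Time-based sampling: read as many blocks as possible in about 1000 ms.
CREATE EXTENSION IF NOT EXISTS tsm_system_time;
SELECT * FROM sample_test TABLESAMPLE SYSTEM_TIME (1000);

-- Count-based sampling: return at most 100 rows, read block by block.
CREATE EXTENSION IF NOT EXISTS tsm_system_rows;
SELECT * FROM sample_test TABLESAMPLE SYSTEM_ROWS (100);
```

Neither of these two methods supports the REPEATABLE clause, and both sample at block level, so the rows returned tend to be clustered.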
Enter Orange ...
Funded by AXLE (http://axleproject.eu)
Same project funded TABLESAMPLE
Available integrated with PostgreSQL in 2UDA (http://2ndquadrant.com/2uda)
Uses TABLESAMPLE to very quickly create visualizations for data
Can quickly create predictive models
Other Big Data features in PostgreSQL
HSTORE
XML
JSON & JSONB
BRIN INDEXES
Parallel sequential scan
Parallel aggregates
FDWs
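To give a flavour of two of these features, here is a brief sketch using a made-up sensor_log table; the table, column names, and the JSON keys are assumptions for illustration only.

```sql
-- Hypothetical table for illustration.
CREATE TABLE sensor_log (
    logged_at timestamptz NOT NULL,
    reading   jsonb       NOT NULL
);

-- BRIN index: a very small index suited to large tables whose values
-- follow the physical row order, such as append-only timestamps.
CREATE INDEX sensor_log_logged_at_brin
    ON sensor_log USING brin (logged_at);

-- JSONB: filter by containment and extract one field as text.
SELECT logged_at, reading ->> 'temperature' AS temperature
FROM sensor_log
WHERE reading @> '{"sensor": "s1"}';
```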
Moving Forward …
Next meetup: Tentatively August 19, 2016
Please come forward and share your PostgreSQL stories
Today’s refreshments are sponsored by 2ndQuadrant - THANK YOU!
Need more sponsors OR need to start charging for these sessions
Umair Shahid
Email: umair.shahid@2ndQuadrant.com
Twitter: @pg_umair
2ndQuadrant is hiring - All geographies!
Thank you for your time!