Producing and Analyzing Rich Data with PostgreSQL

As a data engineer at Chartio, I have spent a large part of my work helping data teams get the most out of their data pipelines and warehouses, so the topic of data cleansing and processing is near and dear to me. Over the past five years or so, I’ve noticed a perception that relational databases are only good at descriptive statistics (count, sum, avg, etc.) on medium-sized structured data sets. In other words, the belief is that SQL just doesn’t work for inferential, predictive, or causal analysis on larger or unstructured data sets. Although this may have been true five years ago, it’s a lot less true today.

  1. How to get the best of both worlds with PostgreSQL
     Relational and Non-Relational Databases
     sales@chartio.com, (855) 232-0320
  2. The Prevailing View - Logical
     - Schema objects
       - Relational: structured rows and columns; schema on write; referential integrity; painful migrations
       - Non-relational: unstructured files, docs, etc.; schema on read; no referential integrity; no migrations
     - Query languages
       - Relational: SQL; declarative; easy enough for non-tech users
       - Non-relational: various; procedural; requires some programming skills
     - Exploratory analysis
       - Relational: native support for joins; interactive/low execution overhead
       - Non-relational: no native support for joins; OLAP - batch processing
     - Data science and ML
       - Relational: only descriptive statistics; requires exporting dumps/samples
       - Non-relational: robust ecosystem; does not require exports
  3. The Prevailing View - Physical
     - Parallel query processing
       - Relational: single node system; single process per query
       - Non-relational: multiple node system; multiple processes per query
     - Concurrency
       - Relational: high concurrency; single process per connection
       - Non-relational: OLAP - low concurrency/high scheduling overhead
     - High Availability & Replication
       - Relational: async and sync replication; HA may not be native
       - Non-relational: async and sync replication; HA likely to be native
     - Sharding
       - Relational: sharding may not be native; difficult to manage
       - Non-relational: sharding likely to be native; easy to manage
  4. The Prevailing View - Summary
     - RDBMS have nice properties for producing rich data
       - Easier for non-tech users and exploratory analysis
     - Probably don’t meet the needs of today’s analysts
       - Data science & Machine Learning
       - Parallel processing
     - Definitely don’t meet the needs of today’s apps
       - Schema migrations
       - Replication and sharding
  5. The Practical Reality
     - The product offerings are starting to overlap
     - As we’ll see, old dogs are learning new tricks and vice versa
     - However, the market sizes and talent pools still look very different
  6. SQL 2011: Not Your Parents’ SQL
     - Many people still think of SQL in terms of SQL-92
     - Since then we’ve had: SQL:1999, SQL:2003, SQL:2006, SQL:2008, SQL:2011
       - http://use-the-index-luke.com/blog/2015-02/modern-sql
     - Common Table Expressions (CTEs) / Recursive CTEs
     - Window Functions
     - Ordered-set Aggregates
     - Lateral joins
     - Temporal support
     - The list goes on...
  7. The PostgreSQL Extension System
     - Various extension points
       - Procedural languages
       - Foreign data wrappers
       - Data types/operators
       - UDFs/UDAs (SQL, C/C++)
       - Indexes (GiST, GIN)
       - Custom Background Workers
     - PostgreSQL Extension Network
       - http://pgxn.org/
       - PyPI, RubyGems, CPAN, CRAN
  8. Query Languages
     - Not only SQL
     - Native: pgSQL, Tcl, Perl, Python
     - Community: Java, PHP, R, Javascript, Ruby, Scheme, sh
  9. Data Types & Schema Objects
     - JSONB
       - https://wiki.postgresql.org/images/b/b4/Pg-as-nosql-pgday-fosdem-2013.pdf
     - postgresql-hll
     - PostGIS
     - Foreign data wrappers
       - Oracle, SQL Server, MySQL, JDBC, MongoDB, CouchDB, Redis, S3, twitter, OpenStreetMap, LDAP, RSS, more…
       - Multicorn
       - https://github.com/shish/pgosquery
  10. Data Science and Machine Learning
      - http://madlib.net/
  11. Sharding
      - https://github.com/citusdata/pg_shard
  12. Parallel Processing
      - MPP
        - Proprietary
        - Open source
      - Columnar FDW:
        - https://github.com/citusdata/cstore_fdw
      - Parallel query
        - http://rhaas.blogspot.com/2015/03/parallel-sequential-scan-for-postgresql.html
  13. How Far Can We Go?
      - Web app framework
        - http://blog.aquameta.com/
      - REST API
        - https://github.com/begriffs/postgrest
      - Unit testing framework
        - http://pgtap.org/
      - Firewall
        - https://github.com/uptimejp/sql_firewall
      - Full-text search
        - https://github.com/zombodb/zombodb
  14. Conclusion
      - With today’s RDBMS you get
        - more than rows and columns
        - more than SELECT, FROM, WHERE, GROUP BY, ORDER BY
        - more than a single machine
      - Make sure you get the full return on your investment!
      - Get your Chartio free trial! sales@chartio.com, (855) 232-0320
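
The "Not Your Parents’ SQL" slide names recursive CTEs and window functions among the features added since SQL-92. The sketch below is one way to try both. It is an illustration under stated assumptions, not something taken from the deck: SQLite (bundled with Python, version 3.25 or newer for window functions) stands in for PostgreSQL so that it runs without a server, and the `employees` table and its rows are invented for the example. The queries use only standard syntax, so they should carry over to PostgreSQL largely unchanged.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for PostgreSQL.
conn = sqlite3.connect(":memory:")

# Hypothetical sample data: a small reporting hierarchy with salaries.
conn.executescript("""
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY,
        name TEXT,
        manager_id INTEGER,
        salary INTEGER
    );
    INSERT INTO employees VALUES
        (1, 'Ada', NULL, 300),
        (2, 'Bob', 1,    200),
        (3, 'Cy',  1,    250),
        (4, 'Dee', 2,    100);
""")

# Recursive CTE: start at the root (no manager) and walk down the
# reporting chain, tracking how deep each employee sits.
org_chart = conn.execute("""
    WITH RECURSIVE chain(id, name, depth) AS (
        SELECT id, name, 0 FROM employees WHERE manager_id IS NULL
        UNION ALL
        SELECT e.id, e.name, c.depth + 1
        FROM employees e JOIN chain c ON e.manager_id = c.id
    )
    SELECT name, depth FROM chain ORDER BY depth, name
""").fetchall()

# Window functions: rank salaries and keep a running total while
# still returning one row per employee (no GROUP BY collapse).
ranked = conn.execute("""
    SELECT name, salary,
           RANK() OVER (ORDER BY salary DESC) AS salary_rank,
           SUM(salary) OVER (ORDER BY salary DESC) AS running_total
    FROM employees
    ORDER BY salary_rank
""").fetchall()

for row in org_chart:
    print(row)
for row in ranked:
    print(row)
```

Both queries do work that SQL-92 would typically push into application code: the first traverses a hierarchy of unknown depth, and the second computes per-row analytics alongside the detail rows.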
