HTAP Queries & Data Fabrics
Atif Rahman
@mantaq10
7th December, 2018
PostgreSQL Down Under
Melbourne, Australia
The agenda
OLTP vs OLAP vs HTAP
The Problem Statement
Data Fabrics
Key PostgreSQL Features
Foreign Data Wrappers
Distributed Cache
Background
Patterns
Components
OLTP vs OLAP vs HTAP
OLTP
OLAP
HTAP
Smaller
transactions
Lots of
them
Lots of
Updates
ACID (Transactional)
CRUD (Commands)
complex
queries
Large
working sets
Bulk loads &
offloads
INDEXING
AGGREGATIONS
Mixed
workloads
on the same
system
Analytics on
‘inflight’
transactional
data
Addresses
resource
contention
FEDERATION
SYNCHRONISATION
Hybrid Transactional / Analytical Processing
Data Organisation
Row wise Writes
OLTP
Column Wise Reads
OLAP
OLTP OLAP
Analytics
Latency
Data
Redundancy
Single platform for
multimodel HTAP
HTAP
Data & Application Integration
Data Integration
(ETL/ELT)
Application Integration
(ESB / API)
Data Virtualisation
(DV)
Low Fidelity View
Integration Type Physical Movement and
Consolidation
Synchronization and
Propagation
Abstraction, Virtual
Consolidation, Federation
Purpose Database to Database Application to Application Database to Application
Agility* Weeks, Months Minutes, Hours Hours, Days
Repository Warehouse / Lake Transactional System Semantic Layer
Run Time* Typically Scheduled Event Driven Typically OnDemand
Data Warehouse vs Data Lakes
Data Warehouse Data Lake
• Schema on write
• (early binding)
• BI and analysts
• Arguably better governed.
• MPP / SMP databases
• Schema on read
• (Late binding)
• Data scientists
• Arguably more flexible
• MapReduce et al.
Data Warehouse vs Data Lakes
Data Warehouse Data Lake
Metro Uber
The Problems
• ACID
• Atomicity
• Concurrency
• Isolation
• Durability
• BASE
• Basic Availability
• Soft States
• Eventual Consistency
• CAP Theorem (Consistency vs Availability vs Partitions)
• CQRS (Command Query Response Segregation)
[Distributed Cache]
Metrics &
Analytics
Complex
Alerts
Scheduled
Feeds
Adhoc
Queries
Detail
Data
Logical Unified
Data Model
Subject Area
Logical Data
Models
Enterprise Data Fabric Cluster
Unified queries
Business &
Exceptions Rules
Normalisation and
Standardisation Rules
Top Down Data
Quality (and profiling)
Connectors (JDBC,
APIs, ODBC, etc)
Schedulers and
Refresh Jobs Rules
Query Planners &
Optimiser(s)
Data Fabric Architecture
• Schema Store
• Distributed Cache
• Query Federation
• Query Optimisation
• Semantic Normalisation
Foreign Data Wrappers
One database to rule them all,
One database to find them,
One database to bring them all,
And in a wrapper bind them.
Foreign Data Wrappers
• Uses the standard compliant SQL/MED
• *Data Type Translations (SQL, NoSQL etc)
• *Push Down Predicates
• WHERE and ORDER BY are propagated
• Required COLUMNS
• *Supports CRUD
• *Two Way Joins
• Import Foreign Schemas
*May vary based on specific wrapper
(Reflections) the PostgreSQL way
TABLE_ALPHA
A
Database
Cluster
Shared
Buffer
Pool TABLE_ALPHA
A
BEGIN;
INSERT INTO
TABLE_ALPHA
VALUES (‘B’);
COMMIT;
TABLE_ALPHA
A B
Dirty
Page(s)
Without WAL Logs
OS Cache
Use WAL Logs for Cache Synchronisation
TABLE_ALPHA
A
TABLE_ALPHA
A
TABLE_ALPHA
A B
Dirty
Page(s)
BEGIN;
INSERT INTO
TABLE_ALPHA
VALUES (‘B’);
COMMIT;
WAL
Checkpoints
write back to
disk
pg_prewarm()
Loads OS and
buffer cache(s)
Distributed Cache with PostgreSQL
Disk
Cache
Disk
Cache
FDW
Cache
FDW
Cache
Distributed PG Cluster with
Federated queries loaded
directed to Cache
BI APP DS MSG
Disk
Distributed Cache
Disk FDW FDW
Distributed PG Cluster
integrated With Distributed
Cache Technologies
BI APP DS MSG
JDBC, pgmemcache
Apache Ignite
Dremio
Terracotta
memcached
Some Technologies
Key PostgreSQL Features for Fabrics
Schema Store
Distributed Cache
Query Federation
Query Optimisation
Semantic Normalisation
Dirty Pages
pg_prewarm()
pgmemcache
Persistence WAL
Database
CTEs,
Least Cost Routing
ORM + FRM
Foreign Data Wrappers
Key Takeaways
• Cloud migrations
• Jurisdictions (privacy and availability zones)
• Computing at the edge of the network.
• Data Virtualisation
• Rights to be Forgotten (GDPR)
• Query Lineage & Audit across ALL data
• Areas for future development.
HTAP OLAP OLTP

HTAP Queries

  • 1.
    HTAP Queries &Data Fabrics Atif Rahman @mantaq10 7th December, 2018 PostgreSQL Down Under Melbourne, Australia
  • 2.
    The agenda OLTP vsOLAP vs HTAP The Problem Statement Data Fabrics Key PostgreSQL Features Foreign Data Wrappers Distributed Cache Background Patterns Components
  • 3.
    OLTP vs OLAPvs HTAP OLTP OLAP HTAP Smaller transactions Lots of them Lots of Updates ACID (Transactional) CRUD (Commands) complex queries Large working sets Bulk loads & offloads INDEXING AGGREGATIONS Mixed workloads on the same system Analytics on ‘inflight’ transactional data Addresses resource contention FEDERATION SYNCHRONISATION
  • 4.
    Hybrid Transactional /Analytical Processing Data Organisation Row wise Writes OLTP Column Wise Reads OLAP OLTP OLAP Analytics Latency Data Redundancy Single platform for multimodel HTAP HTAP
  • 6.
    Data & ApplicationIntegration Data Integration (ETL/ELT) Application Integration (ESB / API) Data Virtualisation (DV) Low Fidelity View Integration Type Physical Movement and Consolidation Synchronization and Propagation Abstraction, Virtual Consolidation, Federation Purpose Database to Database Application to Application Database to Application Agility* Weeks, Months Minutes, Hours Hours, Days Repository Warehouse / Lake Transactional System Semantic Layer Run Time* Typically Scheduled Event Driven Typically OnDemand
  • 7.
    Data Warehouse vsData Lakes Data Warehouse Data Lake • Schema on write • (early binding) • BI and analysts • Arguably better governed. • MPP / SMP databases • Schema on read • (Late binding) • Data scientists • Arguably more flexible • MapReduce et al.
  • 9.
    Data Warehouse vsData Lakes Data Warehouse Data Lake Metro Uber
  • 12.
    The Problems • ACID •Atomicity • Concurrency • Isolation • Durability • BASE • Basic Availability • Soft States • Eventual Consistency • CAP Theorem (Consistency vs Availability vs Partitions) • CQRS (Command Query Response Segregation)
  • 13.
    [Distributed Cache] Metrics & Analytics Complex Alerts Scheduled Feeds Adhoc Queries Detail Data LogicalUnified Data Model Subject Area Logical Data Models Enterprise Data Fabric Cluster Unified queries Business & Exceptions Rules Normalisation and Standardisation Rules Top Down Data Quality (and profiling) Connectors (JDBC, APIs, ODBC, etc) Schedulers and Refresh Jobs Rules Query Planners & Optimiser(s) Data Fabric Architecture • Schema Store • Distributed Cache • Query Federation • Query Optimisation • Semantic Normalisation
  • 14.
    Foreign Data Wrappers Onedatabase to rule them all, One database to find them, One database to bring them all, And in a wrapper bind them.
  • 15.
    Foreign Data Wrappers •Uses the standard compliant SQL/MED • *Data Type Translations (SQL, NoSQL etc) • *Push Down Predicates • WHERE and ORDER BY are propagated • Required COLUMNS • *Supports CRUD • *Two Way Joins • Import Foreign Schemas *May vary based on specific wrapper
  • 16.
    (Reflections) the PostgreSQLway TABLE_ALPHA A Database Cluster Shared Buffer Pool TABLE_ALPHA A BEGIN; INSERT INTO TABLE_ALPHA VALUES (‘B’); COMMIT; TABLE_ALPHA A B Dirty Page(s) Without WAL Logs OS Cache Use WAL Logs for Cache Synchronisation TABLE_ALPHA A TABLE_ALPHA A TABLE_ALPHA A B Dirty Page(s) BEGIN; INSERT INTO TABLE_ALPHA VALUES (‘B’); COMMIT; WAL Checkpoints write back to disk pg_prewarm() Loads OS and buffer cache(s)
  • 17.
    Distributed Cache withPostgreSQL Disk Cache Disk Cache FDW Cache FDW Cache Distributed PG Cluster with Federated queries loaded directed to Cache BI APP DS MSG Disk Distributed Cache Disk FDW FDW Distributed PG Cluster integrated With Distributed Cache Technologies BI APP DS MSG JDBC, pgmemcache Apache Ignite Dremio Terracotta memcached Some Technologies
  • 18.
    Key PostgreSQL Featuresfor Fabrics Schema Store Distributed Cache Query Federation Query Optimisation Semantic Normalisation Dirty Pages pg_prewarm() pgmemcache Persistence WAL Database CTEs, Least Cost Routing ORM + FRM Foreign Data Wrappers
  • 19.
    Key Takeaways • Cloudmigrations • Jurisdictions (privacy and availability zones) • Computing at the edge of the network. • Data Virtualisation • Rights to be Forgotten (GDPR) • Query Lineage & Audit across ALL data • Areas for future development.
  • 20.

Editor's Notes

  • #3 Anyone heard of HTAP before? Big Data
  • #4 They freak out when you try to run batch queries. Finicky about offload windows They Typically have huge backlogs Emerging pattern. Atomic All operations in a transaction succeed or every operation is rolled back. Consistent On the completion of a transaction, the database is structurally sound. Isolated Transactions do not contend with one another. Contentious access to data is moderated by the database so that transactions appear to run sequentially. Durable The results of applying a transaction are permanent, even in the presence of failures. asic Availability The database appears to work most of the time. Soft-state Stores don’t have to be write-consistent, nor do different replicas have to be mutually consistent all the time. Eventual consistency Stores exhibit consistency at some later point (e.g., lazily at read time).
  • #5 NOT A NEW CONCEPT
  • #6 The scary santa claus!
  • #8 We need both but we didn’t realize that until recently. Late Binding
  • #9 It started as an epic battle! Society was divided…
  • #10 The metro and the uber We need both but we didn’t realize that until recently. Late Binding
  • #11  Issues in CAP theorem BASE is not ACID CQRS is well suited to edge cases, not many instances where it worked for the norms. Other issues in distributed systems remain or have become more specialised problems Two approaches to overcome operational copmexlity: Multimodel databases or tightly-integrated middleware over multiple single-model data stores Caveat: In order for a custom data model to support concurrent updates, the database must be able to synchronize updates across multiple keys. ACID transactions, if they are sufficiently performant, allow such synchronization.[12] JSON documents, graphs, and relational tables can all be implemented in a manner that inherits the horizontal scalability and fault-tolerance of the underlying data store.
  • #13 Caveat: In order for a custom data model to support concurrent updates, the database must be able to synchronize updates across multiple keys. ACID transactions, if they are sufficiently performant, allow such synchronization.[12] JSON documents, graphs, and relational tables can all be implemented in a manner that inherits the horizontal scalability and fault-tolerance of the underlying data store.
  • #15 OLAP systems are great are low volume transactions but typically complex and long running queries with larger working sets Differentiated between adhoc queries vs operational queries. Massively parallel processing engines came out Spins off like Vertica, Greenplum, Netezza and Teradata all were based on PostgreSQL at some point. Architectural drivers Reliability Short queries but very frequent Transaction Support is critical (ACID) Security Typically lots of ad-hoc writes Not great for complex and bulk reads Resource contention Indexes and other data structures. Data is typically in 3NF kind of models Although with PostgreSQL, we know this is not true. JSON, etc.
  • #16 OLAP systems are great are low volume transactions but typically complex and long running queries with larger working sets Differentiated between adhoc queries vs operational queries. Massively parallel processing engines came out Spins off like Vertica, Greenplum, Netezza and Teradata all were based on PostgreSQL at some point. Architectural drivers Reliability Short queries but very frequent Transaction Support is critical (ACID) Security Typically lots of ad-hoc writes Not great for complex and bulk reads Resource contention Indexes and other data structures. Data is typically in 3NF kind of models Although with PostgreSQL, we know this is not true. JSON, etc.
  • #17 OLAP systems are great are low volume transactions but typically complex and long running queries with larger working sets Differentiated between adhoc queries vs operational queries. Massively parallel processing engines came out Spins off like Vertica, Greenplum, Netezza and Teradata all were based on PostgreSQL at some point. Architectural drivers Reliability Short queries but very frequent Transaction Support is critical (ACID) Security Typically lots of ad-hoc writes Not great for complex and bulk reads Resource contention Indexes and other data structures. Data is typically in 3NF kind of models Although with PostgreSQL, we know this is not true. JSON, etc.
  • #20 OLAP systems are great are low volume transactions but typically complex and long running queries with larger working sets Differentiated between adhoc queries vs operational queries. Massively parallel processing engines came out Spins off like Vertica, Greenplum, Netezza and Teradata all were based on PostgreSQL at some point. Architectural drivers Reliability Short queries but very frequent Transaction Support is critical (ACID) Security Typically lots of ad-hoc writes Not great for complex and bulk reads Resource contention Indexes and other data structures. Data is typically in 3NF kind of models Although with PostgreSQL, we know this is not true. JSON, etc.