CrateDB & PostgreSQL
OldSQL to NewSQL
11th July 2017
@claus__m
About
~2yrs at Crate.io
DevRel/Field Engineering/Support/
Integrations/…
Speaking
Conferences, meetups, ...
Working with customers
Consulting, pre- and post-sales
Agenda
Failures
What, how, and when?
PostgreSQL
Concept overview
CrateDB
Concept overview
Discussion
NewSQL or not? Benefits and drawbacks.
Use Cases
Wrap up
Failures
Database Failures
Consequences
Data loss
Lost updates, dirty reads, ...
Service interruptions
Services can’t work without their database
Slow performance
Users may lose interest
Pressure
DBAs in the spotlight
What Makes Databases
Fail?
Overloaded
Insufficient hardware (RAM, CPU, disk),
swapping, inefficient queries
Failure
Hardware may fail on many levels: e.g.
Network, disk, RAM
Platform
Configuration errors, updates, resource
sharing, bugs
People
Malicious intent, sloppiness, ...
Assessment
Overview
Concepts and other things
Index and data
How the database creates indices, stores and
retrieves data
Search and scans
How the data is found
Replication and high availability
Distribution and achieving zero downtime
PostgreSQL
Overview
Multi-process System
fork() to clone processes from postmaster to
postgres instances with shared memory
Technology
Written in C, natively compiled
Optimization
Cost-based optimizer
Transactional
ACID compliant
Index And Data
Tree-based
A B-Tree cached in memory, created with CREATE
INDEX or via constraints in CREATE TABLE or
ALTER TABLE
In Memory & On Disk
8K data pages in shared buffer cache and on
disk
Item Pointers
Only major changes are reflected in the index
(e.g. INSERT/DELETE)
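A minimal sketch of the idea above, not PostgreSQL's actual implementation: a sorted list stands in for the leaf level of a B-Tree, and each entry carries an item pointer (page number, offset) into the data pages. The names `heap_pages`, `build_index` and `index_lookup` are made up for illustration.

```python
import bisect

# Heap: page number -> list of rows (a stand-in for 8K data pages).
heap_pages = {
    0: [("alice", 31), ("carol", 27)],
    1: [("bob", 45), ("dave", 27)],
}

def build_index(pages, column):
    """Return sorted (key, (page_no, offset)) entries, like B-Tree leaves."""
    entries = []
    for page_no, rows in pages.items():
        for offset, row in enumerate(rows):
            entries.append((row[column], (page_no, offset)))
    entries.sort()
    return entries

def index_lookup(index, pages, key):
    """Binary-search the index, then follow item pointers into the heap."""
    keys = [k for k, _ in index]
    lo = bisect.bisect_left(keys, key)
    hi = bisect.bisect_right(keys, key)
    return [pages[page_no][offset] for _, (page_no, offset) in index[lo:hi]]

# Index on the second column (age).
age_index = build_index(heap_pages, column=1)
```

Calling `index_lookup(age_index, heap_pages, 27)` touches only the two pointed-to rows instead of every page.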
http://use-the-index-luke.com/sql/anatomy/the-tree
Searches And Scans
Sequential
Read every block and evaluate the predicate on
each row
Index-based
Find something using an index on that column,
or a full index scan
Bitmap-based
Mark matching rows in a bitmap and combine
bitmaps for boolean (AND/OR) conditions
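The three access paths can be compared in a small sketch. This is a toy model under simplifying assumptions (a table is a Python list, an index is a dict from value to row positions), and all names are hypothetical.

```python
rows = [
    {"id": 1, "city": "Berlin", "year": 2016},
    {"id": 2, "city": "Vienna", "year": 2017},
    {"id": 3, "city": "Berlin", "year": 2017},
    {"id": 4, "city": "Dornbirn", "year": 2017},
]

def seq_scan(table, predicate):
    """Sequential scan: evaluate the predicate on every row."""
    return [row for row in table if predicate(row)]

def build_index(table, column):
    """Toy index: column value -> list of row positions."""
    index = {}
    for pos, row in enumerate(table):
        index.setdefault(row[column], []).append(pos)
    return index

def index_scan(table, index, value):
    """Index scan: visit only the positions the index points to."""
    return [table[pos] for pos in index.get(value, [])]

def bitmap(index, value, size):
    """Mark every matching row position in a bitmap."""
    bits = [False] * size
    for pos in index.get(value, []):
        bits[pos] = True
    return bits

def bitmap_and_scan(table, bitmaps):
    """AND the bitmaps, then fetch only rows marked in all of them."""
    combined = [all(bits) for bits in zip(*bitmaps)]
    return [row for row, hit in zip(table, combined) if hit]

city_index = build_index(rows, "city")
year_index = build_index(rows, "year")
```

Both `seq_scan` with a combined predicate and `bitmap_and_scan` over the two bitmaps return the same rows; the difference is how many rows have to be inspected.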
Replication And
High Availability
Disk based
By sharing a disk or continuously cloning a disk
Log-shipping
Send the write-ahead-log to the standby server,
which can answer reads
Master/Master
Sends rows to the other master, can answer
reads and writes, locks rows/tables
Client-sharding
Shard the data on a client/proxy and route
accordingly
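Client-side sharding could look roughly like the sketch below: the client hashes a key and picks a node from a fixed list. The host names and the hash-modulo scheme are assumptions for illustration; real setups often use a proxy or consistent hashing instead.

```python
import hashlib

# Hypothetical PostgreSQL hosts, one per shard.
SHARD_NODES = ["pg-node-0", "pg-node-1", "pg-node-2"]

def shard_for(key, num_shards):
    """Map a key to a shard number with a hash that is stable across runs."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def route(key, nodes=SHARD_NODES):
    """Return the node that should hold the row for this key."""
    return nodes[shard_for(key, len(nodes))]
```

With plain modulo hashing, changing the number of nodes remaps most keys, which is one reason resharding is painful with this approach.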
CrateDB
Overview
Multi-threaded System
Thread-pools to read/write Lucene segments
Technology
Java/JVM based
Optimization
Naive, query-level optimization
Eventually Consistent
Atomic operations per row, optimistic
concurrency only
Distributed By Default
Transparent partitioning and sharding
Index And Data
Inverted index
Term dictionary where field values point to
rows (posting list)
Field cache
“Inverted inverted index”, column names point
to the possible values and their rows
On disk, cached in memory
Immutable segments on disk, binary search in
each segment, cached with mmap() into RAM
pages
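A rough sketch of an inverted index, assuming whitespace tokenization and small integer row ids. It is not Lucene's data structure, only the same idea: each field keeps a term dictionary whose entries point to a posting list of rows. The sample documents are invented.

```python
from collections import defaultdict

documents = {
    0: {"name": "crate", "tags": "sql distributed"},
    1: {"name": "postgres", "tags": "sql acid"},
    2: {"name": "lucene", "tags": "search"},
}

def build_inverted_index(docs):
    """Build field -> term -> posting list (row ids in ascending order)."""
    index = defaultdict(lambda: defaultdict(list))
    for row_id in sorted(docs):
        for field, text in docs[row_id].items():
            # set() ensures each row appears once per term.
            for term in set(text.split()):
                index[field][term].append(row_id)
    return index

def search(index, field, term):
    """Look up the posting list for a term; empty if the term is unknown."""
    return index.get(field, {}).get(term, [])

inverted = build_inverted_index(documents)
```

A lookup such as `search(inverted, "tags", "sql")` returns the matching row ids without scanning the documents.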
Example Posting List
Index And Data
Shards
Compounds of multiple immutable segments,
merged occasionally
Rows are documents, columns are fields
Vector space model to weight and score
searches (_score field)
Multi-threaded index access
Shards are multiple segments, each is read
with a thread
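The shard model above can be sketched as follows, under the assumption that a segment is just an immutable sorted tuple of terms. Each segment is searched with a binary search on its own thread, and a merge produces one new segment instead of modifying the old ones. Function names are hypothetical.

```python
import bisect
from concurrent.futures import ThreadPoolExecutor

def make_segment(terms):
    """An immutable, sorted segment."""
    return tuple(sorted(terms))

def segment_contains(segment, term):
    """Binary search inside a single segment."""
    pos = bisect.bisect_left(segment, term)
    return pos < len(segment) and segment[pos] == term

def shard_contains(segments, term):
    """Search all segments of a shard, one worker thread per segment."""
    with ThreadPoolExecutor(max_workers=max(1, len(segments))) as pool:
        return any(pool.map(lambda seg: segment_contains(seg, term), segments))

def merge_segments(segments):
    """Merge several segments into one new segment; inputs stay untouched."""
    merged = set()
    for seg in segments:
        merged.update(seg)
    return make_segment(merged)

shard = [make_segment(["sql", "acid"]), make_segment(["shard", "lucene"])]
```

Because segments never change after they are written, readers need no locks; the cost is the occasional merge.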
Replication And
High Availability
Shared nothing architecture
Every node handles every task
Shard-based
Replicas are copies of shards that are
distributed in the cluster evenly
Consistency
Elected leader maintains and distributes a
consistent cluster state
CAP
Tunable consistency with synchronous inserts
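To illustrate shard-based replication, here is a simplified allocation sketch. It assumes a round-robin placement and is not CrateDB's actual allocator; it only shows the property that a primary and its replicas land on different nodes and that copies are spread evenly.

```python
def allocate(num_shards, num_replicas, nodes):
    """Place each shard's primary and replicas on distinct nodes, round-robin."""
    if num_replicas + 1 > len(nodes):
        raise ValueError("need at least one node per copy of a shard")
    placement = {node: [] for node in nodes}
    for shard in range(num_shards):
        for copy in range(num_replicas + 1):
            # Shift by the copy number so copies of one shard never share a node.
            node = nodes[(shard + copy) % len(nodes)]
            role = "primary" if copy == 0 else "replica"
            placement[node].append((shard, role))
    return placement

# Hypothetical three-node cluster, three shards, one replica each.
layout = allocate(num_shards=3, num_replicas=1, nodes=["n1", "n2", "n3"])
```

In this layout every node holds two shard copies, so losing any single node still leaves one copy of each shard available.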
Discussion
PostgreSQL: Strengths
Single-Node-Performance
Predictable and fast
SQL Sophistication
Lots of features, many of them heavily
optimized
Transactions
ACID compliance, concurrency control
PostgreSQL: Weaknesses
Distribution
High availability or working with huge data sets
requires 3rd party software, partitioning
Ingest speed
ACID compliance slows down inserts
Operational Complexity/DevOps Readiness
Highly controllable features make it hard to
manage
Schema Flexibility
Schema evolution management required
CrateDB: Strengths
Distribution
Distributed by nature, with tunable consistency
Ingest speed
Solid insert speeds with bulk inserts
Operational Complexity/DevOps Readiness
High flexibility, containerization, sane defaults
Schema Flexibility
Schema evolution on the fly
Built-in Search
Fulltext capabilities
CrateDB: Weaknesses
Single-Node-Performance
Distribution overhead requires a certain cluster
size to be efficient
SQL Features
Many features are still missing or hard to do in
a distributed system
Transactions
No ACID compliance, eventual
consistency/optimistic concurrency requires
client-side handling
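What "client-side handling" of optimistic concurrency can mean is sketched below with an in-memory store. The `Store` class, its version counter and `update_with_retry` are hypothetical stand-ins, not a CrateDB client API: a write succeeds only if the row version is still the one the client read, otherwise the client re-reads and retries.

```python
class VersionConflict(Exception):
    """Raised when the row changed since the client read it."""

class Store:
    def __init__(self):
        self._rows = {}  # key -> (value, version)

    def read(self, key):
        """Return (value, version); a missing row has version 0."""
        return self._rows.get(key, (None, 0))

    def write(self, key, value, expected_version):
        """Write only if nobody else has bumped the version in between."""
        _, current = self._rows.get(key, (None, 0))
        if current != expected_version:
            raise VersionConflict(key)
        self._rows[key] = (value, current + 1)

def update_with_retry(store, key, change, max_attempts=5):
    """Read, modify and write back; on a conflict, start over."""
    for _ in range(max_attempts):
        value, version = store.read(key)
        try:
            store.write(key, change(value), expected_version=version)
            return True
        except VersionConflict:
            continue
    return False
```

The retry loop is the part a transactional database would otherwise hide from the application.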
Use Cases
Use Cases: PostgreSQL
ORMs
Broad integration in various object-relational
mappers in frameworks (Hibernate, …)
Transaction-based workloads
Single, high-value transactions
Extensive SQL compliance
Required support for views, stored procedures,
…
Small data sets
Hundreds of MBs to several GB
Use Cases: CrateDB
DevOps
Flexible schemas, ad-hoc queries, easy
maintenance
Analytics, machine learning
Large scale inserts/queries, high concurrency,
SQL
Fulltext search
Built-in tools for text-mining/analysis, built on
the de-facto standard of search
Thanks!
Links
https://github.com/crate
https://crate.io
Follow us on Twitter
@crateio @claus__m
Next webinar: Scale your SQL database
with Docker, 27th July
Q & A
