What is the state of the art of high performance, distributed databases as we head into 2022, and which options are best suited for your own development projects?
The data-intensive applications leading this next tech cycle are typically powered by multiple types of databases and data stores — each satisfying specific needs and often interacting with a broader data ecosystem. Even the very notion of “a database” is evolving as new hardware architectures and methodologies allow for ever-greater capabilities and expectations for horizontal and vertical scalability, performance, and reliability.
In this webinar, ScyllaDB Director of Technical Advocacy Peter Corless will survey the current landscape of distributed database systems and highlight new directions in the industry.
This talk will cover different database and database-adjacent technologies as well as describe their appropriate use cases, patterns and antipatterns with a focus on:
- Distributed SQL, NewSQL and NoSQL
- In-memory datastores and caches
- Streaming technologies with persistent data storage
2. Peter Corless
Director of Technical Advocacy
ScyllaDB
+ Listen to & share user stories
+ Write blogs & case studies
+ Play (and design) strategy & roleplaying games
3. 3
Distributed Database Landscape 2021
SQL
+ Distributed SQL
+ NewSQL
NoSQL
+ Key-value
+ Document
+ Wide-column
+ Graph
Multi-model
+ SQL + NoSQL
+ Multiple NoSQL
Production Environments
+ On-premises
+ Co-location
+ Public cloud
+ Private cloud
+ Hybrid cloud
+ Multicloud
+ Edge
+ IoT / Embedded
Business / Use Models
+ Open Source License
+ Enterprise License
+ OEM License
+ Service Agreements
Use Cases
+ OLTP
+ OLAP
+ HTAP
+ Time Series
4. 4
This Next Tech Cycle
The wave of innovation we’re currently riding.
7. 7
Hardware Still Vertically Scaling
+ Compute
  + From >100 cores to >1,000 cores per server
  + From multicore CPUs → full System on a Chip (SoC) designs (CPU, GPU, Cache, Memory)
+ Memory
  + Terabyte-scale RAM per server
  + DDR5: 4600 MHz in 2020, 8000 MHz by 2024
  + DDR6: 9600 MHz by 2025
  + Persistent memory: memory mode
+ Storage
  + Petabyte-scale storage per server
  + NVMe 2.0 [2021]: separation of base and transport
  + Persistent memory: app direct (storage) mode
8. 8
+ Agile [c. 2000]
+ CI/CD = CI [1991] + CD [2009]
+ DevOps [2009]
+ Chaos Monkey [2011]
+ Kubernetes [2014]
+ GitOps [2017]
+ DevSecOps [2018]
Methodologies Still Evolving
How It Started
How It’s Going
How It Evolved
10. 10
Poll Question
How much data do you have under management in your own transactional database systems?
+ <1 terabyte
+ 1 to 50 terabytes
+ 50-100 terabytes
+ >100 terabytes
12. 12
DB-Engines.com
+ 381 databases
+ Some are distributed databases
+ Others are not distributed databases
+ Some are SQL
+ Some are NoSQL
+ Some support both SQL + NoSQL
+ Some support multiple NoSQL types
+ Some are… not easily classifiable
+ A huge industry with some well-known names
+ But popularity (by itself) ≠ fitness for use for your use case
13. 13
Top 100 Databases
+ Narrowing field helps scope analysis
+ Still results in wide variety of databases
+ Many SQL
+ Many NoSQL
+ ScyllaDB is in the Top 100!
14. 14
Top 100 Databases (and Database-like systems) on DB-Engines.com [as of November 2021]
+ 49 SQL
+ 32 NoSQL
+ 5 Both SQL + NoSQL
+ 5 Search Engines
+ 6 Time Series
+ 3 Others
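As a quick check on the breakdown above, the category counts can be tallied. This is a minimal sketch using only the numbers from the slide; the dictionary name and the two derived totals are illustrative and not part of the original deck.

```python
from typing import Dict

# Category counts copied from the slide (DB-Engines.com Top 100, November 2021).
top_100: Dict[str, int] = {
    "SQL": 49,
    "NoSQL": 32,
    "Both SQL + NoSQL": 5,
    "Search Engines": 5,
    "Time Series": 6,
    "Others": 3,
}

# The categories should account for every entry in the Top 100.
total = sum(top_100.values())

# Systems that speak SQL at all, and systems that offer a NoSQL model at all.
supports_sql = top_100["SQL"] + top_100["Both SQL + NoSQL"]
supports_nosql = top_100["NoSQL"] + top_100["Both SQL + NoSQL"]

print(f"total={total}, SQL-capable={supports_sql}, NoSQL-capable={supports_nosql}")
```

Counting the systems that support both models, SQL-capable systems make up just over half of the list, while the search, time series and other categories account for the remainder.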
17. 17
+ Clustering & Distribution Strategies
+ Local clustering: multiple nodes in the same datacenter share updates
+ Cross-cluster updates: multiple clusters can share data between them
+ Multi-datacenter clustering: geographically, even globally dispersed, but same logical cluster
+ Node Roles, High Availability & Failover Strategies
+ Primary-replica (Active-passive; writes to primary only; read-only replicas; “hot standby” modes)
+ Peer-to-peer, leaderless (Active-Active, multi primaries; can write to any replica; no SPOF)
+ Load balancing (client side or service in front of database)
+ Data Replication & Sharding Strategies
+ Replication Factors & Consistency Levels
+ Horizontal Scalability: Manual vs. Auto-sharding
+ Topology Awareness: Rack-awareness, Datacenter-awareness
What do you mean by a “Distributed Database”?
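The "Replication Factors & Consistency Levels" item can be made concrete with the usual quorum arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes a simple model in which a write waits for W replica acknowledgements and a read consults R replicas out of RF copies. It does not describe how any specific database in this deck implements consistency.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of replicas: floor(RF / 2) + 1."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor // 2 + 1


def read_overlaps_write(replication_factor: int, write_acks: int, read_acks: int) -> bool:
    """True when every read set must intersect every write set (R + W > RF).

    When this holds, at least one replica consulted by a read has
    acknowledged the most recent successful write.
    """
    return write_acks + read_acks > replication_factor


if __name__ == "__main__":
    rf = 3
    q = quorum(rf)
    # Quorum writes plus quorum reads overlap on at least one replica.
    print(rf, q, read_overlaps_write(rf, q, q))
    # Writing to one replica and reading from one replica does not.
    print(rf, 1, read_overlaps_write(rf, 1, 1))
```

With RF = 3, a quorum is 2 replicas, so quorum reads and quorum writes always overlap, while single-replica reads and writes trade that guarantee for lower latency.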
18. 18
The Short List: Systems of Interest
SQL + NewSQL
+ PostgreSQL
+ CockroachDB
NoSQL
+ MongoDB
+ Redis
+ ScyllaDB
19. 19
PostgreSQL — distributed SQL
+ Clustering & Distribution Strategies
+ Local clustering: multiple nodes in the same datacenter share updates
+ Cross-cluster updates: multiple clusters can share data between them
+ Multi-datacenter clustering: geographically, even globally dispersed, but same logical cluster
+ Node Roles, High Availability & Failover Strategies
+ Primary-replica (Active-passive; writes to primary only; read-only replicas; “hot standby” modes)
+ Peer-to-peer, leaderless (Active-Active, multi primaries; can write to any replica; no SPOF)
+ Load balancing (client side or service in front of database)
+ Data Replication & Sharding Strategies
+ Replication Factors & Consistency Levels
+ Horizontal Scalability: Manual Sharding vs. Auto-sharding
+ Topology Awareness: Rack-awareness, Datacenter-awareness
Part of base offering
Can be added, but not part of base
20. 20
CockroachDB — NewSQL
+ Clustering & Distribution Strategies
+ Local clustering: multiple nodes in the same datacenter share updates
+ Cross-cluster updates: multiple clusters can share data between them
+ Multi-datacenter clustering: geographically, even globally dispersed, but same logical cluster
+ Node Roles, High Availability & Failover Strategies
+ Primary-replica (Active-passive; writes to primary only; read-only replicas; “hot standby” modes)
+ Peer-to-peer, leaderless (Active-Active, multi primaries; can write to any replica; no SPOF)
+ Load balancing (client side or service in front of database)
+ Data Replication & Sharding Strategies
+ Replication Factors & Consistency Levels
+ Horizontal Scalability: Manual vs. Auto-sharding
+ Topology Awareness: Rack-awareness*, Datacenter-awareness
* Can be manually configured using localities
Part of base offering
Can be added, but not part of base
21. 21
+ Clustering & Distribution Strategies
+ Local clustering: multiple nodes in the same datacenter share updates
+ Cross-cluster updates: multiple clusters can share data between them
+ Multi-datacenter clustering: geographically, even globally dispersed, but same logical cluster
+ Node Roles, High Availability & Failover Strategies
+ Primary-replica (Active-passive; writes to primary only; read-only replicas; “hot standby” modes)
+ Peer-to-peer, leaderless (Active-Active, multi primaries; can write to any replica; no SPOF)
+ Load balancing (client side or service in front of database)
+ Data Replication & Sharding Strategies
+ Replication Factors & Consistency Levels
+ Horizontal Scalability: Manual vs. Auto-sharding
+ Topology Awareness: Rack-awareness, Datacenter-awareness
MongoDB — the leading document store
Part of base offering
Can be added, but not part of base
22. 22
+ Clustering & Distribution Strategies
+ Local clustering: multiple nodes in the same datacenter share updates
+ Cross-cluster updates: multiple clusters can share data between them
+ Multi-datacenter clustering: geographically, even globally dispersed, but same logical cluster*
+ Node Roles, High Availability & Failover Strategies
+ Primary-replica (Active-passive; writes to primary only; read-only replicas; “hot standby” modes)
+ Peer-to-peer, leaderless (Active-Active, multi primaries; can write to any replica; no SPOF)*
+ Load balancing (client side or service in front of database)
+ Data Replication & Sharding Strategies
+ Replication Factors & Consistency Levels (e.g., strong locally; causal consistency in active-active*)
+ Horizontal Scalability: Manual vs. Auto-sharding
+ Topology Awareness: Rack-awareness, Datacenter-awareness
Redis — key-value in-memory DB/cache
* Redis Enterprise feature
Part of base offering
Can be added, but not part of base
23. 23
+ Clustering & Distribution Strategies
+ Local clustering: multiple nodes in the same datacenter share updates
+ Cross-cluster updates: multiple clusters can share data between them
+ Multi-datacenter clustering: geographically, even globally dispersed, but same logical cluster
+ Node Roles, High Availability & Failover Strategies
+ Primary-replica (Active-passive; writes to primary only; read-only replicas; “hot standby” modes)
+ Peer-to-peer, leaderless (Active-Active, multi primaries; can write to any replica; no SPOF)
+ Load balancing (client side or service in front of database*)
+ Data Replication & Sharding Strategies
+ Replication Factors & Consistency Levels
+ Horizontal Scalability: Manual vs. Auto-sharding
+ Topology Awareness: Rack-awareness, Datacenter-awareness
ScyllaDB
Part of base offering
* For DynamoDB-compatible API
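The "Auto-sharding" and "Rack-awareness" rows compared across the slides above can be illustrated with a small hash-ring sketch. Everything here is an assumption made for illustration: the `Ring` class, the `token` helper, the node and rack names, and the placement policy are hypothetical and are not the algorithm of PostgreSQL, CockroachDB, MongoDB, Redis or ScyllaDB. The idea shown is only the general one: keys are hashed onto a ring so no one has to assign shards by hand, and replicas are spread across racks where possible.

```python
import bisect
import hashlib
from typing import Dict, List, Tuple


def token(value: str) -> int:
    """Map a string to a stable 64-bit position on the ring."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")


class Ring:
    """Toy consistent-hash ring with rack-aware replica placement."""

    def __init__(self, node_racks: Dict[str, str], vnodes: int = 16) -> None:
        self.node_racks = dict(node_racks)
        # Each node owns several virtual positions so data spreads evenly.
        self._ring: List[Tuple[int, str]] = sorted(
            (token(f"{node}#{i}"), node)
            for node in self.node_racks
            for i in range(vnodes)
        )
        self._tokens = [t for t, _ in self._ring]

    def replicas(self, key: str, replication_factor: int) -> List[str]:
        if not 1 <= replication_factor <= len(self.node_racks):
            raise ValueError("replication factor must be between 1 and the node count")
        # Walk clockwise from the key's position, keeping distinct nodes in order.
        start = bisect.bisect_right(self._tokens, token(key)) % len(self._ring)
        walk = [self._ring[(start + i) % len(self._ring)][1] for i in range(len(self._ring))]
        ordered = list(dict.fromkeys(walk))
        chosen: List[str] = []
        racks_used = set()
        # First pass: take at most one node per rack.
        for node in ordered:
            if len(chosen) == replication_factor:
                break
            if self.node_racks[node] not in racks_used:
                chosen.append(node)
                racks_used.add(self.node_racks[node])
        # Second pass: if there are fewer racks than replicas, fill from remaining nodes.
        for node in ordered:
            if len(chosen) == replication_factor:
                break
            if node not in chosen:
                chosen.append(node)
        return chosen


# Hypothetical six-node cluster spread over three racks.
ring = Ring({
    "node1": "rack-a", "node2": "rack-a",
    "node3": "rack-b", "node4": "rack-b",
    "node5": "rack-c", "node6": "rack-c",
})
owners = ring.replicas("user:42", replication_factor=3)
print(owners, [ring.node_racks[n] for n in owners])
```

With manual sharding, the mapping from key to node would instead live in application code or an operator-maintained table; the ring makes that mapping a pure function of the key and the cluster membership.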
26. 26
The Trend for SQL
+ Google Trends interest in “SQL” is at 25% of its 2004 level
+ Book citations for “SQL” peaked in 2008 and were down to 28% of that rate by 2019
+ Back to 1994 levels of interest, basically
+ Still dwarfs other database terms like “NoSQL” or “NewSQL” or “RDBMS”
+ No single term or technology sums up the distributed database market anymore
27. 27
+ Cambrian Explosion will Continue — “What is a database anyway?”
+ Distributed Databases of all kinds
+ Distributed Streaming — “Kafka as a database?” (kSQL says “Yes!”)
+ Distributed Ledgers — “Blockchains/DAGs as a database?”
+ Further fragmentation of the market
+ NoSQL and SQL are increasingly blending
+ Evolution of NoSQL back to SQL assumptions
+ Adding back Strong Consistency, Schema Constraints, Strict Typing
Where are Distributed Databases Going?
28. 28
+ Elasticity — Faster provisioning/decommissioning, autoscaling
+ Uncoupling Compute from Storage — Tiered Storage, Plug-in Storage
+ Data over Time
+ Built for Event Streaming, Time Series
+ Data over Space
+ Geospatial queries, Geoindexing
+ Geographic / political boundaries: GDPR, data localization regulatory compliance
Further Trends in Distributed Databases
29. 29
+ Increasing Focus on Developer Enablement and Developer Experience (DX)
+ APIs for extensibility: extensions, plugins, modules, add-ons, integration layers
+ Database Specific: PostgreSQL extensions, Redis modules
+ Cross-industry: GraphQL, OpenAPI (Swagger), etc.
+ AI/ML integration and incorporation into databases
+ “Building models where your data resides” — Martin Heller (Apr 2021)
+ Amazon Redshift ML
+ BigQuery ML
+ Oracle, Db2, Microsoft SQL Server
Database as a Development Platform
30. 30
+ Tighter Coupling of Data Engineering + Data Sciences + Operations
+ Repairing rifts of the past decade
+ Bridging huge divides between people and systems
+ From “Data Pipelining” (production-oriented) to...
+ “Data Supply Chains” (consumption-oriented)
+ Like “Software Supply Chain,” but for data and data products.
Data Teaming
31. 31
+ Specializing databases to run in the cloud (and cloud-only)
+ Providing “concierge” services
+ Ecosystem: can integrate into cloud vendor’s (or partners’) offerings
+ Managed for you — at a price
+ Making Open Source databases easier to run on infrastructural level
+ Making self-managed operations simpler
+ Flexibility: can run on premises or in the cloud
+ Self-service model — so long as you have the skillz
We Need Different Kinds of “Easy”
33. 33
Thanks
People who helped educate me
+ Kostja Osipov
+ Serge Leontiev
Disclaimer
Any errors, omissions, misinterpretations, misrepresentations or misunderstandings are purely my own.
Please send suggestions and corrections to peter@scylladb.com
35. United States
2445 Faber St, Suite #200
Palo Alto, CA USA 94303
Israel
Maskit 4
Herzliya, Israel 4673304
www.scylladb.com
@scylladb
Learn NoSQL for free!
university.scylladb.com
@petercorless