SlideShare a Scribd company logo
© DataStax, All Rights Reserved.
Cassandra and the Cloud (2018 edition)
Jonathan Ellis
© DataStax, All Rights Reserved.
“Databases”
© DataStax, All Rights Reserved.
Stuff everyone agrees on
© DataStax, All Rights Reserved.
Stuff (Almost) Everyone Agrees On
1. Eventual Consistency is useful
© DataStax, All Rights Reserved.
Eventual Consistency (CP edition)
© DataStax, All Rights Reserved.
Eventual Consistency (AP edition)
© DataStax, All Rights Reserved.
Default consistency levels
1. Cassandra: Eventual
2. Dynamo: Eventual
3. CosmosDB: Eventual (“Session”)
4. Spanner: ACID (no EC)
© DataStax, All Rights Reserved.
Stuff (Almost) Everyone Agrees On
1. Eventual Consistency is useful
2. Automatic partitioning doesn’t work
© DataStax, All Rights Reserved.
H-Store (2012)
© DataStax, All Rights Reserved.
Partitioning approaches
1. Cassandra: Explicit
2. Dynamo: Explicit
3. CosmosDB: Explicit
4. Spanner: Explicit
© DataStax, All Rights Reserved.
Stuff (Almost) Everyone Agrees On
1. Eventual Consistency is useful
2. Automatic partitioning doesn’t work
3. SQL is a pretty okay query language
© DataStax, All Rights Reserved.
© DataStax, All Rights Reserved.
Query APIs
1. Cassandra: CQL, inspired by SQL
2. DynamoDB: Actually still pretty first-gen NoSQL
3. CosmosDB: “SQL”
4. Spanner: “SQL”
© DataStax, All Rights Reserved.
Stuff (Almost) Everyone Agrees On
1. Eventual Consistency is useful
2. Automatic partitioning doesn’t work
3. SQL is a pretty okay query language
4. … that’s about it
© DataStax, All Rights Reserved.
Thomas Sowell
There are no solutions.
Only tradeoffs.
© DataStax, All Rights Reserved.
Cassandra
© DataStax, All Rights Reserved.
Data Model: tabular, with nested content
CREATE TABLE notifications (
target_user text,
notification_id timeuuid,
source_id uuid,
source_type text,
activity text,
PRIMARY KEY (target_user, notification_id)
)
WITH CLUSTERING ORDER BY (notification_id DESC);
© DataStax, All Rights Reserved.
target_user notification_id source_id source_type activity
nick e1bd2bcb- d972b679- photo tom liked
nick 321998c- d972b679- photo jake commented
nick ea1c5d35- 88a049d5- user mike created
account
nick 5321998c- 64613f27- photo tom commented
nick 07581439- 076eab7e- user tyler created
account
mike 1c34467a- f04e309f- user tom created
account
© DataStax, All Rights Reserved.
Collections
CREATE TABLE users (
id uuid PRIMARY KEY,
name text,
state text,
birth_date int,
email_addresses set<text>
);
© DataStax, All Rights Reserved.
User-defined Types
CREATE TYPE address (
street text,
city text,
zip_code int,
phones set<text>
)
CREATE TABLE users (
id uuid PRIMARY KEY,
name text,
addresses map<text, address>
)
SELECT id, name, addresses.city, addresses.phones FROM users;
id | name | addresses.city | addresses.phones
--------------------+----------------+--------------------------
63bf691f | jbellis | Austin | {'512-4567', '512-9999'}
© DataStax, All Rights Reserved.
JSON
INSERT INTO users JSON
'{"id": "0514e410-",
"name": "jbellis",
"addresses": {"home": {"street": "9920 Cassandra Ave",
"city": "Austin",
"zip_code": 78700,
"phones": ["1238614789"]}}}';
© DataStax, All Rights Reserved.
Without nesting
© DataStax, All Rights Reserved.
With nesting
© DataStax, All Rights Reserved.
Consistency Levels
© DataStax, All Rights Reserved.
Consistency Levels
© DataStax, All Rights Reserved.
Consistency Levels
© DataStax, All Rights Reserved.
Multi-region
1. Synchronous writes locally; async globally
2. Serve reads and writes for any row in any region
3. DR is “free”
4. Client-level support
© DataStax, All Rights Reserved.
Notable features
● Lightweight Transactions (Paxos, expensive)
● Materialized Views
● User-defined functions
● Strict schema
© DataStax, All Rights Reserved.
© DataStax, All Rights Reserved.
© DataStax, All Rights Reserved.
Schema confusion
{"userid": "2452347",
"name": "jbellis",
... }
{"userid": 2452348,
"name": "jshook",
... }
{"user_id": 2452349,
"name": "jlacefield",
... }
© DataStax, All Rights Reserved.
DynamoDB
© DataStax, All Rights Reserved.
CP single partition
● Original Dynamo was AP
● DynamoDB offers Strong/Eventual read consistency
● But
Conditional writes are the same price as regular writes
And: “All write requests are applied in the order in which
they were received”
© DataStax, All Rights Reserved.
Data model
● “Map of maps”
● Primary key, sort key
© DataStax, All Rights Reserved.
Data Model
© DataStax, All Rights Reserved.
Multi-region
● New feature (late 2017): Global tables
● Shards and replicates a table across regions
● Each shard can only be written to by its master region
© DataStax, All Rights Reserved.
Notable features
● Global indexes
● Change feed
● “DynamoDB Transaction Library”
“A put that does not contend with any other
simultaneous puts can be expected to perform 7N + 4
writes as the original operation, where N is the number
of requests in the transaction.”
© DataStax, All Rights Reserved.
Sidebar: cross-partition txns in AP?
© DataStax, All Rights Reserved.
CosmosDB
© DataStax, All Rights Reserved.
Data model:
CP Single Partition
© DataStax, All Rights Reserved.
“Multi model”
● NOT just “APIs”
● More like MyRocks/MongoRocks than C* CQL/JSON
● What they have in common:
● Hash partitioning
● Undefined sorting within partitions; use ORDER BY
© DataStax, All Rights Reserved.
Data model
© DataStax, All Rights Reserved.
“Multi model”
● Not all features supported everywhere
● Azure Functions
● Change Feed
● TLDR use SQL/Document API
© DataStax, All Rights Reserved.
© DataStax, All Rights Reserved.
SQL support and extensions
SELECT c.givenName
FROM Families f
JOIN c IN f.children
WHERE f.id = 'WakefieldFamily'
ORDER BY f.address.city ASC
“The language lets you refer to nodes of the tree at any
arbitrary depth, like Node1.Node2.Node3…..NodeM”
© DataStax, All Rights Reserved.
© DataStax, All Rights Reserved.
Consistency Levels
© DataStax, All Rights Reserved.
Consistency Levels
● This is probably still too many
(“About 73% of Azure Cosmos DB tenants use session
consistency and 20% prefer bounded staleness.”)
© DataStax, All Rights Reserved.
Implementation clue?
© DataStax, All Rights Reserved.
Multi-region
● Claims local read/writes with async replication between
regions
● But, also claims ACID single-partition transactions in
stored procedures
● You can’t have both! Something doesn’t add up!
© DataStax, All Rights Reserved.
Notable features
● Everything is indexed
99p < 20% overhead
● “Attachment” special document type for blobs
Main purpose seems to be to allow PUTing data easily
● Stored procedures
Including (single-partition) transactions
Only for SQL API
● Change feed
© DataStax, All Rights Reserved.
Spanner
© DataStax, All Rights Reserved.
Data model: CP multi-partition
● “Reuse existing SQL skills to query data in Cloud
Spanner using familiar, industry-standard ANSI 2011
SQL.”
(Actually a fairly small subset)
(And only for SELECT)
© DataStax, All Rights Reserved.
Interleaved/child tables
CREATE TABLE Singers (
SingerId INT64 NOT NULL,
FirstName STRING(1024),
LastName STRING(1024),
SingerInfo BYTES(MAX),
) PRIMARY KEY (SingerId);
CREATE TABLE Albums (
SingerId INT64 NOT NULL,
AlbumId INT64 NOT NULL,
AlbumTitle STRING(MAX),
) PRIMARY KEY (SingerId, AlbumId),
INTERLEAVE IN PARENT Singers ON DELETE CASCADE;
© DataStax, All Rights Reserved.
© DataStax, All Rights Reserved.
Multi-region
© DataStax, All Rights Reserved.
Multi-region, TLDR
● You can replicate to multiple regions
● Only one region can accept writes at a time
● Opinion: it is not often useful to scale reads without also
scaling writes
© DataStax, All Rights Reserved.
Notable features
● Full multi-partition ACID 2PC
(using Paxos replication groups)
● DFS-based, not local storage
© DataStax, All Rights Reserved.
The price of ACID
© DataStax, All Rights Reserved.
Writes slow down (indexed) reads
Quizlet:
“Bulk writes severely impact the performance of queries
using the secondary index [becuase] a write with a
secondary index updates many splits, which [since
Spanner uses pessimistic locking] creates contention for
reads that use that secondary index.”
© DataStax, All Rights Reserved.
Practical considerations
© DataStax, All Rights Reserved.
Cassandra
● Run anywhere you like--but you have to run it
○ But: DataStax Managed Cloud, Instaclustr
○ Also: DataStax Remote DBA
● Storage closely tied to compute
○ But: everyone struggles with this
© DataStax, All Rights Reserved.
Multi-cloud
● JP Morgan: “We have seen increasingly all the
customers we talk to, almost exclusively large
mid-market to large enterprise, all now are embracing
multi-cloud as a specific strategy.”
●
© DataStax, All Rights Reserved.
DynamoDB
● Request capacity tied to “partitions” [pp]
○ pp count = max (rc / 3000, wc / 1000, st / 10 GB)
● Subtle implication: capacity / pp decreases as storage
volume increases
○ Non-uniform: pp request capacity halved when shard splits
● Subtle implication 2: bulk loads will wreck your planning
© DataStax, All Rights Reserved.
“Best practices for tables”
● Bulk load 20M items = 20 GB
● Target 30 minutes = 11,000 write capacity = 11 pps
● Post bulk load steady state 200 req/s = 18 req/pp
● No way to reduce partition count
© DataStax, All Rights Reserved.
DynamoDB provisioning in the wild
● You Probably Shouldn’t Use DynamoDB
○ Hacker News Discussion
● The Million Dollar Engineering Challenge
○ Hacker News discussion
© DataStax, All Rights Reserved.
CosmosDB
● Like DynamoDB, but underdocumented
● Partition and scale in Azure Cosmos DB
○ Unspecified: max pp storage size, max pp request capacity
© DataStax, All Rights Reserved.
Spanner
● DFS architecture means it doesn’t have the provisioning
problem
● Priced per node (~$8500 min per 2 TB data, per year) +
$3600 / TB / yr
○ Doesn’t appear to be part of the GCP free usage tier
● Competing with Cloud Datastore, Cloud BigTable
© DataStax, All Rights Reserved.
Recommendations
© DataStax, All Rights Reserved.
Never recommended
● DynamoDB: Lack of nesting is too limiting
● Spanner: All ACID, all the time is too expensive
© DataStax, All Rights Reserved.
CosmosDB vs Cassandra
● Rough feature parity
○ Partitioned rows (documents) with nesting
○ Default EC with opt-in to stronger options
● Cassandra
○ Materialized views
○ True multi-model
○ Predictable provisioning
● CosmosDB
○ Stored procedures
○ ORDER BY
○ Change feed
© DataStax, All Rights Reserved.
CosmosDB vs Cassandra
● Given rough feature parity, why would you pick the one
that only runs in a single cloud?
● Hobbyist / less than one (three) C* VM?
© DataStax, All Rights Reserved.
DataStax Enterprise
© DataStax, All Rights Reserved.
Distributed Systems Reading List
● Bigtable: A Distributed Storage System for Structured
Data
● Dynamo: Amazon’s Highly Available Key-value Store
● Cassandra - A Decentralized Structured Storage
System [annotated by Jonathan Ellis]
● Skew-aware automatic database partitioning in
shared-nothing, parallel OLTP systems
● Calvin: Fast Distributed Transactions for Partitioned
Database Systems
● Spanner: Google's Globally-Distributed Database
© DataStax, All Rights Reserved.
Thank you

More Related Content

What's hot

Lecture 40 1
Lecture 40 1Lecture 40 1
Lecture 40 1
patib5
 
Large partition in Cassandra
Large partition in CassandraLarge partition in Cassandra
Large partition in Cassandra
Shogo Hoshii
 
N1QL New Features in couchbase 7.0
N1QL New Features in couchbase 7.0N1QL New Features in couchbase 7.0
N1QL New Features in couchbase 7.0
Keshav Murthy
 
Introduction to Real-Time Analytics with Cassandra and Hadoop
Introduction to Real-Time Analytics with Cassandra and HadoopIntroduction to Real-Time Analytics with Cassandra and Hadoop
Introduction to Real-Time Analytics with Cassandra and Hadoop
Patricia Gorla
 
NoSQL Introduction
NoSQL IntroductionNoSQL Introduction
NoSQL Introduction
John Kerley-Weeks
 
Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra...
Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra...Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra...
Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra...
DataStax
 
Couchbase Tutorial: Big data Open Source Systems: VLDB2018
Couchbase Tutorial: Big data Open Source Systems: VLDB2018Couchbase Tutorial: Big data Open Source Systems: VLDB2018
Couchbase Tutorial: Big data Open Source Systems: VLDB2018
Keshav Murthy
 
5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment
Jim Hatcher
 
Using Approximate Data for Small, Insightful Analytics (Ben Kornmeier, Protec...
Using Approximate Data for Small, Insightful Analytics (Ben Kornmeier, Protec...Using Approximate Data for Small, Insightful Analytics (Ben Kornmeier, Protec...
Using Approximate Data for Small, Insightful Analytics (Ben Kornmeier, Protec...
DataStax
 
Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016
Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016
Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016
DataStax
 
druid.io
druid.iodruid.io
Boundary Front end tech talk: how it works
Boundary Front end tech talk: how it worksBoundary Front end tech talk: how it works
Boundary Front end tech talk: how it works
Boundary
 
Mongodb - NoSql Database
Mongodb - NoSql DatabaseMongodb - NoSql Database
Mongodb - NoSql Database
Prashant Gupta
 
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.
Natalino Busa
 
ClickHouse Introduction by Alexander Zaitsev, Altinity CTO
ClickHouse Introduction by Alexander Zaitsev, Altinity CTOClickHouse Introduction by Alexander Zaitsev, Altinity CTO
ClickHouse Introduction by Alexander Zaitsev, Altinity CTO
Altinity Ltd
 
Writing A Foreign Data Wrapper
Writing A Foreign Data WrapperWriting A Foreign Data Wrapper
Writing A Foreign Data Wrapper
psoo1978
 
Accessing Databases from R
Accessing Databases from RAccessing Databases from R
Accessing Databases from R
Jeffrey Breen
 
Андрей Козлов (Altoros): Оптимизация производительности Cassandra
Андрей Козлов (Altoros): Оптимизация производительности CassandraАндрей Козлов (Altoros): Оптимизация производительности Cassandra
Андрей Козлов (Altoros): Оптимизация производительности Cassandra
Olga Lavrentieva
 
23 October 2013 - AWS 201 - A Walk through the AWS Cloud: Introduction to Ama...
23 October 2013 - AWS 201 - A Walk through the AWS Cloud: Introduction to Ama...23 October 2013 - AWS 201 - A Walk through the AWS Cloud: Introduction to Ama...
23 October 2013 - AWS 201 - A Walk through the AWS Cloud: Introduction to Ama...
Amazon Web Services
 
Druid
DruidDruid

What's hot (20)

Lecture 40 1
Lecture 40 1Lecture 40 1
Lecture 40 1
 
Large partition in Cassandra
Large partition in CassandraLarge partition in Cassandra
Large partition in Cassandra
 
N1QL New Features in couchbase 7.0
N1QL New Features in couchbase 7.0N1QL New Features in couchbase 7.0
N1QL New Features in couchbase 7.0
 
Introduction to Real-Time Analytics with Cassandra and Hadoop
Introduction to Real-Time Analytics with Cassandra and HadoopIntroduction to Real-Time Analytics with Cassandra and Hadoop
Introduction to Real-Time Analytics with Cassandra and Hadoop
 
NoSQL Introduction
NoSQL IntroductionNoSQL Introduction
NoSQL Introduction
 
Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra...
Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra...Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra...
Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra...
 
Couchbase Tutorial: Big data Open Source Systems: VLDB2018
Couchbase Tutorial: Big data Open Source Systems: VLDB2018Couchbase Tutorial: Big data Open Source Systems: VLDB2018
Couchbase Tutorial: Big data Open Source Systems: VLDB2018
 
5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment
 
Using Approximate Data for Small, Insightful Analytics (Ben Kornmeier, Protec...
Using Approximate Data for Small, Insightful Analytics (Ben Kornmeier, Protec...Using Approximate Data for Small, Insightful Analytics (Ben Kornmeier, Protec...
Using Approximate Data for Small, Insightful Analytics (Ben Kornmeier, Protec...
 
Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016
Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016
Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016
 
druid.io
druid.iodruid.io
druid.io
 
Boundary Front end tech talk: how it works
Boundary Front end tech talk: how it worksBoundary Front end tech talk: how it works
Boundary Front end tech talk: how it works
 
Mongodb - NoSql Database
Mongodb - NoSql DatabaseMongodb - NoSql Database
Mongodb - NoSql Database
 
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.
 
ClickHouse Introduction by Alexander Zaitsev, Altinity CTO
ClickHouse Introduction by Alexander Zaitsev, Altinity CTOClickHouse Introduction by Alexander Zaitsev, Altinity CTO
ClickHouse Introduction by Alexander Zaitsev, Altinity CTO
 
Writing A Foreign Data Wrapper
Writing A Foreign Data WrapperWriting A Foreign Data Wrapper
Writing A Foreign Data Wrapper
 
Accessing Databases from R
Accessing Databases from RAccessing Databases from R
Accessing Databases from R
 
Андрей Козлов (Altoros): Оптимизация производительности Cassandra
Андрей Козлов (Altoros): Оптимизация производительности CassandraАндрей Козлов (Altoros): Оптимизация производительности Cassandra
Андрей Козлов (Altoros): Оптимизация производительности Cassandra
 
23 October 2013 - AWS 201 - A Walk through the AWS Cloud: Introduction to Ama...
23 October 2013 - AWS 201 - A Walk through the AWS Cloud: Introduction to Ama...23 October 2013 - AWS 201 - A Walk through the AWS Cloud: Introduction to Ama...
23 October 2013 - AWS 201 - A Walk through the AWS Cloud: Introduction to Ama...
 
Druid
DruidDruid
Druid
 

Similar to Data day texas: Cassandra and the Cloud

Five Lessons in Distributed Databases
Five Lessons  in Distributed DatabasesFive Lessons  in Distributed Databases
Five Lessons in Distributed Databases
jbellis
 
DataStax 6 and Beyond
DataStax 6 and BeyondDataStax 6 and Beyond
DataStax 6 and Beyond
David Jones-Gilardi
 
implementation of a big data architecture for real-time analytics with data s...
implementation of a big data architecture for real-time analytics with data s...implementation of a big data architecture for real-time analytics with data s...
implementation of a big data architecture for real-time analytics with data s...
Joseph Arriola
 
Cassandra Exports as a Trivially Parallelizable Problem (Emilio Del Tessandor...
Cassandra Exports as a Trivially Parallelizable Problem (Emilio Del Tessandor...Cassandra Exports as a Trivially Parallelizable Problem (Emilio Del Tessandor...
Cassandra Exports as a Trivially Parallelizable Problem (Emilio Del Tessandor...
DataStax
 
Galaxy Big Data with MariaDB
Galaxy Big Data with MariaDBGalaxy Big Data with MariaDB
Galaxy Big Data with MariaDB
MariaDB Corporation
 
Flight on Zeppelin with Apache Spark & Cassandra
Flight on Zeppelin with Apache Spark & CassandraFlight on Zeppelin with Apache Spark & Cassandra
Flight on Zeppelin with Apache Spark & Cassandra
Alex Ott
 
Safer restarts, faster streaming, and better repair, just a glimpse of cassan...
Safer restarts, faster streaming, and better repair, just a glimpse of cassan...Safer restarts, faster streaming, and better repair, just a glimpse of cassan...
Safer restarts, faster streaming, and better repair, just a glimpse of cassan...
Vinay Kumar Chella
 
Everything We Learned About In-Memory Data Layout While Building VoltDB
Everything We Learned About In-Memory Data Layout While Building VoltDBEverything We Learned About In-Memory Data Layout While Building VoltDB
Everything We Learned About In-Memory Data Layout While Building VoltDB
jhugg
 
Dancing with the Elephant
Dancing with the ElephantDancing with the Elephant
Dancing with the Elephant
DataWorks Summit
 
Introduction to NoSql
Introduction to NoSqlIntroduction to NoSql
Introduction to NoSql
Omid Vahdaty
 
NoSQL Solutions - a comparative study
NoSQL Solutions - a comparative studyNoSQL Solutions - a comparative study
NoSQL Solutions - a comparative study
Guillaume Lefranc
 
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive
AWS Big Data Demystified #2 |  Athena, Spectrum, Emr, Hive AWS Big Data Demystified #2 |  Athena, Spectrum, Emr, Hive
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive
Omid Vahdaty
 
Scio - A Scala API for Google Cloud Dataflow & Apache Beam
Scio - A Scala API for Google Cloud Dataflow & Apache BeamScio - A Scala API for Google Cloud Dataflow & Apache Beam
Scio - A Scala API for Google Cloud Dataflow & Apache Beam
Neville Li
 
Getting Started With Amazon Redshift
Getting Started With Amazon Redshift Getting Started With Amazon Redshift
Getting Started With Amazon Redshift
Matillion
 
Sorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifySorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at Spotify
Neville Li
 
Understanding the architecture of MariaDB ColumnStore
Understanding the architecture of MariaDB ColumnStoreUnderstanding the architecture of MariaDB ColumnStore
Understanding the architecture of MariaDB ColumnStore
MariaDB plc
 
Make your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWSMake your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWS
Kimmo Kantojärvi
 
Using Pluggable Apache Spark SQL Filters to Help GridPocket Users Keep Up wit...
Using Pluggable Apache Spark SQL Filters to Help GridPocket Users Keep Up wit...Using Pluggable Apache Spark SQL Filters to Help GridPocket Users Keep Up wit...
Using Pluggable Apache Spark SQL Filters to Help GridPocket Users Keep Up wit...
Spark Summit
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftBest Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Amazon Web Services
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Amazon Web Services
 

Similar to Data day texas: Cassandra and the Cloud (20)

Five Lessons in Distributed Databases
Five Lessons  in Distributed DatabasesFive Lessons  in Distributed Databases
Five Lessons in Distributed Databases
 
DataStax 6 and Beyond
DataStax 6 and BeyondDataStax 6 and Beyond
DataStax 6 and Beyond
 
implementation of a big data architecture for real-time analytics with data s...
implementation of a big data architecture for real-time analytics with data s...implementation of a big data architecture for real-time analytics with data s...
implementation of a big data architecture for real-time analytics with data s...
 
Cassandra Exports as a Trivially Parallelizable Problem (Emilio Del Tessandor...
Cassandra Exports as a Trivially Parallelizable Problem (Emilio Del Tessandor...Cassandra Exports as a Trivially Parallelizable Problem (Emilio Del Tessandor...
Cassandra Exports as a Trivially Parallelizable Problem (Emilio Del Tessandor...
 
Galaxy Big Data with MariaDB
Galaxy Big Data with MariaDBGalaxy Big Data with MariaDB
Galaxy Big Data with MariaDB
 
Flight on Zeppelin with Apache Spark & Cassandra
Flight on Zeppelin with Apache Spark & CassandraFlight on Zeppelin with Apache Spark & Cassandra
Flight on Zeppelin with Apache Spark & Cassandra
 
Safer restarts, faster streaming, and better repair, just a glimpse of cassan...
Safer restarts, faster streaming, and better repair, just a glimpse of cassan...Safer restarts, faster streaming, and better repair, just a glimpse of cassan...
Safer restarts, faster streaming, and better repair, just a glimpse of cassan...
 
Everything We Learned About In-Memory Data Layout While Building VoltDB
Everything We Learned About In-Memory Data Layout While Building VoltDBEverything We Learned About In-Memory Data Layout While Building VoltDB
Everything We Learned About In-Memory Data Layout While Building VoltDB
 
Dancing with the Elephant
Dancing with the ElephantDancing with the Elephant
Dancing with the Elephant
 
Introduction to NoSql
Introduction to NoSqlIntroduction to NoSql
Introduction to NoSql
 
NoSQL Solutions - a comparative study
NoSQL Solutions - a comparative studyNoSQL Solutions - a comparative study
NoSQL Solutions - a comparative study
 
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive
AWS Big Data Demystified #2 |  Athena, Spectrum, Emr, Hive AWS Big Data Demystified #2 |  Athena, Spectrum, Emr, Hive
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive
 
Scio - A Scala API for Google Cloud Dataflow & Apache Beam
Scio - A Scala API for Google Cloud Dataflow & Apache BeamScio - A Scala API for Google Cloud Dataflow & Apache Beam
Scio - A Scala API for Google Cloud Dataflow & Apache Beam
 
Getting Started With Amazon Redshift
Getting Started With Amazon Redshift Getting Started With Amazon Redshift
Getting Started With Amazon Redshift
 
Sorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifySorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at Spotify
 
Understanding the architecture of MariaDB ColumnStore
Understanding the architecture of MariaDB ColumnStoreUnderstanding the architecture of MariaDB ColumnStore
Understanding the architecture of MariaDB ColumnStore
 
Make your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWSMake your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWS
 
Using Pluggable Apache Spark SQL Filters to Help GridPocket Users Keep Up wit...
Using Pluggable Apache Spark SQL Filters to Help GridPocket Users Keep Up wit...Using Pluggable Apache Spark SQL Filters to Help GridPocket Users Keep Up wit...
Using Pluggable Apache Spark SQL Filters to Help GridPocket Users Keep Up wit...
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftBest Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift
 

More from jbellis

Vector Search @ sw2con for slideshare.pptx
Vector Search @ sw2con for slideshare.pptxVector Search @ sw2con for slideshare.pptx
Vector Search @ sw2con for slideshare.pptx
jbellis
 
Cassandra Summit 2015
Cassandra Summit 2015Cassandra Summit 2015
Cassandra Summit 2015
jbellis
 
Cassandra summit keynote 2014
Cassandra summit keynote 2014Cassandra summit keynote 2014
Cassandra summit keynote 2014
jbellis
 
Cassandra 2.1
Cassandra 2.1Cassandra 2.1
Cassandra 2.1
jbellis
 
Tokyo cassandra conference 2014
Tokyo cassandra conference 2014Tokyo cassandra conference 2014
Tokyo cassandra conference 2014
jbellis
 
Cassandra Summit EU 2013
Cassandra Summit EU 2013Cassandra Summit EU 2013
Cassandra Summit EU 2013
jbellis
 
London + Dublin Cassandra 2.0
London + Dublin Cassandra 2.0London + Dublin Cassandra 2.0
London + Dublin Cassandra 2.0
jbellis
 
Cassandra Summit 2013 Keynote
Cassandra Summit 2013 KeynoteCassandra Summit 2013 Keynote
Cassandra Summit 2013 Keynote
jbellis
 
Cassandra at NoSql Matters 2012
Cassandra at NoSql Matters 2012Cassandra at NoSql Matters 2012
Cassandra at NoSql Matters 2012
jbellis
 
Top five questions to ask when choosing a big data solution
Top five questions to ask when choosing a big data solutionTop five questions to ask when choosing a big data solution
Top five questions to ask when choosing a big data solution
jbellis
 
State of Cassandra 2012
State of Cassandra 2012State of Cassandra 2012
State of Cassandra 2012
jbellis
 
Massively Scalable NoSQL with Apache Cassandra
Massively Scalable NoSQL with Apache CassandraMassively Scalable NoSQL with Apache Cassandra
Massively Scalable NoSQL with Apache Cassandra
jbellis
 
Cassandra 1.1
Cassandra 1.1Cassandra 1.1
Cassandra 1.1
jbellis
 
Pycon 2012 What Python can learn from Java
Pycon 2012 What Python can learn from JavaPycon 2012 What Python can learn from Java
Pycon 2012 What Python can learn from Java
jbellis
 
Apache Cassandra: NoSQL in the enterprise
Apache Cassandra: NoSQL in the enterpriseApache Cassandra: NoSQL in the enterprise
Apache Cassandra: NoSQL in the enterprise
jbellis
 
Dealing with JVM limitations in Apache Cassandra (Fosdem 2012)
Dealing with JVM limitations in Apache Cassandra (Fosdem 2012)Dealing with JVM limitations in Apache Cassandra (Fosdem 2012)
Dealing with JVM limitations in Apache Cassandra (Fosdem 2012)
jbellis
 
Cassandra at High Performance Transaction Systems 2011
Cassandra at High Performance Transaction Systems 2011Cassandra at High Performance Transaction Systems 2011
Cassandra at High Performance Transaction Systems 2011
jbellis
 
Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)
Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)
Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)
jbellis
 
What python can learn from java
What python can learn from javaWhat python can learn from java
What python can learn from java
jbellis
 
State of Cassandra, 2011
State of Cassandra, 2011State of Cassandra, 2011
State of Cassandra, 2011
jbellis
 

More from jbellis (20)

Vector Search @ sw2con for slideshare.pptx
Vector Search @ sw2con for slideshare.pptxVector Search @ sw2con for slideshare.pptx
Vector Search @ sw2con for slideshare.pptx
 
Cassandra Summit 2015
Cassandra Summit 2015Cassandra Summit 2015
Cassandra Summit 2015
 
Cassandra summit keynote 2014
Cassandra summit keynote 2014Cassandra summit keynote 2014
Cassandra summit keynote 2014
 
Cassandra 2.1
Cassandra 2.1Cassandra 2.1
Cassandra 2.1
 
Tokyo cassandra conference 2014
Tokyo cassandra conference 2014Tokyo cassandra conference 2014
Tokyo cassandra conference 2014
 
Cassandra Summit EU 2013
Cassandra Summit EU 2013Cassandra Summit EU 2013
Cassandra Summit EU 2013
 
London + Dublin Cassandra 2.0
London + Dublin Cassandra 2.0London + Dublin Cassandra 2.0
London + Dublin Cassandra 2.0
 
Cassandra Summit 2013 Keynote
Cassandra Summit 2013 KeynoteCassandra Summit 2013 Keynote
Cassandra Summit 2013 Keynote
 
Cassandra at NoSql Matters 2012
Cassandra at NoSql Matters 2012Cassandra at NoSql Matters 2012
Cassandra at NoSql Matters 2012
 
Top five questions to ask when choosing a big data solution
Top five questions to ask when choosing a big data solutionTop five questions to ask when choosing a big data solution
Top five questions to ask when choosing a big data solution
 
State of Cassandra 2012
State of Cassandra 2012State of Cassandra 2012
State of Cassandra 2012
 
Massively Scalable NoSQL with Apache Cassandra
Massively Scalable NoSQL with Apache CassandraMassively Scalable NoSQL with Apache Cassandra
Massively Scalable NoSQL with Apache Cassandra
 
Cassandra 1.1
Cassandra 1.1Cassandra 1.1
Cassandra 1.1
 
Pycon 2012 What Python can learn from Java
Pycon 2012 What Python can learn from JavaPycon 2012 What Python can learn from Java
Pycon 2012 What Python can learn from Java
 
Apache Cassandra: NoSQL in the enterprise
Apache Cassandra: NoSQL in the enterpriseApache Cassandra: NoSQL in the enterprise
Apache Cassandra: NoSQL in the enterprise
 
Dealing with JVM limitations in Apache Cassandra (Fosdem 2012)
Dealing with JVM limitations in Apache Cassandra (Fosdem 2012)Dealing with JVM limitations in Apache Cassandra (Fosdem 2012)
Dealing with JVM limitations in Apache Cassandra (Fosdem 2012)
 
Cassandra at High Performance Transaction Systems 2011
Cassandra at High Performance Transaction Systems 2011Cassandra at High Performance Transaction Systems 2011
Cassandra at High Performance Transaction Systems 2011
 
Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)
Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)
Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)
 
What python can learn from java
What python can learn from javaWhat python can learn from java
What python can learn from java
 
State of Cassandra, 2011
State of Cassandra, 2011State of Cassandra, 2011
State of Cassandra, 2011
 

Recently uploaded

5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
DanBrown980551
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
Ivanti
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
Postman
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
shyamraj55
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
Zilliz
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
Zilliz
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
Pixlogix Infotech
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
Brandon Minnick, MBA
 
FREE A4 Cyber Security Awareness Posters-Social Engineering part 3
FREE A4 Cyber Security Awareness  Posters-Social Engineering part 3FREE A4 Cyber Security Awareness  Posters-Social Engineering part 3
FREE A4 Cyber Security Awareness Posters-Social Engineering part 3
Data Hops
 
dbms calicut university B. sc Cs 4th sem.pdf
dbms  calicut university B. sc Cs 4th sem.pdfdbms  calicut university B. sc Cs 4th sem.pdf
dbms calicut university B. sc Cs 4th sem.pdf
Shinana2
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
Azure API Management to expose backend services securely
Azure API Management to expose backend services securelyAzure API Management to expose backend services securely
Azure API Management to expose backend services securely
Dinusha Kumarasiri
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
 
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
ScyllaDB
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
ssuserfac0301
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
Tomaz Bratanic
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
 
Public CyberSecurity Awareness Presentation 2024.pptx
Public CyberSecurity Awareness Presentation 2024.pptxPublic CyberSecurity Awareness Presentation 2024.pptx
Public CyberSecurity Awareness Presentation 2024.pptx
marufrahmanstratejm
 
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Tatiana Kojar
 

Recently uploaded (20)

5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
 
FREE A4 Cyber Security Awareness Posters-Social Engineering part 3
FREE A4 Cyber Security Awareness  Posters-Social Engineering part 3FREE A4 Cyber Security Awareness  Posters-Social Engineering part 3
FREE A4 Cyber Security Awareness Posters-Social Engineering part 3
 
dbms calicut university B. sc Cs 4th sem.pdf
dbms  calicut university B. sc Cs 4th sem.pdfdbms  calicut university B. sc Cs 4th sem.pdf
dbms calicut university B. sc Cs 4th sem.pdf
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
Azure API Management to expose backend services securely
Azure API Management to expose backend services securelyAzure API Management to expose backend services securely
Azure API Management to expose backend services securely
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 
Public CyberSecurity Awareness Presentation 2024.pptx
Public CyberSecurity Awareness Presentation 2024.pptxPublic CyberSecurity Awareness Presentation 2024.pptx
Public CyberSecurity Awareness Presentation 2024.pptx
 
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
 

Data day texas: Cassandra and the Cloud

  • 1. © DataStax, All Rights Reserved. Cassandra and the Cloud (2018 edition) Jonathan Ellis
  • 2. © DataStax, All Rights Reserved. “Databases”
  • 3. © DataStax, All Rights Reserved. Stuff everyone agrees on
  • 4. © DataStax, All Rights Reserved. Stuff (Almost) Everyone Agrees On 1. Eventual Consistency is useful
  • 5. © DataStax, All Rights Reserved. Eventual Consistency (CP edition)
  • 6. © DataStax, All Rights Reserved. Eventual Consistency (AP edition)
  • 7. © DataStax, All Rights Reserved. Default consistency levels 1. Cassandra: Eventual 2. Dynamo: Eventual 3. CosmosDB: Eventual (“Session”) 4. Spanner: ACID (no EC)
  • 8. © DataStax, All Rights Reserved. Stuff (Almost) Everyone Agrees On 1. Eventual Consistency is useful 2. Automatic partitioning doesn’t work
  • 9. © DataStax, All Rights Reserved. H-Store (2012)
  • 10. © DataStax, All Rights Reserved. Partitioning approaches 1. Cassandra: Explicit 2. Dynamo: Explicit 3. CosmosDB: Explicit 4. Spanner: Explicit
  • 11. © DataStax, All Rights Reserved. Stuff (Almost) Everyone Agrees On 1. Eventual Consistency is useful 2. Automatic partitioning doesn’t work 3. SQL is a pretty okay query language
  • 12. © DataStax, All Rights Reserved.
  • 13. © DataStax, All Rights Reserved. Query APIs 1. Cassandra: CQL, inspired by SQL 2. DynamoDB: Actually still pretty first-gen NoSQL 3. CosmosDB: “SQL” 4. Spanner: “SQL”
  • 14. © DataStax, All Rights Reserved. Stuff (Almost) Everyone Agrees On 1. Eventual Consistency is useful 2. Automatic partitioning doesn’t work 3. SQL is a pretty okay query language 4. … that’s about it
  • 15. © DataStax, All Rights Reserved. Thomas Sowell There are no solutions. Only tradeoffs.
  • 16. © DataStax, All Rights Reserved. Cassandra
  • 17. © DataStax, All Rights Reserved. Data Model: tabular, with nested content CREATE TABLE notifications ( target_user text, notification_id timeuuid, source_id uuid, source_type text, activity text, PRIMARY KEY (target_user, notification_id) ) WITH CLUSTERING ORDER BY (notification_id DESC);
  • 18. © DataStax, All Rights Reserved. target_user notification_id source_id source_type activity nick e1bd2bcb- d972b679- photo tom liked nick 321998c- d972b679- photo jake commented nick ea1c5d35- 88a049d5- user mike created account nick 5321998c- 64613f27- photo tom commented nick 07581439- 076eab7e- user tyler created account mike 1c34467a- f04e309f- user tom created account
  • 19. © DataStax, All Rights Reserved. Collections CREATE TABLE users ( id uuid PRIMARY KEY, name text, state text, birth_date int, email_addresses set<text> );
  • 20. © DataStax, All Rights Reserved. User-defined Types CREATE TYPE address ( street text, city text, zip_code int, phones set<text> ) CREATE TABLE users ( id uuid PRIMARY KEY, name text, addresses map<text, address> ) SELECT id, name, addresses.city, addresses.phones FROM users; id | name | addresses.city | addresses.phones --------------------+----------------+-------------------------- 63bf691f | jbellis | Austin | {'512-4567', '512-9999'}
  • 21. © DataStax, All Rights Reserved. JSON INSERT INTO users JSON '{"id": "0514e410-", "name": "jbellis", "addresses": {"home": {"street": "9920 Cassandra Ave", "city": "Austin", "zip_code": 78700, "phones": ["1238614789"]}}}';
  • 22. © DataStax, All Rights Reserved. Without nesting
  • 23. © DataStax, All Rights Reserved. With nesting
  • 24. © DataStax, All Rights Reserved. Consistency Levels
  • 25. © DataStax, All Rights Reserved. Consistency Levels
  • 26. © DataStax, All Rights Reserved. Consistency Levels
  • 27. © DataStax, All Rights Reserved. Multi-region 1. Synchronous writes locally; async globally 2. Serve reads and writes for any row in any region 3. DR is “free” 4. Client-level support
  • 28. © DataStax, All Rights Reserved. Notable features ● Lightweight Transactions (Paxos, expensive) ● Materialized Views ● User-defined functions ● Strict schema
  • 29. © DataStax, All Rights Reserved.
  • 30. © DataStax, All Rights Reserved.
  • 31. © DataStax, All Rights Reserved. Schema confusion {"userid": "2452347", "name": "jbellis", ... } {"userid": 2452348, "name": "jshook", ... } {"user_id": 2452349, "name": "jlacefield", ... }
  • 32. © DataStax, All Rights Reserved. DynamoDB
  • 33. © DataStax, All Rights Reserved. CP single partition ● Original Dynamo was AP ● DynamoDB offers Strong/Eventual read consistency ● But Conditional writes are the same price as regular writes And: “All write requests are applied in the order in which they were received”
  • 34. © DataStax, All Rights Reserved. Data model ● “Map of maps” ● Primary key, sort key
  • 35. © DataStax, All Rights Reserved. Data Model
  • 36. © DataStax, All Rights Reserved. Multi-region ● New feature (late 2017): Global tables ● Shards and replicates a table across regions ● Each shard can only be written to by its master region
  • 37. © DataStax, All Rights Reserved. Notable features ● Global indexes ● Change feed ● “DynamoDB Transaction Library” “A put that does not contend with any other simultaneous puts can be expected to perform 7N + 4 writes as the original operation, where N is the number of requests in the transaction.”
  • 38. © DataStax, All Rights Reserved. Sidebar: cross-partition txns in AP?
  • 39. © DataStax, All Rights Reserved. CosmosDB
  • 40. © DataStax, All Rights Reserved. Data model: CP Single Partition
  • 41. © DataStax, All Rights Reserved. “Multi model” ● NOT just “APIs” ● More like MyRocks/MongoRocks than C* CQL/JSON ● What they have in common: ● Hash partitioning ● Undefined sorting within partitions; use ORDER BY
  • 42. © DataStax, All Rights Reserved. Data model
  • 43. © DataStax, All Rights Reserved. “Multi model” ● Not all features supported everywhere ● Azure Functions ● Change Feed ● TLDR use SQL/Document API
  • 44. © DataStax, All Rights Reserved.
  • 45. © DataStax, All Rights Reserved. SQL support and extensions SELECT c.givenName FROM Families f JOIN c IN f.children WHERE f.id = 'WakefieldFamily' ORDER BY f.address.city ASC “The language lets you refer to nodes of the tree at any arbitrary depth, like Node1.Node2.Node3…..NodeM”
  • 46. © DataStax, All Rights Reserved.
  • 47. © DataStax, All Rights Reserved. Consistency Levels
  • 48. © DataStax, All Rights Reserved. Consistency Levels ● This is probably still too many (“About 73% of Azure Cosmos DB tenants use session consistency and 20% prefer bounded staleness.”)
  • 49. © DataStax, All Rights Reserved. Implementation clue?
  • 50. © DataStax, All Rights Reserved. Multi-region ● Claims local read/writes with async replication between regions ● But, also claims ACID single-partition transactions in stored procedures ● You can’t have both! Something doesn’t add up!
  • 51. © DataStax, All Rights Reserved. Notable features ● Everything is indexed 99p < 20% overhead ● “Attachment” special document type for blobs Main purpose seems to be to allow PUTing data easily ● Stored procedures Including (single-partition) transactions Only for SQL API ● Change feed
  • 52. © DataStax, All Rights Reserved. Spanner
  • 53. © DataStax, All Rights Reserved. Data model: CP multi-partition ● “Reuse existing SQL skills to query data in Cloud Spanner using familiar, industry-standard ANSI 2011 SQL.” (Actually a fairly small subset) (And only for SELECT)
  • 54. © DataStax, All Rights Reserved. Interleaved/child tables CREATE TABLE Singers ( SingerId INT64 NOT NULL, FirstName STRING(1024), LastName STRING(1024), SingerInfo BYTES(MAX), ) PRIMARY KEY (SingerId); CREATE TABLE Albums ( SingerId INT64 NOT NULL, AlbumId INT64 NOT NULL, AlbumTitle STRING(MAX), ) PRIMARY KEY (SingerId, AlbumId), INTERLEAVE IN PARENT Singers ON DELETE CASCADE;
  • 55. © DataStax, All Rights Reserved.
  • 56. © DataStax, All Rights Reserved. Multi-region
  • 57. © DataStax, All Rights Reserved. Multi-region, TLDR ● You can replicate to multiple regions ● Only one region can accept writes at a time ● Opinion: it is not often useful to scale reads without also scaling writes
  • 58. © DataStax, All Rights Reserved. Notable features ● Full multi-partition ACID 2PC (using Paxos replication groups) ● DFS-based, not local storage
  • 59. © DataStax, All Rights Reserved. The price of ACID
  • 60. © DataStax, All Rights Reserved. Writes slow down (indexed) reads Quizlet: “Bulk writes severely impact the performance of queries using the secondary index [becuase] a write with a secondary index updates many splits, which [since Spanner uses pessimistic locking] creates contention for reads that use that secondary index.”
  • 61. © DataStax, All Rights Reserved. Practical considerations
  • 62. © DataStax, All Rights Reserved. Cassandra ● Run anywhere you like--but you have to run it ○ But: DataStax Managed Cloud, Instaclustr ○ Also: DataStax Remote DBA ● Storage closely tied to compute ○ But: everyone struggles with this
  • 63. © DataStax, All Rights Reserved. Multi-cloud ● JP Morgan: “We have seen increasingly all the customers we talk to, almost exclusively large mid-market to large enterprise, all now are embracing multi-cloud as a specific strategy.” ●
  • 64. © DataStax, All Rights Reserved. DynamoDB ● Request capacity tied to “partitions” [pp] ○ pp count = max (rc / 3000, wc / 1000, st / 10 GB) ● Subtle implication: capacity / pp decreases as storage volume increases ○ Non-uniform: pp request capacity halved when shard splits ● Subtle implication 2: bulk loads will wreck your planning
  • 65. © DataStax, All Rights Reserved. “Best practices for tables” ● Bulk load 20M items = 20 GB ● Target 30 minutes = 11,000 write capacity = 11 pps ● Post bulk load steady state 200 req/s = 18 req/pp ● No way to reduce partition count
  • 66. © DataStax, All Rights Reserved. DynamoDB provisioning in the wild ● You Probably Shouldn’t Use DynamoDB ○ Hacker News Discussion ● The Million Dollar Engineering Challenge ○ Hacker News discussion
  • 67. © DataStax, All Rights Reserved. CosmosDB ● Like DynamoDB, but underdocumented ● Partition and scale in Azure Cosmos DB ○ Unspecified: max pp storage size, max pp request capacity
  • 68. © DataStax, All Rights Reserved. Spanner ● DFS architecture means it doesn’t have the provisioning problem ● Priced per node (~$8500 min per 2 TB data, per year) + $3600 / TB / yr ○ Doesn’t appear to be part of the GCP free usage tier ● Competing with Cloud Datastore, Cloud BigTable
  • 69. © DataStax, All Rights Reserved. Recommendations
  • 70. © DataStax, All Rights Reserved. Never recommended ● DynamoDB: Lack of nesting is too limiting ● Spanner: All ACID, all the time is too expensive
  • 71. © DataStax, All Rights Reserved. CosmosDB vs Cassandra ● Rough feature parity ○ Partitioned rows (documents) with nesting ○ Default EC with opt-in to stronger options ● Cassandra ○ Materialized views ○ True multi-model ○ Predictable provisioning ● CosmosDB ○ Stored procedures ○ ORDER BY ○ Change feed
  • 72. © DataStax, All Rights Reserved. CosmosDB vs Cassandra ● Given rough feature parity, why would you pick the one that only runs in a single cloud? ● Hobbyist / less than one (three) C* VM?
  • 73. © DataStax, All Rights Reserved. DataStax Enterprise
  • 74. © DataStax, All Rights Reserved. Distributed Systems Reading List ● Bigtable: A Distributed Storage System for Structured Data ● Dynamo: Amazon’s Highly Available Key-value Store ● Cassandra - A Decentralized Structured Storage System [annotated by Jonathan Ellis] ● Skew-aware automatic database partitioning in shared-nothing, parallel OLTP systems ● Calvin: Fast Distributed Transactions for Partitioned Database Systems ● Spanner: Google's Globally-Distributed Database
  • 75. © DataStax, All Rights Reserved. Thank you