The document provides an overview of Apache Cassandra, including its key components, data replication, scalability, read/write operations, and tunable data consistency. It discusses how Cassandra is a distributed, decentralized database that provides high availability and horizontal scalability. The key components that enable these features are nodes, partitioners, snitches, gossip protocols, and the replication of data across multiple nodes.
TechEvent Apache Cassandra
1. Apache Cassandra
Under The Hood
Robert Bialek
2. Who Am I
Senior Principal Consultant and Trainer at Trivadis GmbH in Munich.
– Master of Science in Computer Engineering.
– At Trivadis since 2004.
– Trivadis Partner since 2012.
Focus:
– Data and service high availability, disaster recovery.
– Architecture design, optimization, automation.
– Troubleshooting.
– Trainer: O-RAC, O-DG.
3. Agenda
1. Introduction
2. Key Components
3. Data Replication
4. Scalability
5. Read/Write Operations
6. Data Consistency
7. Summary
5. What is Apache Cassandra?
Distributed NoSQL (wide column) partitioned row store database, which runs within a JVM.
Decentralized, highly fault-tolerant database with no single point of failure.
Horizontally scalable system (computing resources/performance).
Initially developed at Facebook, released as an open source project in July 2008.
– Based on Amazon's Dynamo and Google's Bigtable.
6. Apache Cassandra & CAP Theorem?
According to the CAP (Brewer's) theorem, "it is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees":
– Consistency
– Availability
– Partition tolerance
Apache Cassandra is an AP system.
– Data results are eventually consistent (though consistency is tunable).
– Does not adhere to all ACID properties.
7. Cassandra for Enterprise Applications
Support 24x7x365.
Enterprise features, e.g.: DSE Advanced Security, DSE Analytics, DSE Search, DSE
Graph, DSE Advanced Replication, DSE Tiered Storage, DSE NodeSync, ...
Administration and monitoring with DSE OpsCenter (real-time monitoring, tuning,
provisioning, backup, security management).
According to DataStax, 2x or more throughput compared to Apache Cassandra.
Documentation, client drivers and DSE for development are free to use.
8. Who is Using Cassandra Database?
Source http://cassandra.apache.org
– Apple: over 75,000 nodes storing over 10 PB of data.
– Netflix: 2,500 nodes, 420 TB, over 1 trillion requests per day.
– Chinese search engine Easou: 270 nodes, 300 TB, over 800 million requests per
day.
– eBay: over 100 nodes, 250 TB.
Source https://www.datastax.com/customers
– Microsoft, UBS, Sony, Sky, ING, NEC, Coursera, CISCO, Walmart, NVIDIA,
Samsung, …
10. Node – Basic Database Infrastructure
Commodity hardware, ideally with local storage (to reduce dependencies).
Hosts software and configuration files:
– cassandra.yaml, cassandra-rackdc.properties, …
Hosts data and accompanying structures:
– SSTable component files on a Cassandra node (DSE: transactional node): Data.db, Index.db, Statistics.db, CompressionInfo.db, Digest.crc32, Filter.db, TOC.txt.
11. Keyspaces & Tables
Table (Column Family)
– Stores data based on a primary key.
• Primary key: partitioning key plus, optionally, clustering columns.
– Physically split into partitions.
– Denormalization (data duplication) is necessary.
Keyspace
– Grouping of data, similar to a schema.
– Defines replication properties (see the CQL sketch below).
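A minimal CQL sketch of these concepts, using a hypothetical sensor_data keyspace and readings table (the names, columns and replication factor are illustrative, not from the slides):

  -- Keyspace: carries the replication properties
  CREATE KEYSPACE IF NOT EXISTS sensor_data
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

  -- Table: the partitioning key selects the partition, the clustering column orders rows within it
  CREATE TABLE IF NOT EXISTS sensor_data.readings (
    sensor_id    text,        -- partitioning key
    reading_time timestamp,   -- clustering column
    value        double,
    PRIMARY KEY ((sensor_id), reading_time)
  ) WITH CLUSTERING ORDER BY (reading_time DESC);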
12. Partitioner – Data Distribution
Determines which node receives data, based on the token computed from the partitioning key.
Supplied partitioners (a custom partitioner can also be implemented):
– Murmur3Partitioner (default)
– RandomPartitioner
– ByteOrderedPartitioner
Example from the slide: the partition key 'Cassandra' hashes to the token 356242581507269238.
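The token that the partitioner assigns to a partition key can be inspected with the CQL token() function; a small sketch against the hypothetical readings table introduced above (the key 'sensor-42' is made up):

  -- Returns the token computed for the partitioning key of each matching row
  SELECT sensor_id, token(sensor_id)
  FROM sensor_data.readings
  WHERE sensor_id = 'sensor-42';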
13. Cassandra Ring – Single Token Architecture
Each node is assigned exactly one token (initial_token); together the nodes cover the whole token range of the partitioner.
Example from the slide (token range 1 – 40, four nodes):
– initial_token: 1 owns the range 31 – 40, 1
– initial_token: 10 owns the range 2 – 10
– initial_token: 20 owns the range 11 – 20
– initial_token: 30 owns the range 21 – 30
The partitioner maps each data item's token into exactly one of these ranges.
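Which tokens a node actually owns can be checked with a query against the system tables; a sketch (works for single-token as well as vnode setups):

  -- Tokens of the node cqlsh is connected to
  SELECT tokens FROM system.local;

  -- Tokens of the other nodes in the ring
  SELECT peer, tokens FROM system.peers;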
15. Snitches – Ring Topology
Determines the physical location (data center and rack) of a Cassandra node.
Dynamic snitching (enabled by default):
– Monitors the read performance and ring health.
Supplied snitches:
– SimpleSnitch/DseSimpleSnitch (default)
– GossipingPropertyFileSnitch
– PropertyFileSnitch
– Ec2Snitch/Ec2MultiRegionSnitch/GoogleCloudSnitch/CloudstackSnitch
– RackInferringSnitch
Figure: two data centers (DC 1, DC 2), each with Rack 1 and Rack 2.
16. Gossip – Internode Communication
Peer-to-peer communication protocol to exchange
ring state information.
Gossip process runs every second and exchanges
messages with up to three other nodes in the ring.
Eventually, all nodes learn (indirectly) about all
other nodes.
18. Cassandra Ring – Scale Out
Increases computing power and throughput of a Cassandra ring.
Online and transparent to the applications.
Figure: a new node with installed software and configuration files contacts a SEED node for ring information, generates its tokens, bootstraps (data streaming from the existing nodes) and finally finishes joining the ring.
19. Cassandra Ring – Scale In
Decreases computing power of a Cassandra ring.
Online and transparent to the applications.
Figure: on DECOMMISSION the node streams its data to the remaining nodes, its tokens are removed, and the node ends up DECOMMISSIONED.
21. Replication – Data High Availability
To ensure data and service high availability, Cassandra stores data on multiple nodes in a cluster.
All replicas are equally important (no primary or secondary data).
Replication strategy and replication factor (RF) are defined on a keyspace (application) level.
– RF can be set differently in different data centers (see the CQL sketch below).
Two replication strategies are available:
– SimpleStrategy
– NetworkTopologyStrategy
Figure: two data centers (DC 1, DC 2), each with Rack 1 and Rack 2.
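A sketch of how this is declared in CQL; the keyspace name app_data and the data center names DC1/DC2 are assumptions and must match the names reported by the snitch:

  -- NetworkTopologyStrategy with a different RF per data center
  CREATE KEYSPACE IF NOT EXISTS app_data
    WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 2};

  -- The replication settings can be changed later without downtime
  ALTER KEYSPACE app_data
    WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3};

After increasing the RF, a repair is typically required so that existing data is streamed to the newly responsible replicas.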
22. Replication – SimpleStrategy (RF: 2)
Figure: Data Center 1 – SimpleStrategy places the two replicas on consecutive nodes clockwise in the ring, without considering racks or data centers.
23. Replication – NetworkTopologyStrategy (RF/DC: 2)
Figure: Data Center 1 and Data Center 2, each with Rack 1 and Rack 2 – NetworkTopologyStrategy places two replicas in each data center, preferring distinct racks.
25. Read Request Flow on a Cassandra Node
Figure: a read request flows through the memtable and row cache in memory, then the bloom filter, partition key cache, partition summary, partition index and compression offset map, and finally the SSTables on disk.
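The individual steps of this read path can be observed per query with request tracing in cqlsh; a sketch using the hypothetical table from earlier:

  TRACING ON;
  -- The trace output lists the coordinator and replica activity for this read,
  -- including memtable and SSTable lookups
  SELECT * FROM sensor_data.readings WHERE sensor_id = 'sensor-42' LIMIT 10;
  TRACING OFF;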
26. Write Request Flow on a Cassandra Node
Figure: a write is appended to the commit log on disk and applied to the memtable in memory; when the memtable is flushed, a new SSTable is written to disk (Data.db, Index.db, Statistics.db, CompressionInfo.db, Digest.crc32, Filter.db, TOC.txt), and SSTables are later merged by the compaction process.
27. Upserts on a Cassandra Node
Table t: partition key TAG, primary key (TAG, ID).
Existing SSTables for the partition TAG = 'CASSANDRA':
– ID=1, C1=2, C2='TEST1', timestamp 100
– ID=2, C1=3, C2='TEST2', timestamp 50
Statements executed:
INSERT INTO t (TAG, ID, C1, C2) VALUES ('CASSANDRA', 1, 5, 'TEST3');
UPDATE t SET C2='PROD1' WHERE TAG='CASSANDRA' AND ID=1;
DELETE FROM t WHERE TAG='CASSANDRA' AND ID=2;
Resulting memtable entries (on read, the cell with the newest timestamp wins):
– ID=1, C1=5, C2='TEST3', timestamp 150
– ID=1, C2='PROD1', timestamp 200 (only the updated column is written)
– ID=2, tombstone (marked_deleted), timestamp 250
Inserts, updates and deletes never modify existing SSTable data; they only append new cells or tombstones.
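A plausible CQL definition of the table t used in this example, inferred from the slide (partition key TAG, primary key TAG, ID); the column types are assumptions:

  CREATE TABLE t (
    tag text,   -- partitioning key
    id  int,    -- clustering column
    c1  int,
    c2  text,
    PRIMARY KEY ((tag), id)
  );

  -- writetime() exposes the cell timestamps the upsert example relies on
  SELECT id, c1, c2, writetime(c2) FROM t WHERE tag = 'CASSANDRA';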
28. Compaction Process on a Cassandra Node
Compaction merges several SSTables into a new SSTable, keeping for each column only the cell with the newest timestamp.
Input SSTable contents in the example:
– ID=1, C1=2, C2='TEST1', timestamp 100
– ID=2, C1=3, C2='TEST2', timestamp 50
– ID=1, C1=5, C2='TEST3', timestamp 150
– ID=1, C2='PROD1', timestamp 200
– ID=2, tombstone (marked_deleted), timestamp 250
– ID=3, C1=4, C2='TEST3', timestamp 120
New SSTable after compaction:
– ID=1, C1=5, C2='PROD1', timestamp 300
– ID=2, tombstone (marked_deleted), timestamp 250 – kept, because gc_grace_seconds has not been reached yet
– ID=3, C1=4, C2='TEST3', timestamp 120
Compaction strategies (see the CQL sketch below):
– SizeTieredCompactionStrategy (STCS)
– LeveledCompactionStrategy (LCS)
– TimeWindowCompactionStrategy (TWCS)
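Both the compaction strategy and gc_grace_seconds are per-table settings; a sketch in CQL against the hypothetical table t (864000 seconds, i.e. 10 days, is the default gc_grace_seconds):

  -- Switch the table to leveled compaction
  ALTER TABLE t
    WITH compaction = {'class': 'LeveledCompactionStrategy'}
    AND gc_grace_seconds = 864000;  -- how long tombstones must be kept before they may be dropped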
30. Data Consistency – Overview
Cassandra offers tunable data consistency for read and write operations.
Two types of read requests:
– Direct read request.
– Digest read request.
Inconsistent data can be repaired automatically by:
– Background read repair requests.
– NodeSync continuous background repair (DSE 6 only).
Inconsistent data can be repaired manually by:
– Anti-entropy repair.
31. Tunable Consistency
A tradeoff between data consistency and availability (see the cqlsh sketch below).
WRITE consistency levels: ANY, ONE, TWO, THREE, LOCAL_ONE, LOCAL_QUORUM, QUORUM, EACH_QUORUM, ALL.
READ consistency levels: ONE, TWO, THREE, LOCAL_ONE, LOCAL_QUORUM, QUORUM, ALL, SERIAL, LOCAL_SERIAL.
– ANY and EACH_QUORUM are supported only for writes; SERIAL and LOCAL_SERIAL only for reads.
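In cqlsh the consistency level for subsequent requests can be set per session; a minimal sketch using the hypothetical table from earlier:

  -- Show the current consistency level (the cqlsh default is ONE)
  CONSISTENCY;

  -- Require a quorum of replicas in the local data center
  CONSISTENCY LOCAL_QUORUM;

  SELECT * FROM sensor_data.readings WHERE sensor_id = 'sensor-42';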
32. Read Requests & Tunable Consistency (1)
One DC, CONSISTENCY=QUORUM, RF=3.
Figure: the coordinator sends a direct read to one replica and a digest read to a second one; speculative_retry can contact an additional replica if a response is slow.
33. Read Requests & Tunable Consistency (2)
One DC, CONSISTENCY=QUORUM, RF=3.
Figure: direct read plus digest read; in addition, a background read repair may be sent to the remaining replica (read_repair_chance=0.10).
34. Read Requests & Tunable Consistency (3)
Two DCs, CONSISTENCY=QUORUM, RF=3.
Figure: the coordinator in DC=1 sends a direct read and digest reads, including digest reads to replicas in DC=2.
35. Read Requests & Tunable Consistency (4)
Two DCs, CONSISTENCY=LOCAL_QUORUM, RF=3.
Figure: the coordinator sends the direct read and the digest read only to replicas in its local data center (DC=1).
36. Write Requests & Tunable Consistency (1)
One DC, CONSISTENCY=ONE, RF=3.
Figure: the coordinator forwards the write to all three replicas and acknowledges it as soon as one replica has confirmed.
37. Write Requests & Tunable Consistency (2)
One DC, CONSISTENCY=QUORUM, RF=3.
Figure: a DELETE is acknowledged by a quorum while one replica is down; the missed write is kept as a hinted handoff, otherwise the deleted data could later reappear as a zombie.
38. Data Consistency – Anti-Entropy Repair
Manual data repair:
– A Merkle tree is built for each replica.
– Merkle trees are compared between all replicas.
Repair can be performed:
– Sequentially.
– In parallel.
– Datacenter-parallel.
Source: DSE 6.0 Architecture Guide
40. Summary
Cassandra is a very powerful distributed and decentralized NoSQL database with no single point of failure.
It guarantees service and data availability in the case of a partitioned network, though the data might be stale.
Designed for large data stores which require a performant and scalable system.
The application data model needs to be designed for Cassandra.
Many ways to interact with the database:
– CQLSH (Cassandra Query Language Shell).
– Drivers and tools provided by DataStax.
DataStax offers support for enterprise customers and good documentation.
41. Robert Bialek
Senior Principal Consultant
Tel. +49 89 99 27 59 38
robert.bialek@trivadis.com