Apache Con NA 2013 - Cassandra Internals by aaronmorton
The document provides an overview of the architecture and internals of Apache Cassandra. It discusses the client-facing API layer including Thrift, CQL, JMX, and CLI. It then covers the Dynamo layer which handles messaging, distributed hash tables, replication strategies, and gossip protocols. Finally, it summarizes the database layer for managing tables, columns, memtables, SSTables, and read/write paths.
T3 is an optimized protocol used to transport data between WebLogic Server and other Java programs. WebLogic Server tracks each Java Virtual Machine (JVM) it connects to and creates a single T3 connection to carry all traffic for a JVM. For example, if a client accesses an enterprise bean and JDBC connection pool on WebLogic Server, a single network connection is established between the WebLogic Server JVM and the client JVM.
This document provides an overview of the Cassandra codebase, summarizing key classes, startup processes, read and write paths, stages and threading, and bootstrap and streaming processes. It outlines the main controllers like StorageService and MessagingService, as well as lower level classes like Table and ColumnFamilyStore. It also discusses testing, using an IDE with Cassandra, and adding new API methods.
The document compares two methods for limiting the CPU usage of databases on the same server: instance caging and processor_group_name binding. It provides facts about how each method works, observations on performance differences, and examples of customer cases where each method may be best. Instance caging allows the CPU count to be limited online but leaves the SGA interleaved, while processor_group_name binding pins a database to specific CPUs, which requires a restart but keeps the SGA local. The best choice depends on factors like the number of databases and whether guaranteed CPU resources are needed for some databases.
The document describes messages related to problems accessing and modifying the cluster registry (OCR) and cluster database configuration. Common causes include the OCR being inaccessible or invalid configuration entries. Actions include verifying the OCR configuration and permissions, and ensuring the clusterware resources are configured correctly.
This document provides instructions for cloning Oracle Applications Release 12 using Rapid Clone techniques. It describes completing pre-clone steps, then using the adcfgclone.pl script to clone the database tier from the source to target environment. Next, it discusses copying over application files and cloning the applications tier. The process involves running adcfgclone.pl for the database tier and applications tier, entering prompts, and monitoring logs to complete the clone.
This document provides an overview and introduction to contributing to and understanding the internals of Apache Cassandra. It discusses how to contribute code through JIRA issues labeled "Low hanging fruit" and participating in code reviews. It also summarizes the startup sequence, main components like StorageService and StorageProxy, and challenges like the increasing size of the codebase.
The RECOORD scripts allow users to carry out NMR structure calculations using CNS with a standardized protocol. The scripts generate input files for CNS and come with their own forcefields. The protocol generates a topology file from either a primary sequence or a PDB file, then builds an extended structure and runs simulated annealing to calculate an NMR ensemble of multiple models. The scripts analyze violations and allow additional models to be calculated in a standardized, automated way.
This document contains error codes and messages for cluster command messages. It provides causes and recommended actions for various errors that could occur. Some examples are node not accessible due to network or configuration issues, null directory name passed, failed to get hostname, computer name and hostname mismatch, empty registry key values, misconfigured cluster, and failed file or directory operations due to permission issues. Recommended actions include checking networks, configurations, permissions, and contacting support.
This document provides an overview of replicating a PostgreSQL database. It discusses setting up a primary server for reads and writes and standby servers that are kept in sync with the primary to serve as backups. The primary server writes data to its write-ahead log (WAL) files, which are streamed in real-time to the standby servers via WAL shipping. This allows the standby servers to keep an identical copy of the database. The document also covers configuration of both the primary and standby servers for replication as well as tools for testing the replication setup.
This document provides instructions for diagnosing Oracle Clusterware and Real Application Clusters (RAC) high availability components. It describes how to enable and disable Oracle Clusterware daemons, collect diagnostic information, view alerts, enable resource debugging, check clusterware health, and view log files. It also covers diagnosing RAC components, enabling additional tracing, and resolving pending shutdown issues.
The document discusses 12 enhancements to wait event monitoring and analysis in Oracle 10g, including more descriptive wait event names, new columns in views like v$session and v$sqlarea, and new views such as v$event_histogram and v$session_wait_history that provide additional insight. It focuses on improvements that help DBAs more easily understand what sessions are waiting for and identify potential performance bottlenecks through better organized wait event classification and more granular wait time statistics.
RMAN was used to clone an Oracle 10g RAC database from a source database SOURCEC3 to a target database TARGETC3 using the DUPLICATE DATABASE feature. The procedure involved preparing the source and target databases, identifying necessary archive log backups, restoring the database to a single-instance target, and then converting it to a RAC database by adding redo logs and enabling cluster functionality. Post-clone tasks verified the successful conversion to a RAC database and started required processes.
The document provides examples of using Oracle Clusterware commands to create and manage application resources in a cluster. It describes how to create an application profile using crs_profile, register it with crs_register, start the application with crs_start, and other tasks. Guidelines are provided for writing action scripts and using attributes, placement policies, and required and optional resources.
A coordination service like Zookeeper helps distributed applications coordinate by providing common services like synchronization, configuration sharing, naming, and leader election. Zookeeper uses an ensemble of servers running as a cluster. It stores data in a hierarchical namespace of znodes. Clients can read and write znodes, set watches on znodes to get notified of changes, and rely on Zookeeper to handle session and server failures in a transparent way. Some common usage recipes for Zookeeper include barriers for synchronization, cluster management using ephemeral znodes, queues using sequential znodes, locks for mutual exclusion, and leader election.
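As a concrete illustration of the ephemeral-znode membership recipe mentioned above, here is a minimal sketch against the ZooKeeper Java client; the ensemble address, paths and timeout are placeholders, and error handling is omitted.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Minimal sketch of the ephemeral-znode cluster-membership recipe.
public class ZkMembershipSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 10_000,
                event -> System.out.println("session event: " + event.getState()));

        // Parent node for the group (assumes it does not exist yet).
        zk.create("/members", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Ephemeral child: removed automatically when this client's session ends,
        // so the group's children always reflect the live members.
        zk.create("/members/node-1", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Watch the member list; the watcher fires once when the children change.
        System.out.println("members: " +
                zk.getChildren("/members", event -> System.out.println("membership changed")));

        zk.close();
    }
}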
The document defines several key terms related to Oracle Real Application Clusters (RAC) and high availability:
- Network Time Protocol (NTP) ensures accurate synchronization of computer clock times in a network.
- A node is a machine where an Oracle RAC instance resides.
- Oracle Clusterware manages cluster database processing including node membership and resource management.
- The Oracle Cluster Registry (OCR) manages configuration information for the RAC cluster.
An introduction to_rac_system_test_planning_methods by Ajith Narayanan
This document provides an overview and agenda for testing an Oracle Real Application Clusters (RAC) system. It outlines 10 tests to validate that the RAC system is installed and configured correctly, and to verify basic functionality and the system's ability to achieve high availability and performance objectives. The tests include planned node reboots, unplanned node failures, instance failures, and network failures. Metrics like failover time, recovery time, and downtime are proposed to measure success.
This document provides instructions for setting up Apache Kafka and Spark Streaming to process streaming data from Kafka with Spark. It describes how to install Zookeeper and Kafka, create a Kafka topic, produce and consume messages, and run the KafkaWordCount Spark Streaming example application to perform word count on the streaming data from Kafka. It also explains the different processing semantics supported by Spark Streaming for Kafka integration.
This document discusses Apache Spark and Cassandra. It provides an overview of Cassandra as a shared-nothing, masterless, peer-to-peer database with great scaling. It then discusses how Spark can be used to analyze large amounts of data stored in Cassandra in parallel across a cluster. The Spark Cassandra connector allows Spark to create partitions that align with the token ranges in Cassandra, enabling efficient distributed queries across the cluster.
Over the years, Core Data has gained a pretty bad reputation amongst developers who prefer to use another service like Realm for their local persistence. In this talk I will make an argument for using Core Data and why it's not so bad. I will share some examples of where it's easy to go wrong with Core Data, and how to avoid those pitfalls. I will also quickly go over setting up Core Data in an app and by the end, the audience should have a couple of simple rules that should help them safely integrate Core Data in their apps.
This document provides instructions for replicating data from an Oracle multitenant container database (CDB) to another CDB using Oracle GoldenGate. It outlines prerequisites, tasks to prepare the databases and environment, and steps for initial load and ongoing replication of data changes in near real-time. Key steps include creating GoldenGate users, adding supplemental logging, configuring Extract and Replicat processes, and monitoring replication status. The goal is to familiarize the reader with setting up a basic Oracle to Oracle replication setup using GoldenGate in a multitenant environment.
Apache Cassandra, part 3 – machinery, work with Cassandra by Andrey Lomakin
The aim of this presentation is to provide enough information for an enterprise architect to decide whether Cassandra will be the project's data store. The presentation describes each nuance of Cassandra's architecture and ways to design data and work with it.
The document discusses using R to analyze and visualize Oracle database metrics and statistics in real time. It provides examples of R code to connect to an Oracle database and retrieve system statistics and wait event data. The code then computes changes from the previous snapshot and graphs metrics over time, including system statistics by interval, wait times and events, and wait class distributions. It also describes splitting the screen into multiple graphs to show various views of the real-time data. The goal is to build interactive dashboards to monitor database performance using R.
This document provides a summary of the SRVCTL commands, which are used to administer Oracle cluster databases. SRVCTL allows users to add, configure, start, stop, relocate and get the status of cluster database instances, services and other resources. The summary describes the main SRVCTL commands and their usage, objects they apply to, and provides examples of common commands and options.
HBaseCon 2013: A Developer’s Guide to Coprocessors by Cloudera, Inc.
This document discusses coprocessors in HBase, which allow arbitrary code to run on each region server. It provides examples of using coprocessors for observers that react to events and endpoints that clients can explicitly call. The examples include expanding single-row JSON data into multiple columns, collecting real-time analytics, and optimizing searches through endpoints.
Oracle real application clusters system tests with demo by Ajith Narayanan
This document provides details on testing Oracle Real Application Clusters functionality through a series of tests. It begins with an introduction and agenda, then describes 10 tests to validate high availability features including planned and unplanned node/instance failures, network failures, and service failover. Expected results and measures of success are outlined for each test. Sample scripts are also provided.
The document summarizes new data access features in .NET 2.0, including enhancements to the DataSet class for improved performance, loading data using DataReaders, asynchronous command execution, SQL cache dependency, and support for multiple active result sets and generic collections like List<T> and Dictionary<T,F>.
Apache Cassandra in Bangalore - Cassandra Internals and Performance by aaronmorton
Cassandra internals and performance were presented. The key points covered include:
1) Cassandra has a layered architecture with APIs, a Dynamo layer, and a database layer. The Dynamo layer implements the Dynamo paper and handles replication and failure handling.
2) The database layer includes the memtable, SSTables, commit log and more. It handles writes, flushes, compactions and reads from storage.
3) A number of performance tests were shown measuring the impact of configuration parameters like memtable flush queue size, commit log sync period, and secondary indexes on write and read latency. Bloom filters, compactions and concurrency were also discussed.
Floating on a RAFT: HBase Durability with Apache Ratis by DataWorks Summit
In a world with a myriad of distributed storage systems to choose from, the majority of Apache HBase clusters still rely on Apache HDFS. Theoretically, any distributed file system could be used by HBase. One major reason HDFS is predominantly used is the specific durability requirement of HBase's write-ahead log (WAL), which HDFS provides correctly. However, HBase's use of HDFS for WALs can be replaced with sufficient effort.
This talk will cover the design of a "Log Service" which can be embedded inside of HBase and provides the level of durability that HBase requires for WALs. Apache Ratis (incubating) is a library implementation of the RAFT consensus protocol in Java and is used to build this Log Service. We will cover the design choices of the Ratis Log Service, comparing and contrasting it to other log-based systems that exist today. Next, we'll cover how the Log Service "fits" into HBase and the necessary changes to HBase which enable this. Finally, we'll discuss how the Log Service can simplify the operational burden of HBase.
This chapter discusses Spark Streaming and provides an overview of its key concepts. It describes the architecture and abstractions in Spark Streaming including transformations on data streams. It also covers input sources, output operations, fault tolerance mechanisms, and performance considerations for Spark Streaming applications. The chapter concludes by noting how knowledge from Spark can be applied to streaming and real-time applications.
This document discusses various techniques for optimizing website performance, including:
1. Network optimizations like compression, HTTP caching, and keeping connections alive.
2. Structuring content efficiently and using tools like YSlow to measure performance.
3. Application caching of pages, database queries, and other frequently accessed content.
4. Database tuning through indexing, query optimization, and offloading text searches.
5. Monitoring resource usage and business metrics to ensure performance meets targets.
Recipes for Running Spark Streaming Applications in Production (Tathagata Das…) by Spark Summit
This document summarizes key aspects of running Spark Streaming applications in production, including fault tolerance, performance, and monitoring. It discusses how Spark Streaming receives data streams in batches and processes them across executors. It describes how driver and executor failures can be handled through checkpointing saved DAG information and write ahead logs that replicate received data blocks. Restarting the driver from checkpoints allows recovering the application state.
NoSql day 2019 - Floating on a Raft - Apache HBase durability with Apache Ratis by Ankit Singhal
In a world with a myriad of distributed storage systems to choose from, the majority of Apache HBase clusters still rely on Apache HDFS. Theoretically, any distributed file system could be used by HBase. One major reason HDFS is predominantly used is the specific durability requirement of HBase's write-ahead log (WAL), which HDFS provides correctly. However, HBase's use of HDFS for WALs can be replaced with sufficient effort.
This talk will cover the design of a "Log Service" which can be embedded inside of HBase and provides the level of durability that HBase requires for WALs. Apache Ratis (incubating) is a library implementation of the RAFT consensus protocol in Java and is used to build this Log Service. It covers the design choices of the Ratis Log Service, comparing and contrasting it to other log-based systems that exist today. Next, it covers how the Log Service "fits" into HBase and the necessary changes to HBase which enable this. Finally, it discusses how the Log Service can simplify the operational burden of HBase.
Replication in MongoDB allows for high availability and scaling of reads. A replica set consists of at least three mongod servers, with one primary and one or more secondaries that replicate from the primary. The primary applies all write operations to its oplog, which is then replicated to the secondaries. If the primary fails, a new primary is elected from the remaining secondaries. Administrative commands help monitor and manage the replica set configuration.
The document discusses Coordinated Restore at Checkpoint (CRaC), a feature of the Java Virtual Machine (JVM) that allows saving the state of a running application and restoring it later to avoid JVM startup overhead. CRaC uses the CRIU userspace checkpoint/restore mechanism and provides a simple API for applications to register resources that need to be notified during checkpoint and restore. This allows restoring application state like open files and sockets. An example demonstrates how CRaC can speed up subsequent runs of an application by restoring a pre-filled cache from a previous checkpoint.
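For illustration, a minimal sketch of the checkpoint/restore callback API described above, using the org.crac package; the cache scenario and every name other than the API itself are hypothetical.

import org.crac.Context;
import org.crac.Core;
import org.crac.Resource;

// Hypothetical cache that releases its connections before a checkpoint
// and re-establishes them after the process is restored.
public class CheckpointAwareCache implements Resource {
    @Override
    public void beforeCheckpoint(Context<? extends Resource> context) throws Exception {
        // Release state that must not be captured in the image (sockets, open files).
        System.out.println("closing connections before checkpoint");
    }

    @Override
    public void afterRestore(Context<? extends Resource> context) throws Exception {
        // Re-establish connections and warm the cache after restore.
        System.out.println("reconnecting after restore");
    }

    public static void main(String[] args) throws Exception {
        // Registration keeps the resource reachable so it receives notifications.
        CheckpointAwareCache cache = new CheckpointAwareCache();
        Core.getGlobalContext().register(cache);
        Thread.sleep(60_000); // run; the checkpoint is triggered externally (e.g. via jcmd)
    }
}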
The document describes the architecture of Cassandra, including its startup process, layers (API, Dynamo, database), services (StorageProxy, MessagingService, Gossip), and failure handling. It also details how nodes communicate via the Gossip protocol to exchange state information and maintain a consistent cluster view.
DataStax: Backup and Restore in Cassandra and OpsCenter by DataStax Academy
Backup and restore in Cassandra and OpsCenter covers a range of topics. I will start with a basic overview of Cassandra backup/restore, walking through the operational steps to provide the understanding required to perform an on-disk backup and restore. Expanding on this overview, I'll cover the limitations (including schema requirements) and their impact on the restore process. Further, I'll discuss commit log archiving and point-in-time restore operations. After covering the underlying operations, I'll wrap up with a discussion of how OpsCenter automates this process and leverages S3.
This document discusses parameters for tuning the performance of WebLogic servers. It covers OS-level TCP parameters, JVM heap size and GC logging parameters, WebLogic server-level parameters like work managers, execute queues, and stuck threads, and JDBC and JMS pool parameters. It also provides an overview of different types of garbage collection in the HotSpot JVM.
Container orchestration from theory to practice by Docker, Inc.
"Join Laura Frank and Stephen Day as they explain and examine technical concepts behind container orchestration systems, like distributed consensus, object models, and node topology. These concepts build the foundation of every modern orchestration system, and each technical explanation will be illustrated using SwarmKit and Kubernetes as a real-world example. Gain a deeper understanding of how orchestration systems work in practice and walk away with more insights into your production applications."
Technical Overview of Apache Drill by Jacques Nadeau - MapR Technologies
This document provides a technical overview of Apache Drill, including:
1) The basic query processing workflow involving Drillbits, distributed caching, and query execution.
2) The core modules within each Drillbit, including the SQL parser, optimizer, storage engines, and execution components.
3) How queries progress from SQL to logical and physical plans to distributed execution plans.
4) Technologies used include Java, Netty, Zookeeper, Parquet, and others.
"In this session, Twitter engineer Alex Payne will explore how the popular social messaging service builds scalable, distributed systems in the Scala programming language. Since 2008, Twitter has moved the development of its most critical systems to Scala, which blends object-oriented and functional programming with the power, robust tooling, and vast library support of the Java Virtual Machine. Find out how to use the Scala components that Twitter has open sourced, and learn the patterns they employ for developing core infrastructure components in this exciting and increasingly popular language."
Copper: A high performance workflow engine by dmoebius
COPPER (COmmon Persistable Process Execution Runtime) is an open-source, high-performance workflow engine that persists the workflow instance (process) state into a database. So there is no limit to the runtime of a process: it can run for weeks, months or years. In addition, this strategy leads to crash safety.
A workflow can describe business processes, for example, but any kind of use case is supported. The "modelling" language is Java, which has several advantages:
* with COPPER any Java developer is able to design workflows
* all Java developers like to use Java
* many Java libs can be integrated within COPPER
* many Java tools, like IDEs, can be used
* with COPPER your productivity will be increased when using a workflow engine
* using Java solutions will protect your investment
* COPPER is OpenSource under Apache Licence 2.0
Please visit copper-engine.org for details.
The document discusses various challenges faced and solutions implemented in developing and deploying a mobile-based online marketing project using J2EE technology. Key points:
1. Remote debugging of Tomcat was implemented to allow debugging the integrated project locally. NetBeans was configured for remote debugging.
2. Access logging in Tomcat was improved using the AccessLogValve to log request details for analysis. Extending this valve hides passwords in login requests.
3. Load balancing with Apache HTTP Server as the balancer was set up to improve performance, scalability and high availability across multiple Tomcat nodes with session replication.
Event Processing and Integration with IAS Data Processors by Invenire Aude
Quick introduction to IAS Data Processors. Transport modes, transport drivers (SHM, IBM WebSphere MQ, Files, net, http(s)).
Business logic implementation.
Transaction support.
Data processors can be configured to act as:
Data transformation nodes, using PASCAL-like script language,
Gateways and bridges (e.g. HTTP/JSON and Queues/XML),
SQL Database interfaces using the data mapping script extension.
You can configure and use the Data Processors as single threaded programs but you can define many logic implementations and run them in parallel as threads.
You can choose the transaction support from the three available modes: auto-commit, single phase (independent) commits, distributed two phase commit with XA when the supported coordination software is used.
And last but not least, one can find the Data Processors as a very helpful command line admin's tool.
This document summarizes a presentation about near real-time analytics platforms at Uber and LinkedIn. It discusses use cases for streaming analytics, challenges with scalability and operations, and new platforms developed using Apache Samza and SQL. Key points include how Samza is used to build streaming applications with SQL queries, operators, and support for multi-stage workflows. The platforms aim to simplify deployment and management of streaming jobs through interfaces like AthenaX.
Spark Streaming Recipes and "Exactly Once" Semantics Revised by Michael Spector
This document discusses stream processing with Apache Spark. It begins with an overview of Spark Streaming and its advantages over other frameworks like low latency and rich APIs. It then covers core Spark Streaming concepts like windowing and achieving "exactly once" semantics through checkpointing and write ahead logs. The document presents two examples of using Spark Streaming for analytics and aggregation with transactional and snapshotted approaches. It concludes with notes on deployment with Mesos/Marathon and performance tuning Spark Streaming jobs.
This document provides an overview of Postgres clustering solutions and distributed Postgres architectures. It discusses master-slave replication, Postgres-XC/XL, Greenplum, CitusDB, pg_shard, BDR, pg_logical, and challenges around distributed transactions, high availability, and multimaster replication. Key points include the tradeoffs of different approaches and an implementation of multimaster replication built on pg_logical and a timestamp-based distributed transaction manager (tsDTM) that provides partition tolerance and automatic failover.
Building Production Ready Search Pipelines with Spark and Milvus by Zilliz
Spark is a widely used ETL tool for processing, indexing and ingesting data into the serving stack for search. Milvus is a production-ready, open-source vector database. In this talk we will show how to use Spark to process unstructured data to extract vector representations, and push the vectors to the Milvus vector database for search serving.
10. INITIALIZE STORAGE SERVICE
Load ring state (unless told not to)
Start gossip & get initial ring info
Set tokens
Setup auth resources
Ensure gossip stabilized
11. STARTUP
Load config
Run preflight checks
Load schema
Clean up local temporary state
Recover CommitLog
Schedule background compactions
Initialize storage service
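Read together, slides 10 and 11 describe an ordered sequence, with the last startup step expanding into the storage-service initialization above. The sketch below is purely illustrative; the class and method names are invented, not the actual Cassandra entry points.

// Purely illustrative: mirrors the order of the startup steps on slides 10 and 11.
public class StartupSketch {
    public static void main(String[] args) {
        step("load config");
        step("run preflight checks");
        step("load schema");
        step("clean up local temporary state");
        step("recover commit log");            // replay mutations that never reached an SSTable
        step("schedule background compactions");
        initializeStorageService();            // expands into the steps from slide 10
    }

    static void initializeStorageService() {
        step("load ring state (unless told not to)");
        step("start gossip and get initial ring info");
        step("set tokens");
        step("set up auth resources");
        step("ensure gossip has stabilized");
    }

    static void step(String description) {
        System.out.println("startup: " + description);
    }
}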
15. MESSAGINGSERVICE
Pre-emptively drops messages when overwhelmed
Dropped if time at execution > send time + timeout
Timeout value dependent on message type
Most client-initiated requests can be dropped
(see MessagingService.DROPPABLE_VERBS)
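The drop rule on this slide is a simple time check: a message remembers when it was created, and an overloaded node discards it if it is already older than its verb's timeout by the time it finally gets executed. A toy sketch with illustrative names (not the actual MessagingService code):

import java.util.concurrent.TimeUnit;

// Illustrative sketch of "dropped if time at execution > send time + timeout".
public class DroppableMessageSketch {
    final long constructionTimeNanos = System.nanoTime();
    final long timeoutMillis;          // per-verb timeout, e.g. the read request timeout
    final boolean droppable;           // only some verbs may be dropped

    DroppableMessageSketch(long timeoutMillis, boolean droppable) {
        this.timeoutMillis = timeoutMillis;
        this.droppable = droppable;
    }

    boolean shouldDrop(long nowNanos) {
        long ageMillis = TimeUnit.NANOSECONDS.toMillis(nowNanos - constructionTimeNanos);
        return droppable && ageMillis > timeoutMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        DroppableMessageSketch read = new DroppableMessageSketch(50, true);
        Thread.sleep(100);
        // By the time an overloaded node would execute this request it has already
        // timed out for the client, so doing the work is wasted effort: drop it.
        System.out.println("drop? " + read.shouldDrop(System.nanoTime())); // true
    }
}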
16. GOSSIP
What it does do:
Disseminates members' state around the cluster
Versioned: generation (per JVM) & version (per value)
Heartbeats: incremented every gossip round
Application state:
Status
Tokens
Release & schema version
DC & Rack
Addresses
Data size
Health
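A rough model of the state being gossiped, using invented class names: one generation per process start, a heartbeat version bumped every gossip round, and a map of versioned application-state values like those listed above.

import java.util.EnumMap;
import java.util.Map;

// Illustrative model of gossiped endpoint state: generation changes per JVM start,
// version increases per value change, heartbeat bumps every gossip round.
public class EndpointStateSketch {
    enum ApplicationState { STATUS, TOKENS, RELEASE_VERSION, SCHEMA, DC, RACK, RPC_ADDRESS, LOAD, SEVERITY }

    record VersionedValue(String value, int version) {}

    final int generation;                         // e.g. seconds since epoch at process start
    int heartbeatVersion = 0;
    final Map<ApplicationState, VersionedValue> applicationState = new EnumMap<>(ApplicationState.class);

    EndpointStateSketch(int generation) { this.generation = generation; }

    void updateHeartbeat() { heartbeatVersion++; }   // once per gossip round

    void set(ApplicationState key, String value, int version) {
        applicationState.put(key, new VersionedValue(value, version));
    }

    public static void main(String[] args) {
        EndpointStateSketch node = new EndpointStateSketch((int) (System.currentTimeMillis() / 1000));
        node.set(ApplicationState.STATUS, "NORMAL", 1);
        node.set(ApplicationState.DC, "dc1", 2);
        node.updateHeartbeat();
        System.out.println(node.applicationState);
    }
}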
17. GOSSIP
What it doesn't do:
Notify about up or down nodes
Propagate schema
Transmit data files
Distribute mutations
23. COORDINATION
Client request arrives at coordinator:
Transformed into actionable command(s):
IReadCommand
IMutation
Coordinator distributes execution around the cluster
Replicas perform commands and respond to coordinator
Gather responses and determine client response
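A simplified sketch of that coordination loop, with invented names (this is not the StorageProxy code): turn the request into a command, fan it out to the replicas, and wait until enough responses have arrived to satisfy the consistency level.

import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative coordinator flow: fan a command out and block for enough responses.
public class CoordinationSketch {
    interface Replica { String execute(String command) throws Exception; }

    static String coordinate(String command, List<Replica> replicas, int required) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(replicas.size());
        CompletionService<String> responses = new ExecutorCompletionService<>(pool);
        try {
            for (Replica replica : replicas)
                responses.submit(() -> replica.execute(command));        // distribute execution
            String result = null;
            for (int i = 0; i < required; i++) {                         // gather CL responses
                Future<String> response = responses.poll(2, TimeUnit.SECONDS);
                if (response == null) throw new TimeoutException("not enough replicas responded");
                result = response.get();
            }
            return result;                                               // becomes the client response
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        List<Replica> replicas = List.of(c -> "row@replica1", c -> "row@replica2", c -> "row@replica3");
        System.out.println(coordinate("read key=42", replicas, 2));      // e.g. QUORUM with RF=3
    }
}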
31. HINTS
Nodes can be down
Writes may timeout
In which case we may hint
Enabled/disabled globally or enabled per-DC
Writing a hint counts towards ConsistencyLevel.ANY
Deliver hints when a node comes back up & periodically
Too many hints in progress for a replica means we bail early
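The hinting decision sketched below is illustrative only: hint when the replica is down (or the write timed out), hinting is enabled for that DC, and the node is not already drowning in pending hints for that replica.

// Illustrative decision: store a hint locally (counting toward CL.ANY only) when the
// replica cannot take the write now, unless hinting is disabled or overloaded.
public class HintDecisionSketch {
    static boolean shouldHint(boolean replicaAvailable, boolean hintsEnabledForDc,
                              int hintsInProgressForReplica, int maxHintsInProgress) {
        if (replicaAvailable) return false;                     // normal write path, no hint needed
        if (!hintsEnabledForDc) return false;                   // disabled globally or for this DC
        return hintsInProgressForReplica < maxHintsInProgress;  // too many pending hints: bail early
    }

    public static void main(String[] args) {
        System.out.println(shouldHint(false, true, 10, 1024));   // true: write a hint
        System.out.println(shouldHint(false, true, 5000, 1024)); // false: overloaded, fail instead
    }
}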
32. LOGGED BATCHES
Determine point of failure by WriteType
org.apache.cassandra.service.StorageProxy
public static void mutateAtomically(Collection<Mutation> mutations,
ConsistencyLevel consistency_level)
CommitLog for batches
Guarantee eventual success of batched statements
Strives to distribute across racks in the local DC
On success, cleanup log entries asynchronously
Failed batches replayed by the nodes holding the logs
WriteType.BATCH_LOG
WriteType.BATCH
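A rough outline of the flow behind that method, with invented helper names rather than the real StorageProxy internals: persist the batch to the batch log on other nodes first, apply the mutations, then clear the log entry asynchronously once they all succeed.

import java.util.Collection;
import java.util.List;
import java.util.UUID;

// Illustrative outline of the logged-batch flow.
public class LoggedBatchSketch {
    record Mutation(String key, String change) {}

    static void mutateAtomically(Collection<Mutation> mutations) {
        UUID batchId = UUID.randomUUID();
        List<String> batchlogReplicas = pickBatchlogReplicas();   // prefer distinct racks in the local DC
        writeBatchlog(batchId, mutations, batchlogReplicas);      // a failure here maps to WriteType.BATCH_LOG
        applyMutations(mutations);                                // a failure here maps to WriteType.BATCH
        removeBatchlogAsync(batchId, batchlogReplicas);           // cleanup once everything succeeded
        // If the coordinator dies mid-way, the nodes holding the batch log replay it later.
    }

    static List<String> pickBatchlogReplicas() { return List.of("rack1-node", "rack2-node"); }
    static void writeBatchlog(UUID id, Collection<Mutation> m, List<String> replicas) {}
    static void applyMutations(Collection<Mutation> m) {}
    static void removeBatchlogAsync(UUID id, List<String> replicas) {}

    public static void main(String[] args) {
        mutateAtomically(List.of(new Mutation("user:1", "name=aaron"), new Mutation("user_by_name:aaron", "id=1")));
    }
}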
35. READ REPAIR DECISION
Apply filter to sorted list of all live replicas
NONE: closest n replicas required by CL
GLOBAL: all live replicas
DC_LOCAL: all local replicas
Add closest n remotes needed to satisfy CL
Default Global Chance: 0.0
Default Local Chance: 0.1
Gives us a list of replicas to send read requests to
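That selection can be read as a filter over the proximity-sorted live replicas; here is a sketch with invented names, not the actual selection code.

import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative read-repair target selection: apply the decision, then top up to the CL.
public class ReadRepairDecisionSketch {
    enum Decision { NONE, GLOBAL, DC_LOCAL }

    static List<String> targets(List<String> sortedLiveReplicas, List<String> localDcReplicas,
                                int requiredByCl, Decision decision) {
        Set<String> picked = new LinkedHashSet<>();
        switch (decision) {
            case NONE     -> picked.addAll(sortedLiveReplicas.subList(0, requiredByCl)); // closest n only
            case GLOBAL   -> picked.addAll(sortedLiveReplicas);                           // every live replica
            case DC_LOCAL -> picked.addAll(localDcReplicas);                              // all local replicas
        }
        // Add the closest remaining replicas still needed to satisfy the consistency level.
        for (String replica : sortedLiveReplicas) {
            if (picked.size() >= requiredByCl) break;
            picked.add(replica);
        }
        return new ArrayList<>(picked);
    }

    public static void main(String[] args) {
        List<String> live = List.of("dc1-a", "dc1-b", "dc2-a", "dc2-b");
        System.out.println(targets(live, List.of("dc1-a", "dc1-b"), 3, Decision.DC_LOCAL));
        // -> [dc1-a, dc1-b, dc2-a]: local replicas plus one remote to reach the CL
    }
}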
37. LIGHTS, CAMERA, EXECUTION
Fire off each command using read executor
Requests are sent via MessagingService
Closest replica(s) sent full data requests
Others get digest requests
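A compact sketch of that dispatch, with illustrative names: the closest replica(s) get full data reads, the remaining targets get digest reads, which are cheap to ship and compare.

import java.util.List;
import java.util.stream.IntStream;

// Illustrative dispatch: one (or a few) full-data reads to the closest replicas,
// digest-only reads to the rest of the targets.
public class ReadExecutorSketch {
    enum RequestType { DATA, DIGEST }

    record ReadRequest(String replica, RequestType type) {}

    static List<ReadRequest> buildRequests(List<String> targetsSortedByProximity, int dataRequests) {
        return IntStream.range(0, targetsSortedByProximity.size())
                .mapToObj(i -> new ReadRequest(targetsSortedByProximity.get(i),
                        i < dataRequests ? RequestType.DATA : RequestType.DIGEST))
                .toList();
    }

    public static void main(String[] args) {
        System.out.println(buildRequests(List.of("dc1-a", "dc1-b", "dc2-a"), 1));
        // -> dc1-a gets a DATA request, dc1-b and dc2-a get DIGEST requests
    }
}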
40. FOREGROUND READ REPAIR
All data requests, no digests
Includes replicas contacted initially
Effectively ConsistencyLevel.ALL
Specialized resolver: RowDataResolver
Retry any short reads
May also perform background Read Repair
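A sketch of what the foreground path amounts to, with invented names (not the actual RowDataResolver): re-read full data from every contacted replica, keep the newest cell per column, and send the winners back to any replica that returned stale data.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative foreground read repair resolution: newest timestamp per column wins.
public class ForegroundReadRepairSketch {
    record Cell(String value, long timestamp) {}

    static Map<String, Cell> resolve(List<Map<String, Cell>> fullDataResponses) {
        Map<String, Cell> merged = new HashMap<>();
        for (Map<String, Cell> response : fullDataResponses)
            response.forEach((column, cell) ->
                    merged.merge(column, cell, (a, b) -> a.timestamp() >= b.timestamp() ? a : b));
        return merged;   // newest cell per column wins
    }

    public static void main(String[] args) {
        Map<String, Cell> replica1 = Map.of("name", new Cell("aaron", 10), "city", new Cell("wellington", 5));
        Map<String, Cell> replica2 = Map.of("name", new Cell("aaron", 10), "city", new Cell("auckland", 9));
        System.out.println(resolve(List.of(replica1, replica2)));
        // city resolves to auckland (timestamp 9 > 5); a real resolver would now send a
        // repair mutation for the stale "city" cell back to replica1.
    }
}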