Cassandra by example - the path of read and
write requests

Abstract

This article describes how Cassandra handles and processes requests. It should give you a better
understanding of Cassandra's internals and architecture. The path of a single read request as well as
the path of a single write request is described in detail. The description is based on a single data
center Cassandra V1.1.4 cluster (default store configuration).


Example data model

Please note that this article is not an introduction to the Cassandra data model. The examples below
use a column family named hotel. In short, a column family is analogous to a table in a relational
database. Each hotel record or row is identified by a unique key. The columns of a hotel
row include the hotel name as well as the category of the hotel.

The column family hotel lives inside the keyspace book_a_hotel. A keyspace is roughly analogous to
a tablespace or database.
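
As a rough mental model only, the keyspace, column family, row and column levels can be pictured as nested sorted maps. The sketch below is an illustration under that assumption and not Cassandra's internal representation. The class and method names are hypothetical; the row key and column values are taken from the examples in this article.

```java
import java.util.Map;
import java.util.TreeMap;

final class HotelDataModelSketch {

    // Builds the content of a keyspace such as book_a_hotel:
    // column family name -> row key -> (column name -> column value)
    static Map<String, Map<String, Map<String, String>>> buildKeyspace() {
        Map<String, String> columns = new TreeMap<>();
        columns.put("name", "Pavillon Nation");
        columns.put("category", "4");

        Map<String, Map<String, String>> hotel = new TreeMap<>();
        hotel.put("26813445", columns);

        Map<String, Map<String, Map<String, String>>> keyspace = new TreeMap<>();
        keyspace.put("hotel", hotel);
        return keyspace;
    }

    // Looks up a single column value, or returns null if any level is missing.
    static String readColumn(Map<String, Map<String, Map<String, String>>> keyspace,
                             String columnFamily, String rowKey, String columnName) {
        Map<String, Map<String, String>> rows = keyspace.get(columnFamily);
        if (rows == null) {
            return null;
        }
        Map<String, String> columns = rows.get(rowKey);
        return columns == null ? null : columns.get(columnName);
    }
}
```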




Thrift

The common way to access Cassandra is using Thrift. Thrift is a language-independent RPC protocol
originally developed at Facebook and contributed to Apache. Although Thrift is supported by
most popular programming languages, the Cassandra project suggests using higher-level
Cassandra clients such as Hector or Astyanax instead of the raw Thrift-based API. In general these
high-level clients try to hide the underlying middleware protocol.



Gregor Roth           Cassandra by example - the path of read and write requests                             1
The listing below shows a simple query using the Hector client library V1.1.
// [1] prepare the client (cluster)
Cluster cluster = HFactory.getOrCreateCluster("TestClstr", "172.39.126.14, 172.39.126.93, 172.39.126.52");
Keyspace keyspaceOperator = HFactory.createKeyspace("book_a_hotel", cluster);


// [2] create the query (fetching the column category)
SliceQuery<String, String, String> query = HFactory.createSliceQuery(keyspaceOperator,
AsciiSerializer.get(), StringSerializer.get(), StringSerializer.get());
query.setColumnFamily("hotel");
query.setKey("26813445");
query.setColumnNames("category");


// [3] perform the request
QueryResult<ColumnSlice<String, String>> result = query.execute();
ColumnSlice<String, String> row = result.get();
String category = row.getColumnByName("category").getValue();
//...


// [4] release the client (cluster)
cluster.getConnectionManager().shutdown();


In step [1] of the listing a set of server IP addresses is passed when the Hector Cluster object is
created. Each server address identifies a single Cassandra node. A collection of independent Cassandra
nodes (the Cassandra cluster) represents the Cassandra database. Within this cluster all nodes are
peers. There is no master node or anything similar.

The client is free to connect to any Cassandra node to perform any request. In the listing above 3
addresses are configured. This does not mean that the Cassandra cluster consists of 3 nodes. It just
defines that the client will communicate with these nodes only.

The connected Cassandra node potentially plays two roles. In every case the connected node is the
coordinator node, which is responsible for handling the request. In addition, the connected
node is a replica node if it is responsible for storing a replica of the requested data.
For instance, the requested Pavillon Nation hotel record of the example above does not have to be
stored on the connected node. Often the coordinator node has to send sub-requests to other replica
nodes to be able to handle the request. As shown in the diagram below, the nodes 172.39.126.14,
172.39.126.93 and 172.39.126.52 would not be able to serve a Pavillon Nation query directly
without sending sub-requests to other nodes.




Please note that coordinator node and replica node are role descriptions of a Cassandra node
in the context of a particular read or write operation. Every Cassandra node can act as a coordinator
node as well as a replica node.

Hector uses a round-robin strategy to select the node to use. When executing the example query, Hector
first connects to one of the configured nodes. The connect request is handled on the server side by
the CassandraServer.
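
The round-robin selection can be pictured with the following minimal sketch. It is not Hector's actual implementation; the class RoundRobinHostSelector and its method next() are hypothetical names chosen for illustration, and the sketch ignores failed hosts and connection pooling.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

final class RoundRobinHostSelector {
    private final List<String> hosts;
    private final AtomicInteger counter = new AtomicInteger();

    RoundRobinHostSelector(List<String> hosts) {
        if (hosts.isEmpty()) {
            throw new IllegalArgumentException("at least one host is required");
        }
        this.hosts = new ArrayList<>(hosts);
    }

    // Returns the configured hosts one after another and starts over after the last one.
    String next() {
        // floorMod keeps the index valid even after the counter overflows
        int index = Math.floorMod(counter.getAndIncrement(), hosts.size());
        return hosts.get(index);
    }
}
```

With the three addresses of the listing above, successive calls of next() would return 172.39.126.14, 172.39.126.93, 172.39.126.52 and then 172.39.126.14 again.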




By default the CassandraServer is bound to server port 9160 during the start sequence of a
Cassandra node. The CassandraServer implements Cassandra's Thrift interface, which defines remote
procedure methods such as set_keyspace(…) or get_slice(…). This means Cassandra's Thrift interface is
implicitly stateful. The Hector client has to call the remote method set_keyspace(…) first to assign the
keyspace book_a_hotel to the current connection session. After the keyspace has been assigned,
get_slice(…) can be called to request the columns of the Pavillon Nation hotel.

However, you are not forced to use Thrift to access Cassandra. Several alternative open-source
connectors such as REST-based connectors exist.


Determining the replica nodes

The CassandraServer is responsible for handling the client-server communication only. Internally, the
CassandraServer calls the local StorageProxy class to process the request. The StorageProxy
implements the coordinator logic. The coordinator logic includes determining the replica nodes for
the requested row key as well as requesting these replica nodes.

By default a RandomPartitioner is used to determine the replica nodes for the row key of the
request. The RandomPartitioner spreads the data records (rows) evenly across the Cassandra nodes,
which are arranged in a circular ring. Within this ring each node is assigned a range of hash values
(tokens). To determine the first replica, the MD5 hash of the row key is calculated and the node
whose assigned token range contains the key hash is selected.


For instance the token of the Pavillon Nation's row key 26813445 is
91851936251452796391746312281860607309. This token is within the token range of node
172.39.126.86, which means that node 172.39.126.86 is responsible for storing a replica of the Pavillon
Nation record.
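
The token calculation and the ring lookup can be sketched as follows. This is a simplified illustration and not the RandomPartitioner source code. It assumes that the token is the MD5 hash of the key bytes interpreted as a non-negative integer, and that each node owns the range ending at its own token. The class TokenRingSketch and its methods are hypothetical names.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

final class TokenRingSketch {

    // Token of a row key: the MD5 hash interpreted as a non-negative integer.
    static BigInteger token(String rowKey) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] hash = md5.digest(rowKey.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(hash).abs();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    // The ring maps each node's token (assumed upper end of its range) to the node address.
    // The first node whose token is >= the key token owns the key; past the
    // highest token the lookup wraps around to the first node of the ring.
    static String primaryReplica(TreeMap<BigInteger, String> ring, BigInteger keyToken) {
        Map.Entry<BigInteger, String> owner = ring.ceilingEntry(keyToken);
        return owner != null ? owner.getValue() : ring.firstEntry().getValue();
    }
}
```

A sorted map is used here because finding the owning node is a "next greater or equal token" lookup, with a wrap-around at the end of the ring.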




In most cases a row is stored on more than one node, depending on the keyspace's replication
factor. For instance, a replication factor of 2 means that the next node clockwise on the ring also
stores a replica. With a replication factor of 3, the node after that stores a replica as well, and so forth.
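
The clockwise walk can be sketched in a few lines. The sketch assumes the simple placement described here, without any rack or data center awareness. The class ReplicaPlacementSketch and its parameters are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class ReplicaPlacementSketch {

    // ringOrder: node addresses in clockwise ring order.
    // primaryIndex: position of the node that owns the key's token.
    static List<String> replicas(List<String> ringOrder, int primaryIndex, int replicationFactor) {
        // a cluster cannot hold more replicas than it has nodes
        int count = Math.min(replicationFactor, ringOrder.size());
        List<String> result = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            // the modulo wraps around at the end of the ring
            result.add(ringOrder.get((primaryIndex + i) % ringOrder.size()));
        }
        return result;
    }
}
```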


Processing a read request

To handle a read request the StorageProxy (which is the coordinator of the request) determines the
replica nodes as described above. Additionally, the StorageProxy checks that enough replica nodes
are alive to handle the read request. If this is true, the replica nodes are sorted by proximity
(closest node first) and the first replica node is called to get the requested row data.

In contrast to the Thrift-based client-server communication, the Cassandra nodes exchange data by
using a message-oriented TCP-based protocol. This means the StorageProxy gets the requested
row data by using Cassandra's messaging protocol.

Whether other replica nodes are called depends on the consistency level. The consistency level is
specified by the client request. If consistency level ONE is required, no further replica nodes will be
called. If consistency level QUORUM is required, (replication_factor / 2) + 1 replica nodes (using
integer division) will be called in total.
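
The number of replica nodes to call can be computed as in the sketch below. Only the two consistency levels discussed here are covered. The class ConsistencyLevelSketch and the method requiredReplicas are hypothetical names and not part of Cassandra's API.

```java
final class ConsistencyLevelSketch {

    // Number of replica nodes that have to answer for the given consistency level.
    static int requiredReplicas(String consistencyLevel, int replicationFactor) {
        switch (consistencyLevel) {
            case "ONE":
                return 1;
            case "QUORUM":
                // integer division: a replication factor of 3 yields 2, a factor of 5 yields 3
                return (replicationFactor / 2) + 1;
            default:
                throw new IllegalArgumentException("not covered by this sketch: " + consistencyLevel);
        }
    }
}
```

Note that with a replication factor of 2 a quorum already requires both replicas.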

In contrast to the first full-data read call, all additional calls are digest calls. A digest call queries a
single MD5 hash of all column names, values and timestamps instead of requesting the complete row
data. The hashes of all calls, including the first one, are compared with each other. If a hash does not
match, the replicas are inconsistent and the out-of-date replicas will be auto-repaired during the
read process. To do this, a full-data read request is sent to the additional nodes, the most recent
version of the data is computed and the diff is sent to the out-of-date replicas.
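
The digest comparison can be illustrated with the following sketch. The byte layout that is fed into the hash is an assumption made for this example and differs from Cassandra's real serialization. The class DigestReadSketch and its members are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;

final class DigestReadSketch {

    static final class Column {
        final String name;
        final String value;
        final long timestamp;

        Column(String name, String value, long timestamp) {
            this.name = name;
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    // One MD5 hash over all column names, values and timestamps of a row.
    static byte[] digest(List<Column> row) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            for (Column column : row) {
                md5.update(column.name.getBytes(StandardCharsets.UTF_8));
                md5.update(column.value.getBytes(StandardCharsets.UTF_8));
                md5.update(Long.toString(column.timestamp).getBytes(StandardCharsets.UTF_8));
            }
            return md5.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    // Compares the hash of the full-data response with the digests of the other replicas.
    // A result of false means that a read repair would be necessary.
    static boolean replicasAgree(List<Column> fullDataRow, List<byte[]> digestsOfOtherReplicas) {
        byte[] expected = digest(fullDataRow);
        for (byte[] other : digestsOfOtherReplicas) {
            if (!Arrays.equals(expected, other)) {
                return false;
            }
        }
        return true;
    }
}
```

The point of the digest call is that only 16 bytes per additional replica have to cross the network as long as the replicas agree.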

Occasionally all replica nodes for the row key will be called independently of the requested consistency
level. This depends on the column family's read_repair_chance parameter. This configuration
parameter specifies the probability with which read repairs should be invoked. The default value of
0.1 means that a read repair is performed for 10% of the read requests. However, the client response
will always be answered according to the requested consistency level. The additional work is done in
the background. A read_repair_chance parameter larger than 0 ensures that frequently read data remains
consistent even though only consistency level ONE is required. The row becomes consistent eventually.
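
The probabilistic decision can be pictured as a simple random draw per read request. This is a sketch of the idea only; the class ReadRepairChanceSketch and the method shouldReadRepair are hypothetical names.

```java
import java.util.Random;

final class ReadRepairChanceSketch {

    // Decides for a single read request whether all replica nodes should be called.
    // nextDouble() returns a value in [0.0, 1.0), so a chance of 0.0 never
    // triggers a read repair and a chance of 1.0 always does.
    static boolean shouldReadRepair(double readRepairChance, Random random) {
        return random.nextDouble() < readRepairChance;
    }
}
```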


Performing the local data query

As already mentioned above, a dedicated messaging protocol is used for inter-node communication.
Like the CassandraServer, the MessagingService is started during the start sequence of a
Cassandra node. By default the MessagingService is bound to server port 7000.

The replica node receives the read call from the coordinator node through the replica node's
MessagingService. However, the MessagingService does not access the local store directly. To
read and write data locally, the ColumnFamilyStore has to be used. Roughly speaking, the
ColumnFamilyStore represents the underlying local store of a particular column family.




Please note that a coordinator node can also act as a replica node. This is the case if the client
calls node 172.39.126.52 to get the Mister bed city row instead of the Pavillon Nation row in the
example above. In this case the StorageProxy of the coordinator node will not call the
MessagingService of the same node. To avoid remote calls to the same node, the StorageProxy
calls the ColumnFamilyStore in the same way the MessagingService does to access local data.

When processing a query the ColumnFamilyStore tries to read the requested row data through the
row cache, if the row cache is activated for the column family. The row cache holds the entire row
and is deactivated by default. If the row cache contains the requested row data, no disk IO is
required. The query is served very fast by performing in-memory operations only. However, an
activated row cache means that the full row has to be fetched internally even though only a subset of
columns is requested. For this reason the row cache is often less efficient for large rows and small
subset queries.

If the requested row isn't cached, the Memtables and the SSTables (sorted strings tables) have to be
read. Memtables and SSTables are maintained per column family. SSTables are data files containing
row data fragments and only allow appending data. A Memtable is an in-memory table which buffers
writes. If the Memtable is full, it is written to disk as a new SSTable file in the background. For this
reason the columns of the requested Pavillon Nation row could be fragmented over several SSTables
and unflushed Memtables. For instance one SSTable book_a_hotel-hotel-he-1-Data.db could contain
the initially inserted columns 'name'='Pavillon Nation' and 'category'='4' of the Pavillon Nation row.
Another SSTable book_a_hotel-hotel-he-2-Data.db (or Memtable) could contain the updated
category column 'category'='5'.
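
How such fragments end up as one row can be sketched as a merge in which the column with the most recent timestamp wins. The sketch below is a simplification: it ignores deletions (tombstones), and the class FragmentMergeSketch as well as the timestamps used in any example are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class FragmentMergeSketch {

    static final class Cell {
        final String value;
        final long timestamp;

        Cell(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    // Each fragment maps column names to cells and stems from one SSTable or Memtable.
    // For every column name the cell with the highest timestamp is kept.
    static Map<String, Cell> merge(List<Map<String, Cell>> fragments) {
        Map<String, Cell> merged = new TreeMap<>();
        for (Map<String, Cell> fragment : fragments) {
            for (Map.Entry<String, Cell> entry : fragment.entrySet()) {
                Cell current = merged.get(entry.getKey());
                if (current == null || entry.getValue().timestamp > current.timestamp) {
                    merged.put(entry.getKey(), entry.getValue());
                }
            }
        }
        return merged;
    }
}
```

Merging the two fragments of the example would yield 'name'='Pavillon Nation' and 'category'='5', provided the update carries the later timestamp. The order in which the fragments are read does not matter.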




If an SSTable exists for the requested column family, first the associated (key-scoped) Bloom filter of
the SSTable file is consulted to avoid unnecessary disk IO. For each SSTable the ColumnFamilyStore
holds an in-memory structure called SSTableReader which contains metadata as well as the Bloom
filter of the underlying SSTable file. The Bloom filter indicates whether the SSTable could
contain a row data fragment (false positives are possible, false negatives are not). If this is the case,
the key cache is queried for the seek position. If it is not found there, the on-disk index has to be
scanned. In this case the fetched seek position is added to the key cache. Based on the seek
position the row data fragment is read from the SSTable file. The data fragments of the SSTables
and Memtables are merged by using the column timestamps and the requested row data
is returned to the caller.
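
The behaviour of a Bloom filter (false positives possible, false negatives not) can be demonstrated with a minimal sketch. It is not Cassandra's implementation: the hash functions are simple stand-ins and the class KeyBloomFilterSketch is a hypothetical name.

```java
import java.util.BitSet;

final class KeyBloomFilterSketch {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    KeyBloomFilterSketch(int size, int hashCount) {
        if (size <= 0 || hashCount <= 0) {
            throw new IllegalArgumentException("size and hashCount must be positive");
        }
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // i-th bit position of a key, derived from two base hashes (double hashing).
    private int position(String key, int i) {
        int h1 = key.hashCode();
        int h2 = new StringBuilder(key).reverse().toString().hashCode();
        return Math.floorMod(h1 + i * h2, size);
    }

    // Called for every row key that is written to the SSTable.
    void add(String key) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(position(key, i));
        }
    }

    // false: the key is definitely not in the SSTable, so the file can be skipped.
    // true: the key may be in the SSTable, so the key cache or index is consulted.
    boolean mightContain(String key) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }
}
```

Because add() only ever sets bits, a key that was added is always reported as possibly present. This is why a negative answer is safe for skipping an SSTable file.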




Processing a write request

To insert, update or delete a row, Cassandra's mutate method has to be called. The listing below
shows such a mutate call using the Hector client.

//...


// [1.b] create and perform an update
Mutator<String> mutator = HFactory.createMutator(keyspaceOperator, AsciiSerializer.get());
mutator.addInsertion("26813445", "hotel",
           HFactory.createColumn("category", "5", StringSerializer.get(), StringSerializer.get()));

MutationResult result = mutator.execute();

//...



The write path is very similar to the read path. Like a read request, a write request also
includes the required consistency level. However, the coordinator node tries to send a write request
including the mutated columns to all replica nodes for the row key.

First, the StorageProxy of the coordinator node checks whether enough replica nodes for the row key are
alive to satisfy the requested consistency level. If this is true, the write request is sent to the
living replica nodes. If not, an error response is returned. Write requests to temporarily failed
replica nodes are scheduled as a hinted handoff. This means that a hint is written locally
instead of calling the failed node. Once the failed replica node is back, the hint is sent to this node
to perform the write operation. By sending the hints the failed node becomes consistent with the
other nodes. Please note that hints are no longer stored locally if the failed node has been down for
longer than 1 hour (config param max_hint_window_in_ms).
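
The coordinator's decision can be sketched as follows: live replicas receive the write directly, failed replicas get a hint, and the request is rejected if too few replicas are alive. The sketch leaves out the hint window and the actual messaging. The class WritePlanSketch and its members are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class WritePlanSketch {
    final List<String> directWrites = new ArrayList<>();
    final List<String> hints = new ArrayList<>();

    // replicaAlive: replica address -> true if the node is currently alive.
    // requiredAcks: number of replicas demanded by the consistency level.
    static WritePlanSketch plan(Map<String, Boolean> replicaAlive, int requiredAcks) {
        WritePlanSketch plan = new WritePlanSketch();
        for (Map.Entry<String, Boolean> replica : replicaAlive.entrySet()) {
            if (replica.getValue()) {
                plan.directWrites.add(replica.getKey());
            } else {
                // a hint is stored locally and replayed once the node is back
                plan.hints.add(replica.getKey());
            }
        }
        if (plan.directWrites.size() < requiredAcks) {
            // hinted writes do not count towards the consistency level
            throw new IllegalStateException("not enough live replicas for the requested consistency level");
        }
        return plan;
    }
}
```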

The coordinator node returns the response to the client as soon as the number of replica nodes
required by the consistency level have confirmed the update (a hinted write does not count towards the
requested consistency level). The updates of the other replica nodes are still executed in the
background. If an error occurs while updating the replica nodes required by the consistency level, an
error response is returned. However, in this case the already updated nodes will not be
reverted. Cassandra does not support distributed transactions, and hence it does not support a
distributed rollback.

The write operation supports an additional consistency level ANY, which means that the mutated
columns have to be written to at least one node, regardless of whether this node is a replica node for
the key or not. In contrast to consistency level ONE, the write also succeeds if only a hinted handoff is
written (by the coordinator node). However, in this case the mutated columns will not be readable
until the responsible replica nodes have recovered.




Performing the local update

Like the local data query, a local update is triggered by handling a message through the
MessagingService or by the StorageProxy. However, in contrast to the read path, a commit log
entry is written first for durability reasons. By default the commit log entry is written
asynchronously in the background.

The mutated columns will also be written into the in-memory Memtable of the column family. After
inserting the changes the local update is completed.

However, the memory size of a Memtable is limited. If the max size is exceeded, the Memtable is
written to disk as a new SSTable. This is done by a background thread which periodically checks the
current size of all unflushed Memtables of all column families. If a Memtable exceeds the max
size, the background thread replaces the current Memtable with a new one. The old Memtable is
marked as pending flush and is flushed by another thread. Under certain circumstances several
pending Memtables could exist for a column family. After the Memtable has been written to disk, a new
SSTableReader referring to the written SSTable is created and added to the ColumnFamilyStore. Once
written, the SSTable file is immutable. By default the SSTable data is compressed
(SnappyCompression).
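
The order of the local write steps can be summarized in a small sketch. It makes several simplifying assumptions: the commit log is an in-memory list, the Memtable size is measured in entries rather than bytes, and the flush happens synchronously instead of in a background thread. The class LocalWriteSketch and its members are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class LocalWriteSketch {
    final List<String> commitLog = new ArrayList<>();
    final List<Map<String, String>> ssTables = new ArrayList<>();
    Map<String, String> memtable = new TreeMap<>();
    private final int maxMemtableEntries;

    LocalWriteSketch(int maxMemtableEntries) {
        this.maxMemtableEntries = maxMemtableEntries;
    }

    void write(String rowKey, String column, String value) {
        // 1. the commit log entry is written first (durability)
        commitLog.add(rowKey + ":" + column + "=" + value);
        // 2. the mutated column is written into the in-memory Memtable
        memtable.put(rowKey + ":" + column, value);
        // 3. a full Memtable is replaced and flushed as a new SSTable
        if (memtable.size() >= maxMemtableEntries) {
            flush();
        }
    }

    void flush() {
        // once written, an SSTable is immutable
        ssTables.add(Collections.unmodifiableMap(memtable));
        memtable = new TreeMap<>();
    }
}
```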

Compacting

The SSTable file includes the modified columns of the row including their timestamps as well as
additional row metadata. For instance the metadata section includes a (column name-scoped)
Bloom filter which is used to reduce disk IO when fetching columns by name.




To reduce fragmentation and save space, SSTable files are occasionally merged into a new SSTable
file. This compaction is triggered by a background thread if the compaction threshold
is exceeded. The compaction threshold can be set for each column family.
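
A compaction run can be pictured as follows. The sketch assumes that the threshold is a number of SSTable files and that a run starts once this number is reached; it merges all files at once and ignores deletions. The class CompactionSketch and its members are hypothetical names and do not reflect Cassandra's compaction strategies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class CompactionSketch {

    static final class Versioned {
        final String value;
        final long timestamp;

        Versioned(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    // Returns the list of SSTables after an optional compaction run.
    static List<Map<String, Versioned>> compactIfNeeded(List<Map<String, Versioned>> ssTables,
                                                        int threshold) {
        if (ssTables.size() < threshold) {
            return ssTables; // threshold not reached, nothing to do
        }
        Map<String, Versioned> merged = new TreeMap<>();
        for (Map<String, Versioned> ssTable : ssTables) {
            for (Map.Entry<String, Versioned> entry : ssTable.entrySet()) {
                // keep only the most recent version of each column
                merged.merge(entry.getKey(), entry.getValue(),
                        (a, b) -> a.timestamp >= b.timestamp ? a : b);
            }
        }
        List<Map<String, Versioned>> result = new ArrayList<>();
        result.add(merged);
        return result;
    }
}
```

The space saving comes from dropping outdated column versions: after the run each column is stored once, in a single file.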

About the author

Gregor Roth works as a software architect at United Internet group, a leading European Internet
Service Provider to which GMX, 1&1, and Web.de belong. His areas of interest include software and
system architecture, enterprise architecture management, distributed computing, and development
methodologies.




Gregor Roth         Cassandra by example - the path of read and write requests                      9

More Related Content

What's hot

Scylla Summit 2022: Scylla 5.0 New Features, Part 1
Scylla Summit 2022: Scylla 5.0 New Features, Part 1Scylla Summit 2022: Scylla 5.0 New Features, Part 1
Scylla Summit 2022: Scylla 5.0 New Features, Part 1
ScyllaDB
 
Storing time series data with Apache Cassandra
Storing time series data with Apache CassandraStoring time series data with Apache Cassandra
Storing time series data with Apache Cassandra
Patrick McFadin
 
What is NoSQL and CAP Theorem
What is NoSQL and CAP TheoremWhat is NoSQL and CAP Theorem
What is NoSQL and CAP Theorem
Rahul Jain
 
Cassandra presentation at NoSQL
Cassandra presentation at NoSQLCassandra presentation at NoSQL
Cassandra presentation at NoSQL
Evan Weaver
 
Internals of Presto Service
Internals of Presto ServiceInternals of Presto Service
Internals of Presto Service
Treasure Data, Inc.
 
kafka
kafkakafka
Deletes Without Tombstones or TTLs (Eric Stevens, ProtectWise) | Cassandra Su...
Deletes Without Tombstones or TTLs (Eric Stevens, ProtectWise) | Cassandra Su...Deletes Without Tombstones or TTLs (Eric Stevens, ProtectWise) | Cassandra Su...
Deletes Without Tombstones or TTLs (Eric Stevens, ProtectWise) | Cassandra Su...
DataStax
 
Cassandra concepts, patterns and anti-patterns
Cassandra concepts, patterns and anti-patternsCassandra concepts, patterns and anti-patterns
Cassandra concepts, patterns and anti-patterns
Dave Gardner
 
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion RecordsScylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
ScyllaDB
 
All about Zookeeper and ClickHouse Keeper.pdf
All about Zookeeper and ClickHouse Keeper.pdfAll about Zookeeper and ClickHouse Keeper.pdf
All about Zookeeper and ClickHouse Keeper.pdf
Altinity Ltd
 
Understanding How CQL3 Maps to Cassandra's Internal Data Structure
Understanding How CQL3 Maps to Cassandra's Internal Data StructureUnderstanding How CQL3 Maps to Cassandra's Internal Data Structure
Understanding How CQL3 Maps to Cassandra's Internal Data Structure
DataStax
 
HBase Advanced - Lars George
HBase Advanced - Lars GeorgeHBase Advanced - Lars George
HBase Advanced - Lars George
JAX London
 
ETL With Cassandra Streaming Bulk Loading
ETL With Cassandra Streaming Bulk LoadingETL With Cassandra Streaming Bulk Loading
ETL With Cassandra Streaming Bulk Loadingalex_araujo
 
Bucket your partitions wisely - Cassandra summit 2016
Bucket your partitions wisely - Cassandra summit 2016Bucket your partitions wisely - Cassandra summit 2016
Bucket your partitions wisely - Cassandra summit 2016
Markus Höfer
 
(STG402) Amazon EBS Deep Dive
(STG402) Amazon EBS Deep Dive(STG402) Amazon EBS Deep Dive
(STG402) Amazon EBS Deep Dive
Amazon Web Services
 
Tuning Apache Phoenix/HBase
Tuning Apache Phoenix/HBaseTuning Apache Phoenix/HBase
Tuning Apache Phoenix/HBase
Anil Gupta
 
Scylla Summit 2022: The Future of Consensus in ScyllaDB 5.0 and Beyond
Scylla Summit 2022: The Future of Consensus in ScyllaDB 5.0 and BeyondScylla Summit 2022: The Future of Consensus in ScyllaDB 5.0 and Beyond
Scylla Summit 2022: The Future of Consensus in ScyllaDB 5.0 and Beyond
ScyllaDB
 
From Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
From Message to Cluster: A Realworld Introduction to Kafka Capacity PlanningFrom Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
From Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
confluent
 
Scylla Summit 2022: Scylla 5.0 New Features, Part 2
Scylla Summit 2022: Scylla 5.0 New Features, Part 2Scylla Summit 2022: Scylla 5.0 New Features, Part 2
Scylla Summit 2022: Scylla 5.0 New Features, Part 2
ScyllaDB
 
Apache HBase Performance Tuning
Apache HBase Performance TuningApache HBase Performance Tuning
Apache HBase Performance Tuning
Lars Hofhansl
 

What's hot (20)

Scylla Summit 2022: Scylla 5.0 New Features, Part 1
Scylla Summit 2022: Scylla 5.0 New Features, Part 1Scylla Summit 2022: Scylla 5.0 New Features, Part 1
Scylla Summit 2022: Scylla 5.0 New Features, Part 1
 
Storing time series data with Apache Cassandra
Storing time series data with Apache CassandraStoring time series data with Apache Cassandra
Storing time series data with Apache Cassandra
 
What is NoSQL and CAP Theorem
What is NoSQL and CAP TheoremWhat is NoSQL and CAP Theorem
What is NoSQL and CAP Theorem
 
Cassandra presentation at NoSQL
Cassandra presentation at NoSQLCassandra presentation at NoSQL
Cassandra presentation at NoSQL
 
Internals of Presto Service
Internals of Presto ServiceInternals of Presto Service
Internals of Presto Service
 
kafka
kafkakafka
kafka
 
Deletes Without Tombstones or TTLs (Eric Stevens, ProtectWise) | Cassandra Su...
Deletes Without Tombstones or TTLs (Eric Stevens, ProtectWise) | Cassandra Su...Deletes Without Tombstones or TTLs (Eric Stevens, ProtectWise) | Cassandra Su...
Deletes Without Tombstones or TTLs (Eric Stevens, ProtectWise) | Cassandra Su...
 
Cassandra concepts, patterns and anti-patterns
Cassandra concepts, patterns and anti-patternsCassandra concepts, patterns and anti-patterns
Cassandra concepts, patterns and anti-patterns
 
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion RecordsScylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
 
All about Zookeeper and ClickHouse Keeper.pdf
All about Zookeeper and ClickHouse Keeper.pdfAll about Zookeeper and ClickHouse Keeper.pdf
All about Zookeeper and ClickHouse Keeper.pdf
 
Understanding How CQL3 Maps to Cassandra's Internal Data Structure
Understanding How CQL3 Maps to Cassandra's Internal Data StructureUnderstanding How CQL3 Maps to Cassandra's Internal Data Structure
Understanding How CQL3 Maps to Cassandra's Internal Data Structure
 
HBase Advanced - Lars George
HBase Advanced - Lars GeorgeHBase Advanced - Lars George
HBase Advanced - Lars George
 
ETL With Cassandra Streaming Bulk Loading
ETL With Cassandra Streaming Bulk LoadingETL With Cassandra Streaming Bulk Loading
ETL With Cassandra Streaming Bulk Loading
 
Bucket your partitions wisely - Cassandra summit 2016
Bucket your partitions wisely - Cassandra summit 2016Bucket your partitions wisely - Cassandra summit 2016
Bucket your partitions wisely - Cassandra summit 2016
 
(STG402) Amazon EBS Deep Dive
(STG402) Amazon EBS Deep Dive(STG402) Amazon EBS Deep Dive
(STG402) Amazon EBS Deep Dive
 
Tuning Apache Phoenix/HBase
Tuning Apache Phoenix/HBaseTuning Apache Phoenix/HBase
Tuning Apache Phoenix/HBase
 
Scylla Summit 2022: The Future of Consensus in ScyllaDB 5.0 and Beyond
Scylla Summit 2022: The Future of Consensus in ScyllaDB 5.0 and BeyondScylla Summit 2022: The Future of Consensus in ScyllaDB 5.0 and Beyond
Scylla Summit 2022: The Future of Consensus in ScyllaDB 5.0 and Beyond
 
From Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
From Message to Cluster: A Realworld Introduction to Kafka Capacity PlanningFrom Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
From Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
 
Scylla Summit 2022: Scylla 5.0 New Features, Part 2
Scylla Summit 2022: Scylla 5.0 New Features, Part 2Scylla Summit 2022: Scylla 5.0 New Features, Part 2
Scylla Summit 2022: Scylla 5.0 New Features, Part 2
 
Apache HBase Performance Tuning
Apache HBase Performance TuningApache HBase Performance Tuning
Apache HBase Performance Tuning
 

Viewers also liked

Cassandra 2.1 boot camp, Read/Write path
Cassandra 2.1 boot camp, Read/Write pathCassandra 2.1 boot camp, Read/Write path
Cassandra 2.1 boot camp, Read/Write path
Joshua McKenzie
 
Migrating Netflix from Datacenter Oracle to Global Cassandra
Migrating Netflix from Datacenter Oracle to Global CassandraMigrating Netflix from Datacenter Oracle to Global Cassandra
Migrating Netflix from Datacenter Oracle to Global Cassandra
Adrian Cockcroft
 
An Overview of Apache Cassandra
An Overview of Apache CassandraAn Overview of Apache Cassandra
An Overview of Apache Cassandra
DataStax
 
Cassandra @Formspring
Cassandra @FormspringCassandra @Formspring
Cassandra @Formspringmartincozzi
 
Hadoop and Cassandra at Rackspace
Hadoop and Cassandra at RackspaceHadoop and Cassandra at Rackspace
Hadoop and Cassandra at RackspaceStu Hood
 
What Every Developer Should Know About Database Scalability
What Every Developer Should Know About Database ScalabilityWhat Every Developer Should Know About Database Scalability
What Every Developer Should Know About Database Scalability
jbellis
 
From 100s to 100s of Millions
From 100s to 100s of MillionsFrom 100s to 100s of Millions
From 100s to 100s of Millions
Erik Onnen
 
Camunda and Apache Cassandra
Camunda and Apache CassandraCamunda and Apache Cassandra
Camunda and Apache Cassandra
camunda services GmbH
 
BI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache CassandraBI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache Cassandra
Victor Coustenoble
 
Management Consulting
Management ConsultingManagement Consulting
Management Consulting
Alexandros Chatzopoulos
 
Reverse Engineering
Reverse EngineeringReverse Engineering
Reverse Engineeringdswanson
 
Chess
ChessChess
Chess
Chuck Vohs
 
Lionel Messi
Lionel MessiLionel Messi
Lionel Messi
NaliKardan
 
Lionel messi
Lionel messiLionel messi
Lionel messi
Dipanker Singh
 
Cassandra Internals: The Read Path (Tyler Hobbs, DataStax) | Cassandra Summit...
Cassandra Internals: The Read Path (Tyler Hobbs, DataStax) | Cassandra Summit...Cassandra Internals: The Read Path (Tyler Hobbs, DataStax) | Cassandra Summit...
Cassandra Internals: The Read Path (Tyler Hobbs, DataStax) | Cassandra Summit...
DataStax
 
Jeff jonas big data new physics
Jeff jonas big data new physicsJeff jonas big data new physics
Jeff jonas big data new physics
MIT Forum of Israel
 
Advanced data modeling with apache cassandra
Advanced data modeling with apache cassandraAdvanced data modeling with apache cassandra
Advanced data modeling with apache cassandra
Patrick McFadin
 
Growth Hacking
Growth Hacking Growth Hacking

Viewers also liked (20)

Cassandra 2.1 boot camp, Read/Write path
Cassandra 2.1 boot camp, Read/Write pathCassandra 2.1 boot camp, Read/Write path
Cassandra 2.1 boot camp, Read/Write path
 
Migrating Netflix from Datacenter Oracle to Global Cassandra
Migrating Netflix from Datacenter Oracle to Global CassandraMigrating Netflix from Datacenter Oracle to Global Cassandra
Migrating Netflix from Datacenter Oracle to Global Cassandra
 
An Overview of Apache Cassandra
An Overview of Apache CassandraAn Overview of Apache Cassandra
An Overview of Apache Cassandra
 
Cassandra @Formspring
Cassandra @FormspringCassandra @Formspring
Cassandra @Formspring
 
Hadoop and Cassandra at Rackspace
Hadoop and Cassandra at RackspaceHadoop and Cassandra at Rackspace
Hadoop and Cassandra at Rackspace
 
What Every Developer Should Know About Database Scalability
What Every Developer Should Know About Database ScalabilityWhat Every Developer Should Know About Database Scalability
What Every Developer Should Know About Database Scalability
 
From 100s to 100s of Millions
From 100s to 100s of MillionsFrom 100s to 100s of Millions
From 100s to 100s of Millions
 
Camunda and Apache Cassandra
Camunda and Apache CassandraCamunda and Apache Cassandra
Camunda and Apache Cassandra
 
BI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache CassandraBI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache Cassandra
 
Management Consulting
Management ConsultingManagement Consulting
Management Consulting
 
Oprah Winfrey
Oprah WinfreyOprah Winfrey
Oprah Winfrey
 
Reverse Engineering
Reverse EngineeringReverse Engineering
Reverse Engineering
 
Chess
ChessChess
Chess
 
Lionel Messi
Lionel MessiLionel Messi
Lionel Messi
 
Lionel messi
Lionel messiLionel messi
Lionel messi
 
Cassandra Internals: The Read Path (Tyler Hobbs, DataStax) | Cassandra Summit...
Cassandra Internals: The Read Path (Tyler Hobbs, DataStax) | Cassandra Summit...Cassandra Internals: The Read Path (Tyler Hobbs, DataStax) | Cassandra Summit...
Cassandra Internals: The Read Path (Tyler Hobbs, DataStax) | Cassandra Summit...
 
Jeff jonas big data new physics
Jeff jonas big data new physicsJeff jonas big data new physics
Jeff jonas big data new physics
 
Advanced data modeling with apache cassandra
Advanced data modeling with apache cassandraAdvanced data modeling with apache cassandra
Advanced data modeling with apache cassandra
 
Growth Hacking
Growth Hacking Growth Hacking
Growth Hacking
 
Workshop
WorkshopWorkshop
Workshop
 
