The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data.Cassandra's support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.
http://tyfs.rocks
This is a presentation of the popular NoSQL database Apache Cassandra which was created by our team in the context of the module "Business Intelligence and Big Data Analysis".
Agenda
- What is NOSQL?
- Motivations for NOSQL?
- Brewer’s CAP Theorem
- Taxonomy of NOSQL databases
- Apache Cassandra
- Features
- Data Model
- Consistency
- Operations
- Cluster Membership
- What Does NOSQL means for RDBMS?
Basic Introduction to Cassandra with Architecture and strategies.
with big data challenge. What is NoSQL Database.
The Big Data Challenge
The Cassandra Solution
The CAP Theorem
The Architecture of Cassandra
The Data Partition and Replication
This is a presentation of the popular NoSQL database Apache Cassandra which was created by our team in the context of the module "Business Intelligence and Big Data Analysis".
Agenda
- What is NOSQL?
- Motivations for NOSQL?
- Brewer’s CAP Theorem
- Taxonomy of NOSQL databases
- Apache Cassandra
- Features
- Data Model
- Consistency
- Operations
- Cluster Membership
- What Does NOSQL means for RDBMS?
Basic Introduction to Cassandra with Architecture and strategies.
with big data challenge. What is NoSQL Database.
The Big Data Challenge
The Cassandra Solution
The CAP Theorem
The Architecture of Cassandra
The Data Partition and Replication
Here is my seminar presentation on No-SQL Databases. it includes all the types of nosql databases, merits & demerits of nosql databases, examples of nosql databases etc.
For seminar report of NoSQL Databases please contact me: ndc@live.in
CASSANDRA A DISTRIBUTED NOSQL DATABASE FOR HOTEL MANAGEMENT SYSTEMIJCI JOURNAL
Apache Cassandra is a distributed storage system for managing very large amounts of structured data.
Cassandra provides highly available service with no single point of failure. Cassandra aims to run on top
of an infrastructure of hundreds of nodes possibly spread across different data centers with small and large
components fail continuously. Cassandra manages the persistent state in the face of the failures which
drives the reliability and scalability of the software systems. Cassandra does not support a full relational
data model because it resembles a database and shares many design and implementation strategies. In this
paper, discuss an implementation of Cassandra as Hotel Management System application. Cassandra
system was designed to run on cheap commodity hardware. Cassandra provides high write throughput and
read efficiency.
This Presentation is about NoSQL which means Not Only SQL. This presentation covers the aspects of using NoSQL for Big Data and the differences from RDBMS.
Comparison between mongo db and cassandra using ycsbsonalighai
Performed YCSB benchmarking test to check the performances of MongoDB and Cassandra for different workloads and a million opcounts and generated a report discussing clear insights.
NoSQL Databases: An Introduction and Comparison between Dynamo, MongoDB and C...Vivek Adithya Mohankumar
The research paper covers the consolidated interpretation of NoSQL systems, on the basis of performance, scalability and data aggregation, and compares the types of NoSQL databases based on their implementation and maintenance.
The project is focussed on Comparison Between HBASE and CASSANDRA using YCSB. It is a data storage and management project performed at National College Of Ireland
Here is my seminar presentation on No-SQL Databases. it includes all the types of nosql databases, merits & demerits of nosql databases, examples of nosql databases etc.
For seminar report of NoSQL Databases please contact me: ndc@live.in
CASSANDRA A DISTRIBUTED NOSQL DATABASE FOR HOTEL MANAGEMENT SYSTEMIJCI JOURNAL
Apache Cassandra is a distributed storage system for managing very large amounts of structured data.
Cassandra provides highly available service with no single point of failure. Cassandra aims to run on top
of an infrastructure of hundreds of nodes possibly spread across different data centers with small and large
components fail continuously. Cassandra manages the persistent state in the face of the failures which
drives the reliability and scalability of the software systems. Cassandra does not support a full relational
data model because it resembles a database and shares many design and implementation strategies. In this
paper, discuss an implementation of Cassandra as Hotel Management System application. Cassandra
system was designed to run on cheap commodity hardware. Cassandra provides high write throughput and
read efficiency.
This Presentation is about NoSQL which means Not Only SQL. This presentation covers the aspects of using NoSQL for Big Data and the differences from RDBMS.
Comparison between mongo db and cassandra using ycsbsonalighai
Performed YCSB benchmarking test to check the performances of MongoDB and Cassandra for different workloads and a million opcounts and generated a report discussing clear insights.
NoSQL Databases: An Introduction and Comparison between Dynamo, MongoDB and C...Vivek Adithya Mohankumar
The research paper covers the consolidated interpretation of NoSQL systems, on the basis of performance, scalability and data aggregation, and compares the types of NoSQL databases based on their implementation and maintenance.
The project is focussed on Comparison Between HBASE and CASSANDRA using YCSB. It is a data storage and management project performed at National College Of Ireland
Cassandra is an open source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers robust support for clusters spanning multiple data-centres,with asynchronous master-less replication allowing low latency operations for all clients.
This is a preliminary study and the objective of this study is to make simple distributed database system with some basic tutorials. Cassandra is a distributed database from Apache that is highly scalable and designed to accomplish very large amounts of organized data. Without having a single point of failure, it offers high accessibility. This report highlights with a basic outline of Cassandra trailed by its architecture, installation, and significant classes and interfaces. Subsequently, it proceeds to cover how to perform operations such as CREATE, ALTER, UPDATE, and DELETE on KEYSPACES, TABLES, and INDEXES using CQLSH using C#/.NET Client with a sample program done by ASP.NET(C#).
A NOVEL APPROACH FOR HOTEL MANAGEMENT SYSTEM USING CASSANDRAijfcstjournal
Apache Cassandra is a distributed storage system for managing very large amounts of structured data. Cassandra provides highly available service with no single point of failure. Cassandra aims to run on top of an infrastructure of hundreds of nodes possibly spread across different data centers with small and large components fail continuously. Cassandra manages the persistent state in the face of the failures which drives the reliability and scalability of the software systems. Cassandra does not support a full relational data model because it resembles a database and shares many design and implementation strategies. In this paper, discuss an implementation of Cassandra as Hotel Management System application. Cassandra system was designed to run on cheap commodity hardware. Cassandra provides high write throughput and read efficiency.
A NOVEL APPROACH FOR HOTEL MANAGEMENT SYSTEM USING CASSANDRAijfcstjournal
Apache Cassandra is a distributed storage system for managing very large amounts of structured data.
Cassandra provides highly available service with no single point of failure. Cassandra aims to run on top
of an infrastructure of hundreds of nodes possibly spread across different data centers with small and large
components fail continuously. Cassandra manages the persistent state in the face of the failures which
drives the reliability and scalability of the software systems. Cassandra does not support a full relational
data model because it resembles a database and shares many design and implementation strategies. In this
paper, discuss an implementation of Cassandra as Hotel Management System application. Cassandra
system was designed to run on cheap commodity hardware. Cassandra provides high write throughput and
read efficiency.
I have examined the performance of two databases - HBase and Cassandra in terms of their scalability, security, performance and compared the results thus obtained through different operations on the Ubuntu interface.
Devise and implement a test strategy in order to perform a comparative analysis of the capabilities of two database management systems (Cassandra and HBase) in terms of performance.
Approach: Installation and implementation of instances of the two data storage and management systems. The Yahoo Cloud Serving Benchmark is used to compare the performances of HBase and Cassandra. Average latency and throughput were considered for analyzing the comparison of the two databases. The results obtained from YCSB are then analyzed and visualized with the help of Tableau.
Findings: HBase performs insertion, reading, and updating of records faster than Cassandra but only when the operations count is less. At heavier loads, Cassandra performs better than Hbase.
Tools: Hbase, Cassandra, Hadoop, Tableau, YCSB
Cassandra training course is designed to provide knowledge and skills to become a successful Cassandra developer. In depth knowledge of concepts such as Clusters, Keyspaces, Column familes, Replication, Cassandra’s Data Model, Cassandra’s Architecture, Performance Tuning, How to read and write data and finally how to integrate Cassandra with Hadoop will be covered in this course.
The Building Blocks of QuestDB, a Time Series Databasejavier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review a history of some of the changes we have gone over the past two years to deal with late and unordered data, non-blocking writes, read-replicas, or faster batch ingestion.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
Analysis insight about a Flyball dog competition team's performanceroli9797
Insight of my analysis about a Flyball dog competition team's last year performance. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfGetInData
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is a growth in interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach for LLMs context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built upon these three concepts a robust Data Copilot that can help to democratize access to company data assets and boost performance of everyone working with data platforms.
Why do we need yet another (open-source ) Copilot?
How can we build one?
Architecture and evaluation
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms, like PageRank Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
4. Cassandra Architecture – CAP Theorem
tyfs.rocks 426.07.2017
Cassandra was designed to fall in the “AP” intersection of
the CAP theorem that states that any distributed system can
only guarantee two of the following capabilities at same time;
Consistency, Availability and Partition Tolerance. In this way
Cassandra is a best fit for a solution seeking a distributed
database that brings high availability to a system and is also very
tolerant to partition to its data when some node in the cluster is
offline, which is common in distributed systems.
5. Cassandra Architecture – Data Model
tyfs.rocks 526.07.2017
Cassandra is classified as a column based database, which means that its
basic structure to store data is based upon a set of columns, which are
comprised, by a pair of column key and column value. Every row is identified
by a unique key, a string without a size limit, called partition key. Each set of
columns are called column families, similar to a relational database table.
6. Cassandra Architecture – Data Model
tyfs.rocks 626.07.2017
SortedMap<RowKey,SortedMap<ColumnKey, ColumnValue>>
A map gives efficient key lookup, and the sorted nature gives efficient scans. In Cassandra, we can use row keys and column
keys to do efficient lookups and range scans.
The number of column keys is unbounded. This means, you can have wide rows.
A key can itself hold a value, meaning In other words, you can have a valueless column.
7. Cassandra Architecture – Write Path
tyfs.rocks 726.07.2017
Cassandra Write Path
Every node first writes the mutation to the commit log
and then writes the mutation to the memtable.
Writing to the commit log ensures durability of the write
as the memtable is an in-memory structure and is only
written to disk when the memtable is flushed to disk. A
memtable is flushed to disk when:
• It reaches its maximum allocated size in memory
• The number of minutes a memtable can stay in
memory elapses.
• Manually flushed by a user
A memtable is flushed to an immutable structure called
and SSTable (Sorted String Table). The commit log is used
for playback purposes in case data from the memtable is
lost due to node failure.
Every SSTable creates three files on disk which include a
bloom filter, a key index and a data file.
8. Cassandra Architecture – Read Path
tyfs.rocks 826.07.2017
Cassandra Read Path
Every Column Family stores data in a number of
SSTables. Thus Data for a particular row can be located in
a number of SSTables and the memtable. Thus for every
read request Cassandra needs to read data from all
applicable SSTables ( all SSTables for a column family)
and scan the memtable for applicable data fragments.
This data is then merged and returned to the
coordinator.
If the contacted replicas has a different version of the
data the coordinator returns the latest version to the
client and issues a read repair command to the
node/nodes with the older version of the data. The read
repair operation pushes the newer version of the data to
nodes with the older version.
9. Cassandra Architecture – Cluster Topology
tyfs.rocks 926.07.2017
Cluster Concepts
a node is a cassandra instance (in
production: one node per machine)
a partition is one ordered and replicable
unit of data on a node
a rack is a logical set of nodes
a Data Center is a logical set or racks
Cluster is the full set of nodes which
map to a single complete token ring
peer-to-peer communication gossip
protocol
10. Cassandra Architecture – Data Consistency
tyfs.rocks 1026.07.2017
Tunable Data Consistency
How many nodes must acknowledge a
read/write request
choose between STRONG to
EVENTUAL
possible CL: ANY, ONE, QUORUM
(RF/2+1), ALL
tunable per request support
multi-datacenter support
11. Cassandra Architecture – CQL Language
tyfs.rocks 1126.07.2017
Cassandra Query Language
very similar to RDBMS SQL syntax
create objects via DDL
core DML commands insert,
update, delete supported
query data with Select commands
12. Cassandra Architecture – Security
tyfs.rocks 1226.07.2017
Cassandra Security Features
Authentication based on internally
controlled rolename/passwords
Authorization based on object
permission management
Authentication and authorization
based on JMX
username/passwords
SSL encryption
13. Why Cassandra ?
tyfs.rocks 1326.07.2017
• Scales linearly with massive write
Cassandra is a great database which can handle a big amount of data. So it is preferred for the companies that provide
Mobile phones and messaging services. These companies have a huge amount of data, so Cassandra is best for them.
• Highly Fault Tolerant
Masterless cluster with no single point of failure. In simple terms, your users will never know if a server, an entire rack
of servers, or even if an entire data center fails. There is also the potential for zero downtime rolling upgrades.
• Easy Replication / Data Distribution
• Homogenous Environment
No master-slave or sharding setup and that all nodes in the ring are equal.
• Ease of Administration
Masterless, fault-tolerant, supports temporary loss of nodes with minimal impact to production performance.
• Wide Community
No master-slave or sharding setup and that all nodes in the ring are equal.
14. Use Cases of Cassandra
tyfs.rocks 1426.07.2017
• Messaging & Event Sourcing
Cassandra is a great database which can handle a big amount of data. So it is preferred for the companies that provide
Mobile phones and messaging services. These companies have a huge amount of data, so Cassandra is best for them.
• IoT & High Speed Applications
Cassandra can handle the high speed data so it is a great database for the applications where data is coming at very
high speed from different devices or sensors.
• Product Catalogs and Retail Apps
Cassandra is used by many retailers for durable shopping cart protection and fast product catalog input and output.
• Social Media Analytics & Recommendations
Cassandra is a great database for many online companies and social media providers for analysis and
recommendation to their customers.
15. Cassandra for Akka Persistence
tyfs.rocks 1526.07.2017
• Linear scalability
Expected Massive Load
• No SPOF
Fault-tolerant, Resilient
• Always-On Multi-Data Center
Data Distribution & Replication
Cluster over Multi-Data Centers
• AKKA Persistence
CQRS with Event-Sourcing
Akka’s supported up to date plugin
(Lightbend)
• Akka Streams
Batch Processing over Streaming
16. Cassandra Benchmarks
tyfs.rocks 1626.07.2017
University of TORONTO, NoSQL Database Performance Benchmarks, 2012
Write latency for workload read/write
Throughput for workload read/scan/write
Read latency for workload read/write
Throughput for workload read/write
20. Resources
tyfs.rocks 2026.07.2017
• Apache Cassandra Web Site
• Planet Cassandra Community
• DataStax Web Site
• The Distributed Architecture Behind Apache Cassandra, Bruno TINOCO
• Introduction to Apache Cassandra's Architecture, Akhil Mehra
• An Overview of Apache Cassandra, DataStax
• NoSQL Performance Benchmarks, DataStax
• Top 10 Reasons to Use Cassandra, Michael COLBY
• Security in Cassandra, IBM Developer Works
Editor's Notes
Cassandra was designed to fall in the “AP” intersection of the CAP theorem that states that any distributed system can only guarantee two of the following capabilities at same time; Consistency, Availability and Partition tolerance. In this way Cassandra is a best fit for a solution seeking a distributed database that brings high availability to a system and is also very tolerant to partition to its data when some node in the cluster is offline, which is common in distributed systems.
Cassandra was designed to fall in the “AP” intersection of the CAP theorem that states that any distributed system can only guarantee two of the following capabilities at same time; Consistency, Availability and Partition tolerance. In this way Cassandra is a best fit for a solution seeking a distributed database that brings high availability to a system and is also very tolerant to partition to its data when some node in the cluster is offline, which is common in distributed systems.
Cassandra was designed to fall in the “AP” intersection of the CAP theorem that states that any distributed system can only guarantee two of the following capabilities at same time; Consistency, Availability and Partition tolerance. In this way Cassandra is a best fit for a solution seeking a distributed database that brings high availability to a system and is also very tolerant to partition to its data when some node in the cluster is offline, which is common in distributed systems.
Cassandra was designed to fall in the “AP” intersection of the CAP theorem that states that any distributed system can only guarantee two of the following capabilities at same time; Consistency, Availability and Partition tolerance. In this way Cassandra is a best fit for a solution seeking a distributed database that brings high availability to a system and is also very tolerant to partition to its data when some node in the cluster is offline, which is common in distributed systems.
Cassandra was designed to fall in the “AP” intersection of the CAP theorem that states that any distributed system can only guarantee two of the following capabilities at same time; Consistency, Availability and Partition tolerance. In this way Cassandra is a best fit for a solution seeking a distributed database that brings high availability to a system and is also very tolerant to partition to its data when some node in the cluster is offline, which is common in distributed systems.
Each node processes the request individually. Every node first writes the mutation to the commit log and then writes the mutation to the memtable. Writing to the commit log ensures durability of the write as the memtable is an in-memory structure and is only written to disk when the memtable is flushed to disk. A memtable is flushed to disk when:
It reaches its maximum allocated size in memory
The number of minutes a memtable can stay in memory elapses.
Manually flushed by a user
A memtable is flushed to an immutable structure called and SSTable (Sorted String Table). The commit log is used for playback purposes in case data from the memtable is lost due to node failure. For example the machine has a power outage before the memtable could get flushed. Every SSTable creates three files on disk which include a bloom filter, a key index and a data file. Over a period of time a number of SSTables are created. This results in the need to read multiple SSTables to satisfy a read request. Compaction is the process of combining SSTables so that related data can be found in a single SSTable. This helps with making reads much faster.
At the cluster level a read operation is similar to a write operation. As with the write path the client can connect with any node in the cluster. The chosen node is called the coordinator and is responsible for returning the requested data. A row key must be supplied for every read operation. The coordinator uses the row key to determine the first replica. The replication strategy in conjunction with the replication factor is used to determine all other applicable replicas. As with the write path the consistency level determines the number of replica's that must respond before successfully returning data. Let's assume that the request has a consistency level of QUORUM and a replication factor of three, thus requiring the coordinator to wait for successful replies from at least two nodes. If the contacted replicas has a different version of the data the coordinator returns the latest version to the client and issues a read repair command to the node/nodes with the older version of the data. The read repair operation pushes the newer version of the data to nodes with the older version.
On a per SSTable basis the operation becomes a bit more complicated. The illustration above outlines key steps that take place when reading data from an SSTable. Every SSTable has an associated bloom filter which enables it to quickly ascertain if data for the requested row key exists on the corresponding SSTable. This reduces IO when performing an row key lookup. A bloom filter is always held in memory since the whole purpose is to save disk IO. Cassandra also keeps a copy of the bloom filter on disk which enables it to recreate the bloom filter in memory quickly . Cassandra does not store the bloom filter Java Heap instead makes a separate allocation for it in memory. If the bloom filter returns a negative response no data is returned from the particular SSTable. This is a common case as the compaction operation tries to group all row key related data into as few SSTables as possible. If the bloom filter provides a positive response the partition key cache is scanned to ascertain the compression offset for the requested row key. It then proceeds to fetch the compressed data on disk and returns the result set. If the partition cache does not contain a corresponding entry the partition key summary is scanned. The partition summary is a subset to the partition index and helps determine the approximate location of the index entry in the partition index. The partition index is then scanned to locate the compression offset which is then used to find the appropriate data on disk. If you reached the end of this long post then well done. In this post I have provided an introduction to Cassandra architecture. In my upcoming posts I will try and explain Cassandra architecture using a more practical approach.
Cassandra was designed to fall in the “AP” intersection of the CAP theorem that states that any distributed system can only guarantee two of the following capabilities at same time; Consistency, Availability and Partition tolerance. In this way Cassandra is a best fit for a solution seeking a distributed database that brings high availability to a system and is also very tolerant to partition to its data when some node in the cluster is offline, which is common in distributed systems.
Cassandra was designed to fall in the “AP” intersection of the CAP theorem that states that any distributed system can only guarantee two of the following capabilities at same time; Consistency, Availability and Partition tolerance. In this way Cassandra is a best fit for a solution seeking a distributed database that brings high availability to a system and is also very tolerant to partition to its data when some node in the cluster is offline, which is common in distributed systems.
Cassandra was designed to fall in the “AP” intersection of the CAP theorem that states that any distributed system can only guarantee two of the following capabilities at same time; Consistency, Availability and Partition tolerance. In this way Cassandra is a best fit for a solution seeking a distributed database that brings high availability to a system and is also very tolerant to partition to its data when some node in the cluster is offline, which is common in distributed systems.
Authentication based on internally controlled rolename/passwordsCassandra authentication is roles-based and stored internally in Cassandra system tables. Administrators can create, alter, drop, or list roles using CQL commands, with an associated password. Roles can be created with superuser, non-superuser, and login privileges. The internal authentication is used to access Cassandra keyspaces and tables, and by cqlsh and DevCenter to authenticate connections to Cassandra clusters and sstableloader to load SSTables.
Authorization based on object permission managementAuthorization grants access privileges to Cassandra cluster operations based on role authentication. Authorization can grant permission to access the entire database or restrict a role to individual table access. Roles can grant authorization to authorize other roles. Roles can be granted to roles. CQL commands GRANT and REVOKE are used to manage authorization.
Authentication and authorization based on JMX username/passwordsJMX (Java Management Extensions) technology provides a simple and standard way of managing and monitoring resources related to an instance of a Java Virtual Machine (JVM). This is achieved by instrumenting resources with Java objects known as Managed Beans (MBeans) that are registered with an MBean server. JMX authentication stores username and associated passwords in two files, one for passwords and one for access. JMX authentication is used by nodetool and external monitoring tools such as jconsole.In Cassandra 3.6 and later, JMX authentication and authorization can be accomplished using Cassandra's internal authentication and authorization capabilities.
SSL encryptionCassandra provides secure communication between a client and a database cluster, and between nodes in a cluster. Enabling SSL encryption ensures that data in flight is not compromised and is transferred securely. Client-to-node and node-to-node encryption are independently configured. Cassandra tools (cqlsh, nodetool, DevCenter) can be configured to use SSL encryption. The DataStax drivers can be configured to secure traffic between the driver and Cassandra.
General security measuresTypically, production Cassandra clusters will have all non-essential firewall ports closed. Some ports must be open in order for nodes to communicate in the cluster. These ports are detailed.
Goals for the Tests
Select workloads that are typical of today’s modern applications
Use data volumes that are representative of ‘big data’ datasets that exceed the RAM capacity for each node
Ensure that all data written was done in a manner that allowed no data loss (i.e. durable writes), which is what most production environments require
Tested Workloads
The following workloads were included in the benchmark:
Read-mostly workload, based on YCSB’s provided workload B: 95% read to 5% update ratio
Read/write combination, based on YCSB’s workload A: 50% read to 50% update ratio
Read-modify-write, based on YCSB workload F: 50% read to 50% read-modify-write
Mixed operational and analytical: 60% read, 25% update, 10% insert, and 5% scan
Insert-mostly combined with read: 90% insert to 10% read ratio