This document provides an overview of Cassandra:
- Cassandra is a distributed, column-oriented database that is highly scalable and has no single point of failure.
- Compared with relational databases, Cassandra has a flexible schema and no joins.
- The data model consists of keyspaces, tables, and columns, with replication specified at the keyspace level.
- Queries in Cassandra Query Language (CQL) are more limited than queries in other databases.
2. Cassandra Overview
Cassandra is a distributed database from Apache.
It is highly scalable and designed to manage very large amounts of structured data.
It offers high availability with no single point of failure.
It is a wide-column store, often loosely described as a column-oriented database.
3. Cassandra Vs RDBMS
Cassandra | RDBMS
Handles structured and semi-structured data. | Handles structured data.
Flexible schema. | Fixed schema.
Relationships are represented using collections. | Relationships are represented using foreign keys, joins, etc.
Does not support joins. | Supports joins.
4. Cassandra Architecture
Cassandra is designed to handle big data workloads across multiple nodes without any single point of failure.
Its nodes form a peer-to-peer distributed system.
Data is distributed among all the nodes in a cluster.
Advantages and Applicable Area
Open source
Peer to peer
High availability and performance
5. Cassandra Data Model
The components of the Cassandra data model are keyspaces, tables, and columns.
Keyspace: the outermost container for data in Cassandra.
◦ There is no default keyspace.
◦ Replication is specified at the keyspace level.
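The idea that replication is set per keyspace and that data is spread over all nodes can be sketched as a small token ring. This is only an illustration under stated assumptions: the node names, the `token` and `replicas_for` helpers, and the use of MD5 are hypothetical, and the placement rule (walk clockwise and take the next distinct nodes) is a simplified stand-in for what Cassandra does with its own partitioner and virtual nodes.

```python
import bisect
import hashlib


def token(value):
    """Map a string to a position on the ring (MD5 is used purely for illustration)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def replicas_for(partition_key, nodes, replication_factor):
    """Return the nodes that would hold copies of a partition.

    Walk clockwise from the key's token and take the next
    `replication_factor` distinct nodes (capped at the cluster size).
    """
    ring = sorted((token(node), node) for node in nodes)
    tokens = [t for t, _ in ring]
    start = bisect.bisect_right(tokens, token(partition_key)) % len(ring)
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


# Hypothetical four-node cluster and a keyspace with replication factor 3.
nodes = ["node-a", "node-b", "node-c", "node-d"]
owners = replicas_for("user:42", nodes, replication_factor=3)
print(owners)
```

Every table in the same keyspace would share the same replication factor in this model, which is the point of configuring replication at the keyspace level.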
6. Cassandra Query Language (CQL) and cqlsh
Early versions of CQL do not support aggregation queries such as max, min, and avg; built-in aggregate functions arrived in Cassandra 2.2.
CQL does not support HAVING; GROUP BY is available only from Cassandra 3.10, and only on primary key columns.
CQL does not support joins.
CQL does not support OR queries.
CQL does not support wildcard queries.
CQL does not support union or intersection queries.
Table columns outside the primary key cannot be filtered without creating an index (or using ALLOW FILTERING, which scans data).
Greater than (>) and less than (<) queries are supported only on clustering columns.
Because of these limitations, CQL is not well suited to analytics.
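One common way to live with the missing OR support is to issue one lookup per key and combine the results in the client. The sketch below shows that pattern only; the `TABLE` dictionary is a hypothetical in-memory stand-in for a table partitioned by `sensor_id`, and `query_partition` stands in for a single-partition SELECT rather than any real driver call.

```python
# Hypothetical in-memory stand-in for a table partitioned by sensor_id.
TABLE = {
    "s1": [{"sensor_id": "s1", "reading": 12.0}, {"sensor_id": "s1", "reading": 18.0}],
    "s2": [{"sensor_id": "s2", "reading": 7.5}],
    "s3": [{"sensor_id": "s3", "reading": 30.0}],
}


def query_partition(sensor_id):
    """Stand-in for: SELECT * FROM readings WHERE sensor_id = ?"""
    return list(TABLE.get(sensor_id, []))


def query_any_of(sensor_ids):
    """Emulate `sensor_id = a OR sensor_id = b` with one lookup per key."""
    rows = []
    for sensor_id in dict.fromkeys(sensor_ids):  # drop duplicate keys, keep order
        rows.extend(query_partition(sensor_id))
    return rows


rows = query_any_of(["s1", "s3", "s1"])

# Aggregates can also be computed client-side over the merged rows.
max_reading = max(row["reading"] for row in rows)
avg_reading = sum(row["reading"] for row in rows) / len(rows)
print(len(rows), max_reading, avg_reading)
```

This keeps each query on a single partition, at the cost of more round trips and of moving the aggregation work into the application.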
7. Gossip and Snitching
Gossip is the internal communication technique that nodes in a cluster use to talk to each other.
It runs every second on every node and exchanges state messages with up to three other nodes in the cluster.
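The exchange described above can be simulated in a few lines. This is a toy model, not Cassandra's gossip protocol: the `gossip_round` function, the node names, and the "heartbeat version" dictionaries are all hypothetical, and each exchange simply leaves both sides with the newest version they know for every node.

```python
import random


def gossip_round(states, rng, fanout=3):
    """One gossip round: every node exchanges state with up to `fanout` peers.

    `states` maps node -> {node: heartbeat version}. After an exchange,
    both sides hold the newest version known for every node.
    """
    names = list(states)
    for node in names:
        peers = [n for n in names if n != node]
        for peer in rng.sample(peers, min(fanout, len(peers))):
            merged = dict(states[node])
            for name, version in states[peer].items():
                merged[name] = max(merged.get(name, 0), version)
            states[node] = dict(merged)
            states[peer] = dict(merged)


rng = random.Random(7)
names = ["n%d" % i for i in range(8)]
states = {name: {name: 1} for name in names}  # each node starts knowing only itself

rounds = 0
while any(len(view) < len(names) for view in states.values()) and rounds < 100:
    gossip_round(states, rng)
    rounds += 1
print("every node knows the whole cluster after", rounds, "round(s)")
```

Even though each node talks to only a few peers per round, knowledge about every node spreads through the whole cluster quickly, which is why no central coordinator is needed.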
8. Gossip and Snitching
A snitch's job is to determine which data centers and racks Cassandra should read data from and write data to.
Types of snitches:
SimpleSnitch
GossipingPropertyFileSnitch
PropertyFileSnitch
Ec2Snitch
Ec2MultiRegionSnitch
RackInferringSnitch
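As an illustration of what a snitch decides, the sketch below mimics the convention used by RackInferringSnitch, which treats the second octet of a node's IP address as the data center and the third octet as the rack. The function names and the IP addresses are hypothetical, and the proximity ordering is a simplified assumption rather than Cassandra's actual routing logic.

```python
def infer_location(ip_address):
    """Return (datacenter, rack) inferred from a dotted IPv4 address."""
    octets = ip_address.split(".")
    if len(octets) != 4:
        raise ValueError("expected a dotted IPv4 address: %r" % ip_address)
    return octets[1], octets[2]


def sort_by_proximity(local_ip, replica_ips):
    """Order replicas: same rack first, then same data center, then the rest."""
    local_dc, local_rack = infer_location(local_ip)

    def distance(ip):
        dc, rack = infer_location(ip)
        if dc != local_dc:
            return 2
        return 0 if rack == local_rack else 1

    return sorted(replica_ips, key=distance)


ordered = sort_by_proximity(
    "10.20.30.5",
    ["10.21.30.9", "10.20.31.7", "10.20.30.8"],
)
print(ordered)
```

A coordinator that prefers the front of this list would read from a replica in its own rack before reaching across racks or data centers.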
9. Compaction
Compaction is a maintenance process in Cassandra in which SSTables are merged and reorganized to optimize the data structures on disk.
Memtables are flushed to disk as new SSTables, so compaction keeps the number of SSTables under control.
There are two types of compaction in Cassandra.
◦ Minor compaction: It starts automatically when a new SSTable is created. Here, Cassandra condenses similarly sized SSTables into one.
◦ Major compaction: It is triggered manually using nodetool. It compacts all SSTables of a column family into one.
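The merge step at the heart of compaction can be sketched as follows. This is a simplified model under stated assumptions: each SSTable is represented as a hypothetical dictionary of key to (timestamp, value), the newest timestamp wins, and deleted values are marked with a `TOMBSTONE` placeholder that is dropped immediately. Real Cassandra keeps tombstones until a grace period (gc_grace_seconds) has passed, which this sketch ignores.

```python
TOMBSTONE = None  # assumed marker for a deleted value


def compact(sstables):
    """Merge SSTables into one, keeping the newest write per key.

    Each SSTable is a dict: key -> (timestamp, value).
    Keys whose newest write is a tombstone are dropped from the result.
    """
    newest = {}
    for table in sstables:
        for key, (timestamp, value) in table.items():
            if key not in newest or timestamp > newest[key][0]:
                newest[key] = (timestamp, value)
    return {
        key: cell
        for key, cell in sorted(newest.items())
        if cell[1] is not TOMBSTONE
    }


older = {"a": (1, "apple"), "b": (1, "banana")}
newer = {"b": (2, "blueberry"), "c": (2, "cherry"), "a": (3, TOMBSTONE)}
merged = compact([older, newer])
print(merged)
```

After the merge, a read has to consult one file instead of two, and the overwritten and deleted entries no longer take up space.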
10. Bloom Filter
A Bloom filter is a probabilistic data structure that lets Cassandra determine one of two possible states:
- The data definitely does not exist in the given file, or
- The data probably exists in the given file.
Cassandra checks it to see whether the requested row may exist in an SSTable before doing any disk I/O.
To change the Bloom filter attribute on a column family:
◦ ALTER TABLE addamsFamily WITH bloom_filter_fp_chance = 0.01;
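The two possible answers of a Bloom filter follow directly from how it is built, which a minimal sketch can show. The class below is an illustration only: the `BloomFilter` name, the bit-array size, the number of hash functions, and the use of SHA-256 are assumptions, not Cassandra's implementation.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: answers "definitely not present" or "probably present"."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive `num_hashes` bit positions by salting the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(("%d:%s" % (seed, key)).encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for position in self._positions(key):
            self.bits[position] = 1

    def might_contain(self, key):
        # False means the key was never added; True means it probably was.
        return all(self.bits[position] for position in self._positions(key))


sstable_filter = BloomFilter()
for row_key in ["gomez", "morticia", "wednesday"]:
    sstable_filter.add(row_key)

print(sstable_filter.might_contain("gomez"))
```

A key that was added always sets all of its bits, so the filter never returns a false "not present". A key that was never added can still find all of its bits set by other keys, which is the false positive that a setting such as bloom_filter_fp_chance trades against memory.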
11. Change Data Capture
Change data capture (CDC) is designed to capture insert, update, and delete activity applied to tables (column families), and to make the details of the changes available in an easily consumed format.
CDC logs use the same binary format as the commit log.
After the disk space limit is reached, CDC-enabled tables reject writes until space is freed.
Enable CDC logging and configure CDC directories and space in cassandra.yaml:
cdc_enabled: true
cdc_total_space_in_mb: 4096
cdc_free_space_check_interval_ms: 250
cdc_raw_directory: /var/lib/cassandra/cdc_raw
To enable CDC logging for a database table:
CREATE TABLE foo (a int, b text, PRIMARY KEY(a)) WITH cdc=true;
ALTER TABLE <table_name> WITH cdc=true;
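The write-rejection behaviour can be pictured with a toy model of the CDC space budget. Everything here is hypothetical: the `CdcSpace` class and its methods are invented for illustration, sizes are tracked as plain megabyte counts, and the real accounting in Cassandra works on commit log segments rather than individual writes.

```python
class CdcSpace:
    """Toy model of the space budget for CDC logs (sizes in MB)."""

    def __init__(self, total_space_mb):
        self.total_space_mb = total_space_mb
        self.used_mb = 0

    def write(self, size_mb):
        """Accept the write if it fits in the budget, otherwise reject it."""
        if self.used_mb + size_mb > self.total_space_mb:
            return False
        self.used_mb += size_mb
        return True

    def consume(self, size_mb):
        """A consumer has processed and deleted `size_mb` of CDC logs."""
        self.used_mb = max(0, self.used_mb - size_mb)


cdc = CdcSpace(total_space_mb=4096)  # mirrors cdc_total_space_in_mb above
first = cdc.write(4000)      # fits in the budget
rejected = cdc.write(200)    # would exceed 4096 MB, so it is rejected
cdc.consume(1000)            # the consumer frees space
retried = cdc.write(200)     # now the write is accepted
print(first, rejected, retried)
```

The practical consequence is that whatever reads the CDC logs must also delete them, otherwise writes to CDC-enabled tables eventually stop.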
12. Monitoring
NODETOOL
Nodetool is a basic tool, bundled in the Cassandra distribution, for node management and statistics gathering.
It shows cluster status, compactions, bootstrap streams, and much more.
It is a very important source of information, but it is just a CLI tool without any storage or visualization capabilities.
13. Monitoring
JMX & REPORTERS
Cassandra exposes all its metrics via JMX (by default on port 7199).
JMX can be read, for example, with jconsole or with jvisualvm and the VisualVM-MBeans plugin (both tools are bundled in JDK distributions).
By default, remote JMX is disabled. If you really need it, you can enable it in cassandra-env.sh.
DATASTAX OPSCENTER
OpsCenter is a monitoring and management solution.
It is also capable of system monitoring.
Every node needs to have an OpsCenter agent installed, which sends data to the main OpsCenter service, which in turn stores it in a Cassandra keyspace.
It is compatible with open source Cassandra up to version 2.1.