Cassandra Tutorial
Apache Cassandra is a free, open-source,
distributed database management
system. It is highly scalable and designed
to manage very large amounts of
structured data. It provides high
availability with no single point of failure.
NoSQL Database
• A NoSQL database (sometimes read as "Not Only SQL") is a
database that provides a mechanism to store and retrieve data other
than the tabular relations used in relational databases. These
databases are typically schema-free, support easy replication, have
simple APIs, are eventually consistent, and can handle huge amounts of data.
• The primary objectives of a NoSQL database are
• simplicity of design,
• horizontal scaling, and
• finer control over availability.
• NoSQL databases use different data structures compared to
relational databases. This makes some operations faster in NoSQL. The
suitability of a given NoSQL database depends on the problem it
must solve.
• Apache Cassandra is an open source distributed database
system that is designed for storing and managing large
amounts of data across commodity servers. Cassandra can
serve as both a real-time operational data store for online
transactional applications and a read-intensive database for
large-scale business intelligence systems.
• Originally created at Facebook, Cassandra is designed to have
peer-to-peer symmetric nodes, instead of master or named
nodes, so that there is no single point of failure.
Cassandra automatically partitions data across all the nodes
in the database cluster, but the administrator has the power to
determine what data will be replicated and how many copies
of the data will be created.
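
As a rough illustration of automatic partitioning with an administrator-chosen number of copies, the sketch below places nodes on a hash ring and walks clockwise to pick replicas. The `Ring` class, the `token` helper, the node names, and the 32-bit MD5-based token space are all assumptions made for this example; a real cluster uses its own partitioner and replication strategy.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a ring position (hypothetical 32-bit token space)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class Ring:
    """Toy hash ring: every node is a peer, there is no master."""

    def __init__(self, nodes, replication_factor):
        if replication_factor > len(nodes):
            raise ValueError("replication factor cannot exceed node count")
        self.replication_factor = replication_factor
        # Each node owns one token; sorting lets us walk the ring clockwise.
        self.entries = sorted((token(name), name) for name in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas(self, row_key):
        """Return the nodes that hold copies of row_key."""
        # First node at or after the key's token, wrapping around the ring.
        start = bisect.bisect_left(self.tokens, token(row_key)) % len(self.entries)
        return [
            self.entries[(start + i) % len(self.entries)][1]
            for i in range(self.replication_factor)
        ]


nodes = ["node-a", "node-b", "node-c", "node-d"]
ring = Ring(nodes, replication_factor=3)
owners = ring.replicas("user:42")
print(owners)  # three distinct node names; which ones depends on the hash
```

Because placement depends only on the key's hash, any node can compute where a row lives, which is what lets every node accept client requests.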
Features of Cassandra
• Cassandra Features:
• Elastic scalability - Cassandra is highly scalable; it allows you to add more hardware to
accommodate more customers and more data as required.
• Always on architecture - Cassandra has no single point of failure and it is continuously
available for business-critical applications that cannot afford a failure.
• Fast linear-scale performance - Cassandra is linearly scalable, i.e., it increases your
throughput as you increase the number of nodes in the cluster. Therefore it maintains a
quick response time.
• Flexible data storage - Cassandra accommodates all possible data formats including:
structured, semi-structured, and unstructured. It can dynamically accommodate changes to
your data structures according to your need.
• Easy data distribution - Cassandra provides the flexibility to distribute data where you
need by replicating data across multiple data centers.
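• Transaction support - Cassandra offers limited transactional guarantees rather than full
ACID transactions: writes are atomic and isolated within a single partition, durable through
the commit log, and consistency is tunable per operation.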
• Fast writes - Cassandra was designed to run on cheap commodity hardware. It performs
blazingly fast writes and can store hundreds of terabytes of data, without sacrificing the
read efficiency.
Components of Cassandra
• Cassandra uses the Gossip Protocol in the background to allow the nodes
to communicate with each other and detect any faulty nodes in the cluster.
• The key components of Cassandra are as follows −
• Node − It is the place where data is stored.
• Data center − It is a collection of related nodes.
• Cluster − A cluster is a component that contains one or more data centers.
• Commit log − The commit log is a crash-recovery mechanism in
Cassandra. Every write operation is written to the commit log.
• Mem-table − A mem-table is a memory-resident data structure. After the
commit log, the data is written to the mem-table. Sometimes, for a
single column family, there will be multiple mem-tables.
• SSTable − It is a disk file to which the data is flushed from the mem-table
when its contents reach a threshold value.
• Bloom filter − A Bloom filter is a quick, probabilistic data structure
for testing whether an element is a member of a set. It can report false
positives but never false negatives. Bloom filters are consulted on reads
to skip SSTables that cannot contain the requested key.
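
The Bloom filter entry above can be made concrete with a small sketch. The `BloomFilter` class, its bit-array size, and the use of seeded SHA-256 hashes are assumptions chosen for readability, not a description of Cassandra's internal implementation.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive num_hashes positions by hashing the key with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # If any position is unset, the key was definitely never added.
        return all(self.bits[pos] for pos in self._positions(key))


sstable_filter = BloomFilter()
for row_key in ["alice", "bob", "carol"]:
    sstable_filter.add(row_key)

print(sstable_filter.might_contain("alice"))  # True: added keys are never missed
print(sstable_filter.might_contain("zed"))    # usually False; a false positive is possible
```

A negative answer lets a read skip a disk file entirely, which is why such a filter is worth keeping in memory next to each SSTable.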
Apache Cassandra data types
• Apache Cassandra NoSQL DBMS supports the most
common data types, including ASCII, bigint, BLOB,
Boolean, counter, decimal, double, float, int, text,
timestamp, UUID, VARCHAR and varint.
• Cassandra's data model offers the convenience of
column indexes with the performance of
log-structured updates, strong support for
denormalization and materialized views, and
built-in caching.
• Data access is performed using Cassandra Query
Language (CQL), which resembles SQL.
Cassandra Query Language
• Users can access Cassandra through its nodes using
Cassandra Query Language (CQL). CQL treats the
database (Keyspace) as a container of tables.
Programmers work with CQL through cqlsh, an interactive prompt,
or through separate application language drivers.
• Clients can contact any of the nodes for their read and write
operations. That node (the coordinator) acts as a proxy
between the client and the nodes holding the data.
• Data storage in Cassandra is row-oriented, meaning that
all contents of a row are serialized together on disk.
Every row has a unique key. Each row can
hold up to 2 billion columns. Furthermore, each row
must fit onto a single server, because data is partitioned
solely by row-key.
• To understand why databases like Cassandra, HBase and
BigTable (I’ll call them DSS, Distributed Storage
Services, from now on) were designed the way they are,
we’ll first have to understand what they were built to be
used for.
• DSS (Distributed Storage Services, not to be confused with
decision support systems) were designed to handle enormous
amounts of data, stored in billions of rows on large clusters.
Relational databases incorporate a lot of things that make it hard to
efficiently distribute them over multiple machines. DSS simply
remove some or all of these ties. No operations are allowed that
require scanning extensive parts of the dataset, meaning no JOINs
or rich queries.
• Cassandra is a NoSQL Column family implementation supporting
the Big Table data model using the architectural aspects introduced
by Amazon Dynamo.
Column Family
• Cassandra consists of many storage nodes and stores each row
within a single storage node. Within each row, Cassandra
always stores columns sorted by their column names. Using
this sort order, Cassandra supports slice queries where given a
row, users can retrieve a subset of its columns falling within a
given column name range. For example, a slice query with
range tag0 to tag9999 will get all the columns whose names
fall between tag0 and tag9999.
• Keyspace – a group of many column families together. It is
only a logical grouping of column families and provides an
isolated scope for names.
• Finally, super columns reside within a column family and
group several columns under one key.
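
The slice query described above (range tag0 to tag9999) can be sketched with a row that keeps its column names sorted. The `Row` class and the column values are hypothetical, and the sketch assumes column names compare as plain strings.

```python
import bisect


class Row:
    """One row whose columns are kept sorted by column name."""

    def __init__(self):
        self.names = []   # sorted column names
        self.values = {}  # column name -> value

    def put(self, name, value):
        if name not in self.values:
            bisect.insort(self.names, name)  # keep names in sorted order
        self.values[name] = value

    def slice(self, start, end):
        """Return (name, value) pairs with start <= name <= end, in name order."""
        lo = bisect.bisect_left(self.names, start)
        hi = bisect.bisect_right(self.names, end)
        return [(n, self.values[n]) for n in self.names[lo:hi]]


row = Row()
for name in ["tag0005", "author", "tag0001", "title", "tag9999"]:
    row.put(name, f"value-of-{name}")

print([n for n, _ in row.slice("tag0", "tag9999")])
# ['tag0001', 'tag0005', 'tag9999']
```

Because the names are already sorted, the slice needs only two binary searches and one contiguous read, rather than a scan of every column in the row.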
• Cassandra provides very fast writes, which are actually
faster than reads; it can transfer data at about
80-360 MB/sec per node. It achieves this using two
techniques. First, Cassandra keeps most of the data in memory
at the responsible node; updates are applied in
memory and written to persistent storage (the file system) in a
lazy fashion. Second, to avoid losing data, Cassandra writes
all transactions to a commit log on disk. Unlike updating
data items on disk, writes to the commit log are append-only
and, therefore, avoid rotational delay while writing to the
disk. For more information on disk-drive performance
characteristics, see Resources.
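
The two techniques above (in-memory updates flushed lazily, plus an append-only commit log) can be sketched as a tiny store. `TinyStore`, its JSON file layout, and the flush threshold of three entries are assumptions for illustration only; they do not reflect Cassandra's actual file formats.

```python
import json
import os
import tempfile


class TinyStore:
    """Toy write path: commit log first, then memtable, then lazy flush."""

    def __init__(self, directory, flush_threshold=3):
        self.directory = directory
        self.flush_threshold = flush_threshold
        self.commit_log_path = os.path.join(directory, "commit.log")
        self.memtable = {}
        self.sstables = []  # paths of flushed files, oldest first

    def write(self, key, value):
        # 1. Append to the commit log so the write survives a crash.
        with open(self.commit_log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
        # 2. Apply the update in memory.
        self.memtable[key] = value
        # 3. Flush to disk only once the memtable reaches the threshold.
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        path = os.path.join(self.directory, f"sstable-{len(self.sstables)}.json")
        with open(path, "w", encoding="utf-8") as out:
            json.dump(dict(sorted(self.memtable.items())), out)
        self.sstables.append(path)
        self.memtable.clear()
        # The flushed data is on disk, so the log can start over.
        open(self.commit_log_path, "w", encoding="utf-8").close()

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for path in reversed(self.sstables):  # newest file first
            with open(path, encoding="utf-8") as src:
                data = json.load(src)
            if key in data:
                return data[key]
        return None


with tempfile.TemporaryDirectory() as tmp:
    store = TinyStore(tmp)
    for i in range(4):
        store.write(f"k{i}", i)
    flushed = len(store.sstables)
    pending = dict(store.memtable)
    found = store.read("k1")

print(flushed, pending, found)  # 1 {'k3': 3} 1
```

Note that a write touches the disk only by appending one line, while a read may have to look in memory and then in one or more flushed files, which matches the claim that writes are cheaper than reads.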
• Unless a write has requested full consistency, Cassandra writes data to enough
nodes without resolving inconsistencies among replicas; it resolves
inconsistencies only at the first read. This process is called "read repair."
• Healing from failure is manual
• If a node in a Cassandra cluster has failed, the cluster will continue to work if
you have replicas. Full recovery, which is to redistribute data and compensate
for missing replicas, is a manual operation through a command line tool
called nodetool. Also, while the manual operation happens, the system will be
unavailable.
• It remembers deletes
• Cassandra is designed such that it continues to work without a problem even if a
node goes down (or gets disconnected) and comes back later. A consequence is
that this complicates data deletions. For example, assume a node is down. While
it is down, a data item is deleted from the other replicas. When the unavailable node
comes back, it will reintroduce the deleted data item during the syncing process
unless Cassandra remembers that the data item has been deleted. Cassandra records
such deletes as markers called tombstones.
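
The deletion scenario above can be sketched with two replicas that reconcile by timestamp. The `Replica` class, the `sync` function, and the integer timestamps are hypothetical; the point is only that a delete stored as a marker wins over an older value during syncing.

```python
TOMBSTONE = object()  # marker meaning "this key was deleted"


class Replica:
    def __init__(self):
        self.cells = {}  # key -> (timestamp, value or TOMBSTONE)

    def write(self, key, value, timestamp):
        current = self.cells.get(key)
        # Keep only the newest version of each key.
        if current is None or timestamp > current[0]:
            self.cells[key] = (timestamp, value)

    def delete(self, key, timestamp):
        # A delete is recorded as a write of a tombstone, not as a removal.
        self.write(key, TOMBSTONE, timestamp)

    def read(self, key):
        cell = self.cells.get(key)
        if cell is None or cell[1] is TOMBSTONE:
            return None
        return cell[1]


def sync(a, b):
    """Exchange cells so both replicas end up with the newest version."""
    for key in set(a.cells) | set(b.cells):
        for src, dst in ((a, b), (b, a)):
            if key in src.cells:
                ts, value = src.cells[key]
                dst.write(key, value, ts)


up, down = Replica(), Replica()
up.write("item", "v1", timestamp=1)
down.write("item", "v1", timestamp=1)
up.delete("item", timestamp=2)  # 'down' is offline and misses the delete
sync(up, down)                  # 'down' comes back and syncs
print(up.read("item"), down.read("item"))  # None None
```

Had the delete simply removed the entry on `up`, the sync would have copied `v1` back from `down`, which is exactly the resurrection problem the text describes.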
