Cassandra Advanced, Part II

Cassandra
Wide-column database
Presentation by
Dhivya Ramasamy
Email: achuhivi08@gmail.com

• Cassandra is an open-source distributed database from Apache.
• It is highly scalable and designed to manage very large amounts of structured data.
• It provides high availability with no single point of failure.
• It is a wide-column (column-family) database.
Cassandra Overview

Cassandra                                          | RDBMS
Handles structured and semi-structured data.       | Handles structured data.
Flexible schema.                                   | Fixed schema.
Relationships are represented using collections.   | Relationships are represented using foreign keys, joins, etc.
Does not support joins.                            | Supports joins.
Cassandra Vs RDBMS

• Cassandra is designed to handle big data workloads across multiple nodes without any single point of failure.
• Cassandra uses a peer-to-peer distributed architecture across its nodes.
• Data is distributed among all the nodes in a cluster.
Advantages and Applicable Area
• Open source
• Peer to peer
• High availability and performance
Cassandra Architecture

• The components of the Cassandra data model are keyspaces, tables, and columns.
• Keyspace: the outermost container for data in Cassandra.
◦ There is no default keyspace.
◦ Replication is specified at the keyspace level.

Cassandra Data Model

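• CQL supports the built-in aggregates count, min, max, sum, and avg (since Cassandra 2.2), but they are practical only when restricted to a single partition.
• CQL supports GROUP BY only on partition key and clustering columns (since Cassandra 3.10) and does not support HAVING.
• CQL does not support joins.
• CQL does not support OR in WHERE clauses; IN covers some cases.
• CQL does not support wildcard queries on ordinary columns; LIKE works only on columns with a SASI index.
• CQL does not support UNION or INTERSECT queries.
• Non-key columns cannot be filtered without a secondary index, unless the query uses ALLOW FILTERING.
• Range predicates (> and <) are generally supported only on clustering columns.
• CQL is not well suited to analytics because of these limitations.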
Cassandra Query Language (CQL) and cqlsh

• Gossip is the internal communication technique that nodes in a cluster use to talk to each other.
• It runs every second on every node and exchanges state messages with up to three other nodes in the cluster.
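The gossip behaviour described above can be illustrated with a toy simulation. This is only a rough sketch and not Cassandra's actual gossip code: the class GossipSketch, its method names, and the choice of uniformly random peers are assumptions made for illustration. Each simulated round stands for one gossip interval, in which every node exchanges state with up to `fanout` peers.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Toy model of gossip rounds (hypothetical names; not Cassandra's implementation).
class GossipSketch {

    // Runs rounds until every node knows every node's state; returns the round count.
    static int roundsUntilConverged(int nodeCount, int fanout, long seed) {
        Random rng = new Random(seed);
        // known.get(i) holds the ids of the nodes whose state node i has heard about.
        List<Set<Integer>> known = new ArrayList<>();
        for (int i = 0; i < nodeCount; i++) {
            Set<Integer> self = new HashSet<>();
            self.add(i); // initially a node only knows its own state
            known.add(self);
        }
        int rounds = 0;
        while (!allKnowEverything(known, nodeCount)) {
            rounds++;
            for (int i = 0; i < nodeCount; i++) {
                // Each node contacts up to `fanout` randomly chosen peers per round.
                for (int k = 0; k < fanout; k++) {
                    int peer = rng.nextInt(nodeCount);
                    if (peer == i) {
                        continue; // a node does not gossip with itself
                    }
                    // The exchange is two-way: both sides end up with the union.
                    known.get(i).addAll(known.get(peer));
                    known.get(peer).addAll(known.get(i));
                }
            }
        }
        return rounds;
    }

    static boolean allKnowEverything(List<Set<Integer>> known, int nodeCount) {
        for (Set<Integer> s : known) {
            if (s.size() != nodeCount) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // 100 nodes, each gossiping with up to three peers per round.
        System.out.println("rounds: " + roundsUntilConverged(100, 3, 42L));
    }
}
```

Even with only three contacts per node per round, state spreads through the whole simulated cluster in a handful of rounds, which is why a small fanout is enough in practice.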
Gossip and Snitching

• The snitch's job is to determine which data centers and racks Cassandra should read data from and write data to.
• Types of snitches:
◦ SimpleSnitch
◦ GossipingPropertyFileSnitch
◦ PropertyFileSnitch
◦ Ec2Snitch
◦ Ec2MultiRegionSnitch
◦ RackInferringSnitch
Gossip and Snitching

• Compaction is a maintenance process in Cassandra in which SSTables are merged and reorganized to optimize the data structures on disk.
• It is needed because every memtable flush writes a new SSTable, so the number of SSTables grows over time.
• There are two types of compaction in Cassandra:
◦ Minor compaction: starts automatically when a new SSTable is created. Cassandra condenses SSTables of similar size into one.
◦ Major compaction: triggered manually using nodetool. It compacts all SSTables of a table (column family) into one.
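The idea of compacting several SSTables into one can be sketched as a merge that keeps only the newest value for each key. This is a simplified illustration under stated assumptions: CompactionSketch, Cell, and compact are hypothetical names, each SSTable is modelled as a sorted key-to-cell map, and tombstones, TTLs, and compaction strategies are left out.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Simplified model of compaction (hypothetical names; not Cassandra's implementation).
class CompactionSketch {

    // One stored value together with its write timestamp.
    static final class Cell {
        final String value;
        final long timestamp;

        Cell(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    // Merges several SSTables into one, keeping the newest cell for each key.
    static SortedMap<String, Cell> compact(List<SortedMap<String, Cell>> sstables) {
        SortedMap<String, Cell> merged = new TreeMap<>();
        for (SortedMap<String, Cell> sstable : sstables) {
            for (Map.Entry<String, Cell> e : sstable.entrySet()) {
                Cell current = merged.get(e.getKey());
                // Keep the incoming cell if the key is new or the cell is more recent.
                if (current == null || e.getValue().timestamp > current.timestamp) {
                    merged.put(e.getKey(), e.getValue());
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        SortedMap<String, Cell> older = new TreeMap<>();
        older.put("alice", new Cell("v1", 10L));
        older.put("bob", new Cell("v1", 11L));

        SortedMap<String, Cell> newer = new TreeMap<>();
        newer.put("alice", new Cell("v2", 20L));

        SortedMap<String, Cell> merged = compact(Arrays.asList(older, newer));
        for (Map.Entry<String, Cell> e : merged.entrySet()) {
            // prints "alice -> v2" then "bob -> v1"
            System.out.println(e.getKey() + " -> " + e.getValue().value);
        }
    }
}
```

After the merge, a read has one file to consult instead of two, and the overwritten value of "alice" no longer takes up space.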
Compaction

• A Bloom filter is a probabilistic data structure that lets Cassandra determine one of two possible states:
◦ The data definitely does not exist in the given file, or
◦ The data probably exists in the given file.
• Cassandra checks the filter to see whether the requested row may exist in an SSTable before doing any disk I/O.
• To change the Bloom filter setting on a table (column family):
◦ ALTER TABLE addamsFamily WITH bloom_filter_fp_chance = 0.01;
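The two possible answers of a Bloom filter can be seen in a minimal sketch. The class BloomFilterSketch, its method names, and the hashing scheme below are assumptions chosen for brevity; Cassandra's own implementation differs. A lower bloom_filter_fp_chance corresponds to a larger bit array and therefore fewer false "probably exists" answers.

```java
import java.util.BitSet;

// Minimal Bloom filter sketch (hypothetical names; not Cassandra's implementation).
class BloomFilterSketch {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    BloomFilterSketch(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Derives the i-th bit position from two base hashes (double hashing).
    private int position(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1; // odd step size; an arbitrary choice for this sketch
        return Math.floorMod(h1 + i * h2, size);
    }

    // Records a key by setting all of its bit positions.
    void add(String key) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(position(key, i));
        }
    }

    // false means "definitely absent"; true means "probably present".
    boolean mightContain(String key) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BloomFilterSketch filter = new BloomFilterSketch(1024, 3);
        filter.add("row-1");
        // A key that was added is never reported as absent.
        System.out.println(filter.mightContain("row-1")); // prints "true"
        // A key that was not added is usually reported as absent,
        // but a false positive is possible.
        System.out.println(filter.mightContain("row-2"));
    }
}
```

A "false" answer lets a read skip the SSTable entirely, while a "true" answer still requires reading the file to confirm.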
Bloom Filter

• Change data capture (CDC) is designed to capture insert, update, and delete activity applied to tables (column families) and to make the details of the changes available in an easily consumed format.
• CDC logs use the same binary format as the commit log.
• After the disk space limit is reached, CDC-enabled tables reject writes until space is freed.
• Enable CDC logging and configure the CDC directory and space limits in cassandra.yaml:
◦ cdc_enabled: true
◦ cdc_total_space_in_mb: 4096
◦ cdc_free_space_check_interval_ms: 250
◦ cdc_raw_directory: /var/lib/cassandra/cdc_raw
• To enable CDC logging for a table:
◦ CREATE TABLE foo (a int, b text, PRIMARY KEY(a)) WITH cdc=true;
◦ ALTER TABLE <table_name> WITH cdc=true;
Change Data Capture

• NODETOOL
◦ nodetool is a basic command-line tool, bundled with the Cassandra distribution, for node management and statistics gathering.
◦ It shows cluster status, compactions, bootstrap streams, and much more.
◦ It is a very important source of information, but it is only a CLI tool with no storage or visualization capabilities.
Monitoring

• JMX & REPORTERS
◦ Cassandra exposes all its metrics via JMX (by default on port 7199).
◦ JMX can be read with, for example, jconsole or VisualVM with the VisualVM-MBeans plugin (jconsole ships with the JDK; VisualVM was bundled through JDK 8 and is a separate download for later versions).
◦ Remote JMX is disabled by default. If you really need it, you can enable it in cassandra-env.sh.
• DATASTAX OPSCENTER
◦ OpsCenter is a monitoring and management solution.
◦ It is also capable of system monitoring.
◦ Every node needs an OpsCenter agent installed; the agent sends data to the main OpsCenter service, which in turn stores it in a Cassandra keyspace.
◦ It is compatible with open-source Cassandra up to version 2.1.
Monitoring

• Example
Spark Cassandra Connector
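import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaRDD;
import com.datastax.spark.connector.japi.CassandraJavaUtil;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapRowTo;
// Person is a JavaBean whose properties map to the columns of my_table.
SparkConf conf = new SparkConf().setAppName("My application");
SparkContext sc = new SparkContext(conf);
JavaRDD<Person> personRdd = CassandraJavaUtil.javaFunctions(sc)
    .cassandraTable("my_keyspace", "my_table", mapRowTo(Person.class));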
