Cassandra
column-oriented database
Presentation by
Dhivya Ramasamy
Email:achuhivi08@gmail.com

Cassandra Overview
• Cassandra is a distributed database from Apache.
• It is highly scalable and designed to manage very large amounts of structured data.
• It offers high availability with no single point of failure.
• It is a column-oriented (wide-column) database.

Cassandra Vs RDBMS
Cassandra                                                  | RDBMS
Handles structured, semi-structured, and unstructured data | Handles structured data
Flexible schema                                            | Fixed schema
Relationships are represented using collections            | Relationships are represented using foreign keys and joins
Does not support joins                                     | Supports joins

Cassandra Architecture
• Cassandra is designed to handle big data workloads across multiple nodes without any single point of failure.
• Its nodes form a peer-to-peer distributed system in which every node plays the same role.
• Data is distributed among all the nodes in a cluster.

Advantages and Applicable Area
• Open source
• Peer-to-peer
• High availability and performance

Cassandra Data Model
• The components of the Cassandra data model are keyspaces, tables, and columns.
• A keyspace is the outermost container for data in Cassandra.
  ◦ There is no default keyspace.
  ◦ Replication is specified at the keyspace level.

Cassandra Query Language (CQL) and cqlsh
• Early CQL versions had no aggregation. Cassandra 2.2 and later provide built-in COUNT, MIN, MAX, SUM, and AVG, which are best restricted to a single partition.
• GROUP BY is supported only from Cassandra 3.10 onward, and only on primary key columns. HAVING is not supported.
• CQL does not support joins.
• CQL does not support OR in WHERE clauses.
• CQL does not support wildcard queries (LIKE works only with a SASI index).
• CQL does not support UNION or INTERSECT queries.
• Non-key columns cannot be filtered without a secondary index, unless the query uses ALLOW FILTERING.
• Greater-than (>) and less-than (<) restrictions are normally supported only on clustering columns.
• Because of these limitations, CQL is not well suited to analytics.

Gossip and Snitching
• Gossip is the internal communication technique that nodes in a cluster use to talk to each other.
• It runs every second on every node, which exchanges state messages with up to three other nodes in the cluster.
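The exchange pattern described for gossip can be sketched as a small simulation. This is a toy model under stated assumptions, not Cassandra's actual Gossiper: the class name GossipSketch, the per-node "view" map, and the integer heartbeat versions are all hypothetical and exist only for illustration. Each round stands for one second, and each node exchanges state with up to three randomly chosen peers.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Toy model of gossip rounds; it is not Cassandra's actual Gossiper. */
final class GossipSketch {

    /** Builds per-node views: node id -> newest heartbeat version seen for that node. */
    static List<Map<Integer, Integer>> newCluster(int size) {
        List<Map<Integer, Integer>> views = new ArrayList<>();
        for (int id = 0; id < size; id++) {
            Map<Integer, Integer> view = new HashMap<>();
            view.put(id, 0); // at start a node knows only about itself
            views.add(view);
        }
        return views;
    }

    /** One round (one "second"): each node gossips with up to three random peers. */
    static void gossipRound(List<Map<Integer, Integer>> views, Random random) {
        for (int node = 0; node < views.size(); node++) {
            views.get(node).merge(node, 1, Integer::sum); // bump own heartbeat
            List<Integer> peers = new ArrayList<>();
            for (int other = 0; other < views.size(); other++) {
                if (other != node) {
                    peers.add(other);
                }
            }
            Collections.shuffle(peers, random);
            for (int peer : peers.subList(0, Math.min(3, peers.size()))) {
                exchange(views.get(node), views.get(peer));
            }
        }
    }

    /** Two-way state exchange: both sides keep the newest version per node. */
    static void exchange(Map<Integer, Integer> a, Map<Integer, Integer> b) {
        a.forEach((id, version) -> b.merge(id, version, Integer::max));
        b.forEach((id, version) -> a.merge(id, version, Integer::max));
    }

    /** True once every node has heard about every other node. */
    static boolean converged(List<Map<Integer, Integer>> views) {
        return views.stream().allMatch(view -> view.size() == views.size());
    }

    /** Runs rounds until all views are complete and returns the number of rounds. */
    static int roundsToConverge(int clusterSize, long seed) {
        List<Map<Integer, Integer>> views = newCluster(clusterSize);
        Random random = new Random(seed);
        int rounds = 0;
        while (!converged(views)) {
            gossipRound(views, random);
            rounds++;
        }
        return rounds;
    }

    public static void main(String[] args) {
        System.out.println("rounds for 50 nodes: " + roundsToConverge(50, 42L));
    }
}
```

In this model, state spreads to the whole cluster in a handful of rounds even though no node ever contacts more than three peers per round, which is the property that lets gossip work without a central coordinator.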

Gossip and Snitching
• A snitch tells Cassandra which data center and rack each node belongs to, so that it can decide where to read data from and write data to.
• Types of snitches:
  ◦ SimpleSnitch
  ◦ GossipingPropertyFileSnitch
  ◦ PropertyFileSnitch
  ◦ Ec2Snitch
  ◦ Ec2MultiRegionSnitch
  ◦ RackInferringSnitch

Compaction
• Compaction is a maintenance process in Cassandra in which SSTables are merged and rewritten to optimize the data structures on disk.
• It is needed because of how memtables work: every memtable flush writes a new immutable SSTable, so SSTables accumulate over time.
• There are two types of compaction in Cassandra.
  ◦ Minor compaction: starts automatically when a new SSTable is created. Cassandra merges SSTables of similar size into one.
  ◦ Major compaction: triggered manually with nodetool. It compacts all SSTables of a table (column family) into one.
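The merge step at the heart of compaction can be sketched as follows. This is a simplified illustration, not Cassandra's implementation: the class CompactionSketch, the Cell type, and the use of a TreeMap per SSTable are assumptions made for the example, and tombstones, strategies, and on-disk formats are ignored. It only shows several sorted tables collapsing into one, with the newest write for each key winning.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of compaction: merge sorted tables, newest timestamp wins. */
final class CompactionSketch {

    /** A stored value plus the write timestamp used to pick a winner. */
    static final class Cell {
        final String value;
        final long timestamp;

        Cell(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    /** Merges several sorted tables into one, keeping the newest cell per key. */
    static TreeMap<String, Cell> compact(List<TreeMap<String, Cell>> sstables) {
        TreeMap<String, Cell> merged = new TreeMap<>();
        for (TreeMap<String, Cell> table : sstables) {
            for (Map.Entry<String, Cell> entry : table.entrySet()) {
                merged.merge(entry.getKey(), entry.getValue(),
                        (current, candidate) ->
                                candidate.timestamp > current.timestamp ? candidate : current);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        TreeMap<String, Cell> older = new TreeMap<>();
        older.put("alice", new Cell("v1", 100L));
        older.put("bob", new Cell("v1", 100L));

        TreeMap<String, Cell> newer = new TreeMap<>();
        newer.put("alice", new Cell("v2", 200L));

        // Two input tables become one; "alice" keeps only the newer value.
        TreeMap<String, Cell> result = compact(Arrays.asList(older, newer));
        result.forEach((key, cell) -> System.out.println(key + " -> " + cell.value));
    }
}
```

After the merge, a read has to consult one table instead of two, and the overwritten value of "alice" no longer takes up space.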

Bloom Filter
• A Bloom filter is a probabilistic data structure that lets Cassandra determine one of two possible states:
  ◦ the data definitely does not exist in the given file, or
  ◦ the data probably exists in the given file.
• Cassandra uses it to check whether the requested row may exist in an SSTable before doing any disk I/O.
• To change the Bloom filter attribute on a table (column family):
  ◦ ALTER TABLE addamsFamily WITH bloom_filter_fp_chance = 0.01;
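The "definitely not present" versus "probably present" behaviour can be seen in a minimal Bloom filter. This is a sketch under assumptions: the class BloomFilterSketch, its bit-array size, and the two simple hash functions are chosen for brevity and are not what Cassandra uses internally.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.BitSet;

/** Minimal Bloom filter: no false negatives, occasional false positives. */
final class BloomFilterSketch {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    BloomFilterSketch(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    /** Sets one bit per hash function for the given key. */
    void add(String key) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(position(key, i));
        }
    }

    /** False means "definitely absent"; true means "probably present". */
    boolean mightContain(String key) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }

    /** Derives the i-th bit position from two base hashes (double hashing). */
    private int position(String key, int i) {
        byte[] bytes = key.getBytes(StandardCharsets.UTF_8);
        int h1 = Arrays.hashCode(bytes);
        int h2 = fnv1a(bytes);
        return Math.floorMod(h1 + i * h2, size);
    }

    /** 32-bit FNV-1a hash, used here only as a second, independent-looking hash. */
    private static int fnv1a(byte[] data) {
        int hash = 0x811C9DC5;
        for (byte b : data) {
            hash ^= (b & 0xFF);
            hash *= 16777619;
        }
        return hash;
    }

    public static void main(String[] args) {
        BloomFilterSketch filter = new BloomFilterSketch(1024, 3);
        filter.add("row-1");
        System.out.println("row-1 might exist: " + filter.mightContain("row-1"));
        System.out.println("row-2 might exist: " + filter.mightContain("row-2"));
    }
}
```

A larger bit array or a better-tuned number of hash functions lowers the false-positive rate at the cost of memory, which is the trade-off that bloom_filter_fp_chance controls.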

Change Data Capture
• Change data capture (CDC) is designed to capture insert, update, and delete activity applied to tables (column families) and to make the details of the changes available in an easily consumed format.
• CDC logs use the same binary format as the commit log.
• Once the CDC disk space limit is reached, CDC-enabled tables reject writes until space is freed.
• Enable CDC logging and configure the CDC directory and space limits in cassandra.yaml:
  ◦ cdc_enabled: true
  ◦ cdc_total_space_in_mb: 4096
  ◦ cdc_free_space_check_interval_ms: 250
  ◦ cdc_raw_directory: /var/lib/cassandra/cdc_raw
• To enable CDC logging for a table, set the cdc option when creating or altering it:
  ◦ CREATE TABLE foo (a int, b text, PRIMARY KEY(a)) WITH cdc=true;
  ◦ ALTER TABLE <table_name> WITH cdc=true;
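The space-limit behaviour of CDC can be illustrated with a toy accounting model. Everything here is hypothetical: the class CdcSpaceSketch and its methods are not Cassandra APIs, and real accounting is done by the server, not per mutation in client code. The sketch only shows why a consumer that stops draining the CDC directory eventually causes writes to CDC-enabled tables to fail.

```java
/** Toy model of the CDC space limit; not a Cassandra API. */
final class CdcSpaceSketch {
    private final long limitBytes;
    private long usedBytes;

    /** limitBytes plays the role of the configured total CDC space. */
    CdcSpaceSketch(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    /** Returns false (write rejected) when the mutation would exceed the limit. */
    boolean tryWrite(long mutationBytes) {
        if (usedBytes + mutationBytes > limitBytes) {
            return false;
        }
        usedBytes += mutationBytes;
        return true;
    }

    /** Called by a hypothetical consumer after it has processed and deleted CDC data. */
    void free(long bytes) {
        usedBytes = Math.max(0L, usedBytes - bytes);
    }

    public static void main(String[] args) {
        CdcSpaceSketch cdc = new CdcSpaceSketch(100L);
        System.out.println("first write accepted:  " + cdc.tryWrite(60L));
        System.out.println("second write accepted: " + cdc.tryWrite(60L));
        cdc.free(60L); // the consumer catches up and frees space
        System.out.println("retry accepted:        " + cdc.tryWrite(60L));
    }
}
```

The practical consequence is that whatever process reads the CDC directory has to keep up with the write rate and delete the files it has finished with.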

Monitoring
• NODETOOL
  ◦ Nodetool is a basic tool, bundled with the Cassandra distribution, for node management and statistics gathering.
  ◦ It shows cluster status, compactions, bootstrap streams, and much more.
  ◦ It is a very important source of information, but it is only a CLI tool, with no storage or visualization capabilities.

Monitoring
• JMX & REPORTERS
  ◦ Cassandra exposes all its metrics via JMX (by default on port 7199).
  ◦ JMX can be read with, for example, jconsole, or with VisualVM and its VisualVM-MBeans plugin (jconsole ships with the JDK; VisualVM was bundled up to JDK 8 and is a separate download since).
  ◦ Remote JMX is disabled by default. If you really need it, you can enable it in cassandra-env.sh.
• DATASTAX OPSCENTER
  ◦ OpsCenter is a monitoring and management solution.
  ◦ It is also capable of system monitoring.
  ◦ Every node needs an OpsCenter agent installed. The agent sends data to the main OpsCenter service, which in turn stores it in a Cassandra keyspace.
  ◦ It is compatible with open source Cassandra only up to version 2.1.
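Reading a metric over JMX from Java looks roughly like the sketch below. To stay runnable without a Cassandra node, it reads the standard java.lang:type=Memory MBean from the JVM's own platform MBean server; the class name JmxReadSketch is made up for the example, and no Cassandra-specific MBean names are assumed. Against a real node, the same getAttribute call would go through a remote connection opened with JMXConnectorFactory on the JMX port.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import javax.management.openmbean.CompositeData;

/** Reads one attribute over JMX, the same way tools such as jconsole do. */
final class JmxReadSketch {

    /** Returns the used heap, in bytes, from the local JVM's Memory MBean. */
    static long usedHeapBytes() {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            ObjectName memory = new ObjectName("java.lang:type=Memory");
            // MXBean attributes read through the MBean server arrive as CompositeData.
            CompositeData usage =
                    (CompositeData) server.getAttribute(memory, "HeapMemoryUsage");
            return (Long) usage.get("used");
        } catch (Exception e) {
            throw new IllegalStateException("could not read the Memory MBean", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("heap used (bytes): " + usedHeapBytes());
    }
}
```

Because JMX only exposes current values, anything that needs history or graphs has to poll attributes like this one and store the results elsewhere.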
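
Spark Cassandra Connector
• Example (Person is assumed to be a serializable JavaBean whose properties map to the table's columns):

// Requires: import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapRowTo;
SparkConf conf = new SparkConf()
    .setAppName("My application");
SparkContext sc = new SparkContext(conf);
// Read my_keyspace.my_table as an RDD of Person objects.
JavaRDD<Person> personRdd = CassandraJavaUtil.javaFunctions(sc)
    .cassandraTable("my_keyspace", "my_table", mapRowTo(Person.class));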
Thank You !!!
