Introduction to Cassandra
Presented on 26th Feb 2014
Scope
• Introduction to Cassandra and NoSQL
• Understanding the Cassandra data model
• Configuration, reading and writing data in Cassandra
• CQL

What is Cassandra
• A Database
• Uses the fully distributed design of Amazon’s Dynamo
• Uses the column-family-based data model of Google’s Bigtable
• Developed at Facebook (the team was led by Jeff Hammerbacher,
with Avinash Lakshman, Karthik Ranganathan, and Prashant Malik
(Search Team))
• Open sourced in 2008

Problems with RDBMS
• Horizontal scaling: in an RDBMS, as the data grows the joins become
slow, so retrieval becomes slow.
• Vertical scaling: adding more hardware, memory, faster processors, or
more disk space. Adding machines creates problems such as data
replication, consistency, and failover mechanisms.
• Caching layer in large systems, such as memcached, Ehcache, or Oracle
Coherence: keeping the cache and the database in sync becomes harder
across a cluster.

Cassandra
• Apache Cassandra is an open source, distributed, decentralized,
elastically scalable, highly available, fault-tolerant, tuneably
consistent, column-oriented database that bases its distribution
design on Amazon’s Dynamo and its data model on Google’s
Bigtable. Created at Facebook.

Why Cassandra
• Fault tolerant
• Decentralized
• Eventually consistent
• Rich data model
• Elastic
• Highly Available
• No SPOF (single point of failure)

CAP theorem
• Eric Brewer of the University of California at Berkeley presented his CAP theorem in 2000.
• The theorem states that within a large-scale distributed data system, there are three
requirements that have a relationship of sliding dependency.
• Consistency: all database clients will read the same value for the same query, even given
concurrent updates.
• Availability: all database clients will always be able to read and write data.
• Partition tolerance: the database can be split across multiple machines, and it can continue
functioning in the face of network segmentation breaks.

CAP theorem (cont.)
• According to the theorem, a distributed data system can strongly support only two of the
three.

• CA: the system blocks when the network partitions, so such a system is typically limited to a
single data center to mitigate this.
• CP: data is sharded in order to scale. The data stays consistent, but some of it may become
unavailable whenever a node goes down.
• AP: the system may return inaccurate data, but it will always be available, even in the
face of network partitioning. DNS is perhaps the most popular example of a system that is
massively scalable, highly available, and partition-tolerant.

Fault Tolerant
• Data is automatically replicated to multiple nodes based on
replication factor.
• Replication across multiple data centers is supported.
• Failed nodes can be replaced with no downtime.
• Uses Accrual Failure Detector for fault detection.

Decentralization
• Every node in the cluster is identical (no client-server architecture).
• There is no single point of failure.

Eventual consistency
• Uses BASE (Basically Available, Soft state, Eventual consistency).
• As the data is replicated, the latest version of a value is sitting on
at least one node in the cluster, but older versions may still be on
other nodes.
• Eventually all nodes will see the latest version.

Eventual consistency (Cont.)
• Tuneable consistency: you set the replication factor to the number of
nodes in the cluster you want updates to propagate to.
• Consistency level is a setting that clients must specify on every
operation. It lets you decide how many replicas in the cluster must
acknowledge a write operation or respond to a read operation for it
to be considered successful. This is how Cassandra pushes the
decision about consistency out to the client. Strict consistency can
therefore be achieved by setting the consistency level to the same
value as the replication factor.
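
The relationship between replication factor and consistency level can be sketched as a small check. This is only an illustration: the function name and parameters are hypothetical, and the overlap rule it uses (reads plus writes greater than the replication factor) is the general quorum argument, which the slide states only for the special case where the consistency level equals the replication factor.

```python
def is_strongly_consistent(replication_factor: int, write_replicas: int, read_replicas: int) -> bool:
    """Return True when every read is guaranteed to overlap the latest write.

    write_replicas / read_replicas are the numbers of replicas that must
    acknowledge a write or answer a read (the consistency level, as a count).
    """
    for count in (write_replicas, read_replicas):
        if not 1 <= count <= replication_factor:
            raise ValueError("consistency level must be between 1 and the replication factor")
    # If the read set and the write set must share at least one replica,
    # a read always sees the most recent acknowledged write.
    return read_replicas + write_replicas > replication_factor


# Consistency level equal to the replication factor (the case on the slide).
print(is_strongly_consistent(3, 3, 3))
# One replica for both reads and writes: fast, but only eventually consistent.
print(is_strongly_consistent(3, 1, 1))
```

Lower counts trade consistency for latency and availability, which is the tuning knob the slide describes.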

Rich Data Model
• Keyspace
• Column family
• Rows
• Column
• Super column

Column family
"ToyStore" : {
  "Toys" : {
    "GumDrop" : {
      "Price" : "0.25",
      "Section" : "Candy"
    },
    "Transformer" : {
      "Price" : "29.99",
      "Section" : "Action Figures"
    },
    "MatchboxCar" : {
      "Price" : "1.49",
      "Section" : "Vehicles"
    }
  }
},
"Keyspace1" : null,
"system" : null

Super Column family

"ToyCorporation" : {
  "ToyStores" : {
    "Ohio Store" : {
      "Transformer" : {"Price" : "29.99", "Section" : "Action Figures"},
      "GumDrop" : {"Price" : "0.25", "Section" : "Candy"},
      "MatchboxCar" : {"Price" : "1.49", "Section" : "Vehicles"}
    },
    "New York Store" : {
      "JawBreaker" : {"Price" : "4.25", "Section" : "Candy"},
      "MatchboxCar" : {"Price" : "8.79", "Section" : "Vehicles"}
    }
  }
}

Keyspace
A keyspace is similar to a schema in an RDBMS. It has a name and a set
of attributes that define keyspace-wide behaviour.
The attributes are:
1. Replication factor: if it is set to 3, then 3 nodes will hold
a copy of each row.
2. Replica placement strategy: SimpleStrategy (formerly
RackUnawareStrategy), OldNetworkTopologyStrategy (formerly RackAwareStrategy), or NetworkTopologyStrategy (formerly DatacenterShardStrategy).
3. Column families: discussed on the following slides.

Column family
• A column family is a container for rows of columns, analogous to a
table in a relational system.
• Each row in a column family holds an ordered list of columns, each of
which is referenced by its column name.

• [Keyspace][ColumnFamily][Key][Column]
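
The four-level addressing above can be pictured as nested maps. The following is a minimal sketch in plain Python dictionaries, reusing the ToyStore data from the earlier slide; `get_column` is a hypothetical helper for illustration and not a Cassandra client call.

```python
# keyspace -> column family -> row key -> column name -> value
data = {
    "ToyStore": {
        "Toys": {
            "GumDrop": {"Price": "0.25", "Section": "Candy"},
            "Transformer": {"Price": "29.99", "Section": "Action Figures"},
            "MatchboxCar": {"Price": "1.49", "Section": "Vehicles"},
        }
    }
}


def get_column(store: dict, keyspace: str, column_family: str, key: str, column: str) -> str:
    """Look up one value by walking the four dimensions in order."""
    return store[keyspace][column_family][key][column]


print(get_column(data, "ToyStore", "Toys", "Transformer", "Price"))
```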

Column family (cont.)
A column family has two attributes: a name and a comparator.
The comparator determines the order in which columns are sorted when
they are returned by a query. The comparator can be one of the following
types: AsciiType, BytesType, LexicalUUIDType, IntegerType, LongType,
TimeUUIDType, or UTF8Type, or a custom type (plug in your own class, which
must extend org.apache.cassandra.db.marshal.AbstractType).
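
To see why the comparator matters, the sketch below sorts the same column names two ways. It is a simplification: the two sort keys merely imitate a byte-wise comparator (such as UTF8Type) and a numeric comparator (such as LongType), and the column names are made up for the example.

```python
column_names = ["10", "9", "100"]

# Byte-wise ordering, as a text comparator would produce.
sorted_as_text = sorted(column_names, key=lambda name: name.encode("utf-8"))

# Numeric ordering, as a long-integer comparator would produce.
sorted_as_long = sorted(column_names, key=int)

print(sorted_as_text)
print(sorted_as_long)
```

The same names come back in a different order, so the comparator should match how the column names will be sliced and read.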

Column family (cont.)
Hotel {
• key: AZC_043 { name: Cambria Suites Hayden, phone: 480-444-4444,
  address: 400 N. Hayden Rd., city: Scottsdale, state: AZ, zip: 85255}
• key: AZS_011 { name: Clarion Scottsdale Peak, phone: 480-333-3333,
  address: 3000 N. Scottsdale Rd, city: Scottsdale, state: AZ, zip: 85255}
• key: CAS_021 { name: W Hotel, phone: 415-222-2222,
  address: 181 3rd Street, city: San Francisco, state: CA, zip: 94103}
• key: NYN_042 { name: Waldorf Hotel, phone: 212-555-5555,
  address: 301 Park Ave, city: New York, state: NY, zip: 10019}
}

Rows
• Cassandra is a column-oriented database. Rows do not all have to
have the same number of columns (as they do in a relational database).
Each row has a unique key, which makes its data accessible.
• Each column family is stored in a separate file.

Columns
• A column is a name/value pair plus a client-supplied timestamp that
records when the column was last updated. A column family is a
container for rows that have similar, but not identical, column sets.
• Every column carries its own timestamp; rows do not have
timestamps.
• A regular column stores its value as a byte array.
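
A column as described above can be modelled as a small record. The sketch below is illustrative only: the `Column` class and `reconcile` helper are hypothetical names, and the newest-timestamp-wins rule is the usual way the timestamp is put to use when replicas disagree, which the slide itself does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: bytes
    value: bytes      # a regular column stores a byte array value
    timestamp: int    # client-supplied, e.g. microseconds since the epoch


def reconcile(first: Column, second: Column) -> Column:
    """Pick the more recently written of two versions of the same column.

    Simplification: on equal timestamps this sketch keeps the first argument.
    """
    if first.name != second.name:
        raise ValueError("can only reconcile versions of the same column")
    return first if first.timestamp >= second.timestamp else second


old = Column(b"Price", b"0.25", timestamp=1_000)
new = Column(b"Price", b"0.30", timestamp=2_000)
print(reconcile(old, new).value)
```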

Super column
• The value of a super column is a map of subcolumns (which store
byte array values).
• It is important to keep columns that you are likely to query together in
the same column family, and a super column can be helpful for this.
• Super columns are not indexed.
• With regular columns, Cassandra looks like a four-dimensional hash
table. With super columns, it becomes more like a five-dimensional hash:
[Keyspace][ColumnFamily][Key][SuperColumn][SubColumn]
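
The extra dimension can be sketched with one more level of nesting, reusing part of the ToyCorporation data from the super column family slide. `get_subcolumn` is a hypothetical helper, not part of any Cassandra API.

```python
# keyspace -> super column family -> row key -> super column -> subcolumn -> value
corporation = {
    "ToyCorporation": {
        "ToyStores": {
            "Ohio Store": {
                "Transformer": {"Price": "29.99", "Section": "Action Figures"},
                "GumDrop": {"Price": "0.25", "Section": "Candy"},
            },
            "New York Store": {
                "JawBreaker": {"Price": "4.25", "Section": "Candy"},
                "MatchboxCar": {"Price": "8.79", "Section": "Vehicles"},
            },
        }
    }
}


def get_subcolumn(store: dict, keyspace: str, column_family: str,
                  key: str, super_column: str, subcolumn: str) -> str:
    """Walk all five dimensions to reach a single subcolumn value."""
    return store[keyspace][column_family][key][super_column][subcolumn]


print(get_subcolumn(corporation, "ToyCorporation", "ToyStores",
                    "New York Store", "MatchboxCar", "Price"))
```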

Some points
• You cannot perform joins in Cassandra. If you have designed a data
model and find that you need a join, you will have to either do the work
on the client side, or create a denormalized second column family
that represents the join results for you.
• It is not possible to sort by value. Columns are sorted by name only,
so that individual columns can be fetched from a row without pulling
the entire row into memory.
• Column sorting is controllable, but key sorting is not: row keys always
sort in byte order.

Elastic/Highly Available
• Read and write throughput both increase linearly as new machines are
added.
• No downtime or interruption to the application.

Sharding basic strategies
• Feature-based or functional segmentation: data is sharded by feature,
so that features with nothing in common live on different shards. For
example, user details and items for sale go on different shards, and so
do movie ratings and comments.
• Key-based sharding: pick a key in the data that will distribute it evenly
across shards. Instead of naively assigning one letter of the alphabet to
each server, you apply a one-way hash to a key data element and
distribute data across machines according to the hash.
• Lookup table: a table that contains information about the location
of the actual data.
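
Key-based sharding as described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: `shard_for` is a hypothetical helper, MD5 is used only as a convenient one-way hash, and taking the hash modulo the shard count is the simplest possible placement rule rather than how any particular database assigns data.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard number using a one-way hash of the key."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Interpret the digest as a big integer and fold it onto the shards.
    return int.from_bytes(digest, "big") % shard_count


for hotel_key in ("AZC_043", "AZS_011", "CAS_021", "NYN_042"):
    print(hotel_key, "-> shard", shard_for(hotel_key, 4))
```

Because the hash scrambles the key, keys that share a prefix (for example the same first letter) do not pile up on one server.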

Design Pattern
1. Materialized view (one table per query): create a second column
family that represents the additional query. “Materialized” means
storing a full copy of the original data so that everything you need to
answer a query is right there, without forcing you to look up the
original data. If the second column family stores only the column
names you use, like foreign keys, and you have to perform a second
query to fetch the original data, that is a secondary index.
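
The difference between the two approaches can be sketched with plain dictionaries standing in for column families. The hotel rows are trimmed from the earlier Hotel slide, and the "hotels by city" query and all helper names are hypothetical, chosen only for illustration.

```python
hotels = {
    "AZC_043": {"name": "Cambria Suites Hayden", "city": "Scottsdale"},
    "AZS_011": {"name": "Clarion Scottsdale Peak", "city": "Scottsdale"},
    "CAS_021": {"name": "W Hotel", "city": "San Francisco"},
}

# Materialized view: city -> full copies of the matching hotel rows.
hotels_by_city_view: dict = {}
# Secondary index: city -> row keys only (like foreign keys).
hotels_by_city_index: dict = {}

for row_key, columns in hotels.items():
    hotels_by_city_view.setdefault(columns["city"], {})[row_key] = dict(columns)
    hotels_by_city_index.setdefault(columns["city"], []).append(row_key)


def names_from_view(city: str) -> list:
    """One lookup: everything needed is stored in the view itself."""
    return [columns["name"] for columns in hotels_by_city_view.get(city, {}).values()]


def names_from_index(city: str) -> list:
    """Two lookups: find the keys, then go back to the original rows."""
    return [hotels[row_key]["name"] for row_key in hotels_by_city_index.get(city, [])]


print(names_from_view("Scottsdale"))
print(names_from_index("Scottsdale"))
```

The view answers the query in one read at the cost of duplicated data that must be kept up to date on every write.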

Design Pattern (Cont.)
2. Valueless column: store the data in the column name and leave the
column value empty. For example, in a user/usercity model the city name
can be the row key and the users of that city the column names.
3. Aggregate key: keys must be unique, so two column values can be
joined with a separator to create an aggregate key.
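
Both patterns can be sketched briefly. The names below (`user_city`, `add_user`, `users_in`, `aggregate_key`) and the ":" separator are assumptions made for the example, not anything prescribed by Cassandra.

```python
# Valueless column: row key = city, column names = users, values left empty.
user_city: dict = {}


def add_user(city: str, user: str) -> None:
    """Record a user under a city; the column name carries all the data."""
    user_city.setdefault(city, {})[user] = b""


def users_in(city: str) -> list:
    """Return the users of a city, sorted by column name."""
    return sorted(user_city.get(city, {}))


def aggregate_key(*parts: str, separator: str = ":") -> str:
    """Join several values into one unique row key."""
    for part in parts:
        if separator in part:
            raise ValueError("a key part must not contain the separator")
    return separator.join(parts)


add_user("Scottsdale", "bob")
add_user("Scottsdale", "alice")
print(users_in("Scottsdale"))
print(aggregate_key("AZ", "Scottsdale"))
```

Rejecting parts that contain the separator keeps two different value pairs from producing the same aggregate key.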

Reference
• Assembled from various resources on the internet.
Thank You
