(
NoSQL DataBase
-
Column-Oriented
)
Ehsan Javanmard
Javanmard.ehsan@gmail.com
Introduction to NoSQL Databases
• Rigid (fixed schema)
• Stable Data
• Structured Data
• Flexible (schema-free)
• Scalable Data
• Unstructured Data (Semi-Structured)
• SQL databases store data in form of tables, rows, columns and records.
• This data is stored in a pre-defined data model.
Structured Query Language
Apps/Devices
Logs
&
Sensors Data
Extract load
transform
NoSQL database
A NoSQL database is a database that provides a mechanism to store and retrieve data
other than the tabular relations used in relational databases.
These databases are:
• schema-free
• support easy replication
• have simple API
• horizontal scaling
• Consistent or eventually consistent
• and can handle huge amounts of data
Relational Database NoSql Database
Supports powerful query language. Supports very simple query language.
It has a fixed schema. No fixed schema.
Follows ACID (Atomicity, Consistency,
Isolation, and Durability).
It is only “eventually consistent”.
Supports transactions. Does not support transactions.
Schema Free
Scaling
Replication
CAP theorem
CAP theorem(Eric Brewer)
New article: CAP Twelve Years Later: How the "Rules" Have Changed
Types of NoSQL databases:
• Document Oriented
• Key Value
• Column Oriented
• Graph
Apache Cassandra is an open source, high-performance
decentralized/distributed storage system (NoSQL database) from
Apache that is highly scalable and designed to manage very large
amounts of data spread out across the world.
It provides highly Available/fault-tolerant service with no single point of
failure(a part of a system that, if it fails, will stop the entire system from working).
• based on Amazon’s Dynamo and its data model on
Google’s Bigtable.
• Cassandra implements a Dynamo-style replication
model with no single point of failure, but adds a more
powerful “column family” data model.
• It was open-sourced by Facebook in July 2008.
• Cassandra was accepted into Apache Incubator in
March 2009.
• It was made an Apache top-level project since
February 2010.
What kind of NoSQL database Cassandra is?
Cassandra is a Column oriented database.
columnar databases store data in columns rather than in rows. These databases provide high data compression
rates, thus saving disk space and leading to faster query execution.
Cassandra is being used by some of the biggest companies such as:
• Apple: with over 75,000 nodes storing over 10 PB of data
• Netflix: 2,500 nodes, 420 TB, over 1 trillion requests per day
• eBay: over 100 nodes, 250 TB
Features of
Cassandra:
Cassandra accommodates all
possible data formats including:
structured, semi-structured, and
unstructured.
Cassandra is highly scalable; Read and write throughput
both increase linearly as new machines are added, with
no downtime or interruption to applications.
Cassandra provides the flexibility to
distribute data where you need by
replicating data across multiple data
centers.
Cassandra performs fast writes and can store
hundreds of terabytes of data, without
sacrificing the read efficiency.
Cassandra has no single
point of failure and it is
continuously available
for business-critical
applications that cannot
afford a failure.
(CQL)
Components of Cassandra:
Components of Cassandra:
•Node
•Data center
•Cluster
•Commit log
•Mem-table
•SSTable
•Bloom filter
Components of Cassandra:
• Node − It is the place where data is stored.
• Data center − It is a collection of related nodes.
• Cluster − A cluster is a component that contains one or more data centers.
• Nodes are arranged in ring format.
• Peer to peer communication protocol.
• No master-slave concept, and hence no single point of failure.
• There are no network bottlenecks.
• All the nodes in a cluster play the same role. Each node is independent and at the same
time interconnected to other nodes.
• Each node in a cluster can accept read and write requests, regardless of where the data
is actually located in the cluster.
• nodes communicate with each other and detect any faulty nodes in the cluster.
• When a node goes down, read/write requests can be served from other nodes in the
network.
Cassandra Architecture
Client
• the cluster does not have a master node, so any read and write can be
handled by any node in the cluster.
• Cassandra is known to have ‘masterless’ architecture because it treats all
nodes the same.
Cassandra Architecture
Data is replicated to multiple nodes.
Replication
Client
Replication
Replication factor (RF)
describes how many copies of your data exist.
Replication Factor = 3
Write/Read Consistency Levels
• One
• Two
• Three
• Quorum
• All
Cluster Size: 8
Replication Factor = 3
Write Consistency Level: 3
Read Consistency Level: 3
Coordinator plays a proxy between the client and the nodes holding the data.
Node
1
Node
2
Node
3
Node
4
Node
5
Node
6
Node
8
Node7
Coordinator
Strong Consistent vs eventual consistent
Write
Consistency
Level
Read
Consistency
Level
https://www.ecyrd.com/cassandracalculator/
Fault tolerance
Components of Cassandra: (cont.)
• Commit log − The commit log is a crash-recovery mechanism in Cassandra. Every write operation is written
to the commit log.
• Mem-table − A mem-table is a memory-resident data structure. After commit log, the data will be written to
the mem-table.
• SSTable − It is a disk file to which the data is flushed from the mem-table when its contents reach a threshold
value.
• Bloom filter − These are nothing but quick, nondeterministic, algorithms for testing whether an element is a
member of a set. Cassandra uses Bloom filters to determine whether an SSTable has data for a particular row.
Write and Read Operations
Write Operations: Every write activity of nodes is captured by the commit logs written in the nodes.
Later the data will be captured and stored in the mem-table. Whenever the mem-table is full, data will be
written into the SStable data file. All writes are automatically partitioned and replicated throughout the
cluster.
Cassandra periodically consolidates the SSTables, discarding unnecessary data.
Read Operations: During read operations, Cassandra gets values from the mem-table and checks the
bloom filter to find the appropriate SSTable that holds the required data.
Cassandra Query Language (CQL)
Users can access Cassandra through its nodes using Cassandra Query Language (CQL).
• Programmers use Cassandra query language shell ”cqlsh”: a prompt to
work with CQL
• or separate application language drivers.
Cassandra Data Model
• Cluster
• Keyspace
• Column family
Cassandra Data Model
• Cluster
Cassandra database is distributed over several machines that operate together. Cassandra
arranges the nodes in a cluster, in a ring format, and assigns data to them.
• Keyspace
• Column family
Cassandra Data Model
• Cluster
• Keyspace
is the outermost container for data in Cassandra. Keyspace is a container for a list of one or more
column families.
• Replication factor: the number of machines in the cluster that will receive copies of the same data.
• Replica placement strategy: It is nothing but the strategy to place replicas in the ring. such as:
• simple strategy
• network topology strategy (datacenter-shared strategy).
• Column family
CREATE KEYSPACE Keyspace name
WITH replication = {'class': 'SimpleStrategy', 'replication_factor'
: 3};
Cassandra Data Model
• Cluster
• Keyspace
• Column family
is a container for an ordered collection of rows. Each row, in turn, is an ordered collection of columns.
Column families represent the structure of your data.
Each keyspace has at least one and often many column families.
Each column family can be compared to a container of rows in an RDBMS table.
The difference is that various rows do not have to have the same columns.
column
A column is the basic data structure of Cassandra.
Cassandra does not force individual rows to have all the columns.
SuperColumn
A super column is a special column. It stores a map of sub-columns.
Data Model comparison between RDBMS and Cassandra
RDBMS Cassandra
RDBMS deals with structured data. Cassandra deals with unstructured data.
It has a fixed schema. Cassandra has a flexible schema.
In RDBMS, a table is an array of arrays. (ROW x
COLUMN)
In Cassandra, a table is a list of “nested key-value
pairs”. (ROW x COLUMN key x COLUMN value)
Database is the outermost container that contains
data corresponding to an application.
Keyspace is the outermost container that contains
data corresponding to an application.
Tables are the entities of a database. Tables or column families are the entity of a keyspace.
Row is an individual record in RDBMS. Row is a unit of replication in Cassandra.
Column represents the attributes of a relation. Column is a unit of storage in Cassandra.
RDBMS supports the concepts of foreign keys, joins. Relationships are represented using collections.
Installing Cassandra
Data Type Constants Description
ascii strings Represents ASCII character string
bigint bigint Represents 64-bit signed long
blob blobs Represents arbitrary bytes
Boolean booleans Represents true or false
counter integers Represents counter column
decimal integers, floats Represents variable-precision decimal
double integers Represents 64-bit IEEE-754 floating point
float integers, floats Represents 32-bit IEEE-754 floating point
inet strings Represents an IP address, IPv4 or IPv6
int integers Represents 32-bit signed int
text strings Represents UTF8 encoded string
timestamp integers, strings Represents a timestamp
timeuuid uuids Represents type 1 UUID
uuid uuids Represents type 1 or type 4 UUID
varchar strings Represents uTF8 encoded string
varint integers Represents arbitrary-precision integer
http://cassandra.apache.org/
Good Luck
Ehsan Javanmard
Javanmard.ehsan@gmail.com

Cassandra Learning

  • 1.
  • 2.
  • 3.
    • Rigid (fixedschema) • Stable Data • Structured Data • Flexible (schema-free) • Scalable Data • Unstructured Data (Semi-Structured)
  • 4.
    • SQL databasesstore data in form of tables, rows, columns and records. • This data is stored in a pre-defined data model.
  • 6.
  • 10.
  • 13.
  • 15.
    NoSQL database A NoSQLdatabase is a database that provides a mechanism to store and retrieve data other than the tabular relations used in relational databases. These databases are: • schema-free • support easy replication • have simple API • horizontal scaling • Consistent or eventually consistent • and can handle huge amounts of data
  • 16.
    Relational Database NoSqlDatabase Supports powerful query language. Supports very simple query language. It has a fixed schema. No fixed schema. Follows ACID (Atomicity, Consistency, Isolation, and Durability). It is only “eventually consistent”. Supports transactions. Does not support transactions.
  • 19.
  • 20.
  • 21.
  • 22.
  • 23.
    CAP theorem(Eric Brewer) Newarticle: CAP Twelve Years Later: How the "Rules" Have Changed
  • 25.
    Types of NoSQLdatabases: • Document Oriented • Key Value • Column Oriented • Graph
  • 29.
    Apache Cassandra isan open source, high-performance decentralized/distributed storage system (NoSQL database) from Apache that is highly scalable and designed to manage very large amounts of data spread out across the world. It provides highly Available/fault-tolerant service with no single point of failure(a part of a system that, if it fails, will stop the entire system from working).
  • 30.
    • based onAmazon’s Dynamo and its data model on Google’s Bigtable. • Cassandra implements a Dynamo-style replication model with no single point of failure, but adds a more powerful “column family” data model. • It was open-sourced by Facebook in July 2008. • Cassandra was accepted into Apache Incubator in March 2009. • It was made an Apache top-level project since February 2010.
  • 31.
    What kind ofNoSQL database Cassandra is? Cassandra is a Column oriented database. columnar databases store data in columns rather than in rows. These databases provide high data compression rates, thus saving disk space and leading to faster query execution.
  • 32.
    Cassandra is beingused by some of the biggest companies such as: • Apple: with over 75,000 nodes storing over 10 PB of data • Netflix: 2,500 nodes, 420 TB, over 1 trillion requests per day • eBay: over 100 nodes, 250 TB
  • 33.
    Features of Cassandra: Cassandra accommodatesall possible data formats including: structured, semi-structured, and unstructured. Cassandra is highly scalable; Read and write throughput both increase linearly as new machines are added, with no downtime or interruption to applications. Cassandra provides the flexibility to distribute data where you need by replicating data across multiple data centers. Cassandra performs fast writes and can store hundreds of terabytes of data, without sacrificing the read efficiency. Cassandra has no single point of failure and it is continuously available for business-critical applications that cannot afford a failure. (CQL)
  • 34.
  • 35.
    Components of Cassandra: •Node •Datacenter •Cluster •Commit log •Mem-table •SSTable •Bloom filter
  • 36.
    Components of Cassandra: •Node − It is the place where data is stored. • Data center − It is a collection of related nodes. • Cluster − A cluster is a component that contains one or more data centers.
  • 38.
    • Nodes arearranged in ring format. • Peer to peer communication protocol. • No master-slave concept, and hence no single point of failure. • There are no network bottlenecks.
  • 39.
    • All thenodes in a cluster play the same role. Each node is independent and at the same time interconnected to other nodes. • Each node in a cluster can accept read and write requests, regardless of where the data is actually located in the cluster. • nodes communicate with each other and detect any faulty nodes in the cluster. • When a node goes down, read/write requests can be served from other nodes in the network. Cassandra Architecture Client
  • 40.
    • the clusterdoes not have a master node, so any read and write can be handled by any node in the cluster. • Cassandra is known to have ‘masterless’ architecture because it treats all nodes the same. Cassandra Architecture
  • 41.
    Data is replicatedto multiple nodes. Replication Client
  • 42.
  • 43.
    Replication factor (RF) describeshow many copies of your data exist. Replication Factor = 3
  • 44.
    Write/Read Consistency Levels •One • Two • Three • Quorum • All Cluster Size: 8 Replication Factor = 3 Write Consistency Level: 3 Read Consistency Level: 3 Coordinator plays a proxy between the client and the nodes holding the data. Node 1 Node 2 Node 3 Node 4 Node 5 Node 6 Node 8 Node7 Coordinator
  • 45.
    Strong Consistent vseventual consistent Write Consistency Level Read Consistency Level
  • 46.
  • 47.
  • 48.
    Components of Cassandra:(cont.) • Commit log − The commit log is a crash-recovery mechanism in Cassandra. Every write operation is written to the commit log. • Mem-table − A mem-table is a memory-resident data structure. After commit log, the data will be written to the mem-table. • SSTable − It is a disk file to which the data is flushed from the mem-table when its contents reach a threshold value. • Bloom filter − These are nothing but quick, nondeterministic, algorithms for testing whether an element is a member of a set. Cassandra uses Bloom filters to determine whether an SSTable has data for a particular row.
  • 49.
    Write and ReadOperations Write Operations: Every write activity of nodes is captured by the commit logs written in the nodes. Later the data will be captured and stored in the mem-table. Whenever the mem-table is full, data will be written into the SStable data file. All writes are automatically partitioned and replicated throughout the cluster. Cassandra periodically consolidates the SSTables, discarding unnecessary data. Read Operations: During read operations, Cassandra gets values from the mem-table and checks the bloom filter to find the appropriate SSTable that holds the required data.
  • 50.
    Cassandra Query Language(CQL) Users can access Cassandra through its nodes using Cassandra Query Language (CQL). • Programmers use Cassandra query language shell ”cqlsh”: a prompt to work with CQL • or separate application language drivers.
  • 51.
    Cassandra Data Model •Cluster • Keyspace • Column family
  • 52.
    Cassandra Data Model •Cluster Cassandra database is distributed over several machines that operate together. Cassandra arranges the nodes in a cluster, in a ring format, and assigns data to them. • Keyspace • Column family
  • 53.
    Cassandra Data Model •Cluster • Keyspace is the outermost container for data in Cassandra. Keyspace is a container for a list of one or more column families. • Replication factor: the number of machines in the cluster that will receive copies of the same data. • Replica placement strategy: It is nothing but the strategy to place replicas in the ring. such as: • simple strategy • network topology strategy (datacenter-shared strategy). • Column family CREATE KEYSPACE Keyspace name WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 3};
  • 54.
    Cassandra Data Model •Cluster • Keyspace • Column family is a container for an ordered collection of rows. Each row, in turn, is an ordered collection of columns. Column families represent the structure of your data. Each keyspace has at least one and often many column families. Each column family can be compared to a container of rows in an RDBMS table. The difference is that various rows do not have to have the same columns.
  • 55.
    column A column isthe basic data structure of Cassandra. Cassandra does not force individual rows to have all the columns.
  • 56.
    SuperColumn A super columnis a special column. It stores a map of sub-columns.
  • 57.
    Data Model comparisonbetween RDBMS and Cassandra RDBMS Cassandra RDBMS deals with structured data. Cassandra deals with unstructured data. It has a fixed schema. Cassandra has a flexible schema. In RDBMS, a table is an array of arrays. (ROW x COLUMN) In Cassandra, a table is a list of “nested key-value pairs”. (ROW x COLUMN key x COLUMN value) Database is the outermost container that contains data corresponding to an application. Keyspace is the outermost container that contains data corresponding to an application. Tables are the entities of a database. Tables or column families are the entity of a keyspace. Row is an individual record in RDBMS. Row is a unit of replication in Cassandra. Column represents the attributes of a relation. Column is a unit of storage in Cassandra. RDBMS supports the concepts of foreign keys, joins. Relationships are represented using collections.
  • 58.
  • 59.
    Data Type ConstantsDescription ascii strings Represents ASCII character string bigint bigint Represents 64-bit signed long blob blobs Represents arbitrary bytes Boolean booleans Represents true or false counter integers Represents counter column decimal integers, floats Represents variable-precision decimal double integers Represents 64-bit IEEE-754 floating point float integers, floats Represents 32-bit IEEE-754 floating point inet strings Represents an IP address, IPv4 or IPv6 int integers Represents 32-bit signed int text strings Represents UTF8 encoded string timestamp integers, strings Represents a timestamp timeuuid uuids Represents type 1 UUID uuid uuids Represents type 1 or type 4 UUID varchar strings Represents uTF8 encoded string varint integers Represents arbitrary-precision integer
  • 60.
  • 61.