1. The document discusses key-value databases and provides an overview of Apache Cassandra. It describes Cassandra's data model, advantages like scalability and fault tolerance, and disadvantages like lack of aggregations.
2. It also covers Cassandra's gossip protocol for consistency, use of column families and keyspaces, and CQL for querying. An example use case presented is for user profile data in a social network.
3. Key concepts discussed include Cassandra's distributed architecture, replication factor, data distribution strategies, and CQL features for schema definition and data manipulation.
2. Review Question
• What is the main challenge for traditional databases?
Managing semi-structured and unstructured data.
Managing large amounts of structured data.
4. Key-value database
• A key-value database is a system that stores values indexed
by keys. It can store structured and unstructured data.
• Focuses on scaling to huge amounts of data; designed to
handle massive data loads.
• Data model: a (global) collection of key-value pairs.
• Example: DynamoDB
• Items have one or more attributes (name, value).
• An attribute can be single-valued or multi-valued, like a set.
• Items are combined into a table.
5. Key-value
Pros:
• very fast
• very scalable (horizontally distributed to nodes based on key)
• simple data model
• eventual consistency
• fault-tolerance
Cons:
- Can’t model more complex data structures such as objects
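The "horizontally distributed to nodes based on key" point can be illustrated with a minimal sketch. It assumes simple hash-modulo placement; the helper `node_for_key` and the node names are hypothetical, and production systems typically use consistent hashing instead so that adding a node does not remap most keys.

```python
import hashlib


def node_for_key(key: str, nodes: list[str]) -> str:
    """Pick the node that owns `key` by hashing the key (hypothetical helper)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the digest as an unsigned integer,
    # then map it onto one of the available nodes.
    bucket = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[bucket]


# Hypothetical cluster of three nodes.
nodes = ["node-a", "node-b", "node-c"]
placement = {key: node_for_key(key, nodes) for key in ("user:1", "user:2", "user:3")}
print(placement)
```

The same key always lands on the same node, which is what lets any client locate a value without a central directory.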
7. 1. Google Stack Software
• Google developed major software layers as the foundation for
the Google platform:
1. Google File System (GFS): a distributed cluster file
system that allows all of the disks within the Google
data center to be accessed as one massive, distributed,
redundant file system.
2. MapReduce: a distributed processing framework for
parallelizing algorithms across large numbers of
potentially unreliable servers and being capable of
dealing with massive datasets.
3. BigTable: a nonrelational database system that uses
the Google File System for storage.
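The MapReduce idea from item 2 can be sketched on a single machine as a word count. This is only an illustration of the map, shuffle, and reduce phases; the function names are hypothetical and nothing here is distributed or taken from Google's implementation.

```python
from collections import defaultdict


def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list[int]):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents: list[str]) -> dict[str, int]:
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(word_count(["big data", "big table"]))
```

In a real cluster the map and reduce calls would run on different servers, with the framework handling the shuffle and retrying failed tasks.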
15. Key-value
• Basic API access:
• Get(key): extract the value given a key
• Put(key, value): create or update the value given its key
• Delete(key): remove the key and its associated value
• Update(key, value): create or update the value given its key
• Execute(key, operation, parameters): invoke an operation on the
value (given its key), which is a special data structure (e.g. List, Set,
Map, etc.)
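The operations listed above can be sketched as an in-memory class. This is a minimal illustration, not the API of any particular product: the class `KeyValueStore` is hypothetical, and `execute` is assumed to call a method of the stored Python object by name.

```python
from typing import Any


class KeyValueStore:
    """Hypothetical in-memory store mirroring the basic API on the slide."""

    def __init__(self) -> None:
        self._data: dict[str, Any] = {}

    def get(self, key: str) -> Any:
        # Returns None when the key is absent.
        return self._data.get(key)

    def put(self, key: str, value: Any) -> None:
        # Creates the key or overwrites its current value.
        self._data[key] = value

    def delete(self, key: str) -> None:
        # Removing a missing key is treated as a no-op.
        self._data.pop(key, None)

    def update(self, key: str, value: Any) -> None:
        # As described on the slide, update behaves like put.
        self.put(key, value)

    def execute(self, key: str, operation: str, *parameters: Any) -> Any:
        # Invoke a method of the stored data structure, e.g. "add" on a set.
        return getattr(self._data[key], operation)(*parameters)


store = KeyValueStore()
store.put("user:1:tags", set())
store.execute("user:1:tags", "add", "nosql")
print(store.get("user:1:tags"))
```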
16. Key-value Platforms
Name | Producer | Data model | Querying
SimpleDB | Amazon | set of pairs (key, {attribute}), where each attribute is a pair (name, value) | restricted SQL; select, delete, GetAttributes, and PutAttributes operations
Redis | Salvatore Sanfilippo | set of pairs (key, value), where value is a simple typed value, a list, an ordered (according to ranking) or unordered set, or a hash value | primitive operations for each value type
Dynamo | Amazon | like SimpleDB | simple get operation, and put in a context
Voldemort | LinkedIn | like SimpleDB | similar to Dynamo
17. Apache Cassandra
• A free and open-source distributed NoSQL database
management system.
• Handles large amounts of data across many commodity
servers, providing high availability with no single point
of failure.
• It was started at Facebook and is now an open-source
Apache project written in Java.
20. Apache Cassandra - Advantages
1. Cassandra is designed to be a distributed server, but it
can also run as a single node.
2. Horizontal scalability (distributed storage).
3. Fast responses even as demand grows.
4. High write speeds to handle growing data volumes.
5. Ability to change the data structure.
6. A simple API for your favorite programming language.
7. Automatic fault detection and fault tolerance.
8. There is no single point of failure; each node knows
about the others.
9. Decentralized.
10. Allows the use of Hadoop to run MapReduce jobs.
22. Apache Cassandra - Disadvantages
1. Limited ad-hoc queries: you must model your data
around the queries, rather than around the
structure of the data.
2. No aggregations: because Cassandra is a key-value
store, functions like Sum, Min, Max, and Average are
very resource-intensive, if they can be done at all.
3. Unpredictable performance: Cassandra runs many
asynchronous jobs in the background.
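Modeling data around the queries (item 1) usually means storing the same record once per access pattern. The sketch below illustrates this with plain dictionaries; the names `posts_by_user`, `posts_by_tag`, and `add_post` are hypothetical and stand in for two query-specific tables.

```python
from collections import defaultdict

# One "table" per query: each is keyed by what that query looks up.
posts_by_user = defaultdict(list)  # answers "all posts by this user"
posts_by_tag = defaultdict(list)   # answers "all posts with this tag"


def add_post(user: str, tag: str, text: str) -> None:
    """Write the post to every query-specific table (denormalization)."""
    post = {"user": user, "tag": tag, "text": text}
    posts_by_user[user].append(post)
    posts_by_tag[tag].append(post)


add_post("ana", "nosql", "Trying Cassandra")
add_post("ben", "nosql", "Gossip explained")
add_post("ana", "travel", "Lisbon")

print([p["text"] for p in posts_by_user["ana"]])
print([p["user"] for p in posts_by_tag["nosql"]])
```

Each read is a single key lookup, at the cost of writing every post twice. A query that neither table was designed for would require scanning all the data.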
26. Cassandra Gossip Protocol
• What is the gossip protocol?
Gossip is the peer-to-peer messaging system that Cassandra
nodes (including virtual nodes) use to share state
information and stay consistent with each other.
Each node holds a replica of some data. If a node fails,
another replica can respond. The replication_factor
parameter, set when creating a keyspace (database),
indicates how many machines in the cluster will receive
copies of the same data.
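The way gossip spreads information can be sketched with a toy simulation. It assumes each node tracks a heartbeat counter per known node and that two nodes merge their views by keeping the higher counter; the `Node` class and `run_round` helper are hypothetical and much simpler than Cassandra's actual gossip messages.

```python
import random


class Node:
    """Hypothetical cluster member that gossips heartbeat counters."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.heartbeat = 0
        # What this node believes about the cluster: node name -> heartbeat.
        self.view = {name: 0}

    def tick(self) -> None:
        # Advance this node's own heartbeat so peers can see it is alive.
        self.heartbeat += 1
        self.view[self.name] = self.heartbeat

    def gossip_with(self, peer: "Node") -> None:
        # Exchange views and keep the newest heartbeat seen for every node.
        names = self.view.keys() | peer.view.keys()
        merged = {n: max(self.view.get(n, -1), peer.view.get(n, -1)) for n in names}
        self.view = dict(merged)
        peer.view = dict(merged)


def run_round(nodes: list[Node], rng: random.Random) -> None:
    """One gossip round: every node ticks and contacts one random peer."""
    for node in nodes:
        node.tick()
        peer = rng.choice([n for n in nodes if n is not node])
        node.gossip_with(peer)
```

Information reaches nodes that never talk to each other directly: if `a` gossips with `b` and `b` then gossips with `c`, node `c` learns about `a` at second hand.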
28. Key-Value Concepts
• Cassandra manages columns and column families.
• A column family is a container of rows, each containing columns.
• A keyspace is analogous to a database in the relational
model, but without interrelations (it only stores data).
• Keyspaces require that some attributes be defined,
such as a user-defined name, a replication strategy, and
others.
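Defining a keyspace with its name and replication settings is done with a CQL `CREATE KEYSPACE` statement. The sketch below only builds that statement as a string; the helper `create_keyspace_cql` and the keyspace name `social` are hypothetical, and sending the statement to a cluster would need a driver, which is not shown.

```python
def create_keyspace_cql(name: str, replication_factor: int) -> str:
    """Build a CREATE KEYSPACE statement using SimpleStrategy (hypothetical helper).

    The name is interpolated directly, so this is only suitable for trusted input.
    """
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {name} "
        f"WITH replication = {{'class': 'SimpleStrategy', "
        f"'replication_factor': {replication_factor}}};"
    )


print(create_keyspace_cql("social", 3))
```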
29. Key-Value Concepts
• Keyspaces require configuration related to consistency:
1. The replication factor, which indicates how much
performance you are willing to trade for consistency.
2. The replica placement strategy, which indicates how
the replicas are placed in the ring, such as
SimpleStrategy, OldNetworkTopologyStrategy, and
NetworkTopologyStrategy.
• Read more: https://docs.datastax.com/en/cassandra-oss/2.1/cassandra/architecture/architectureDataDistributeReplication_c.html#architectureDataDistributeReplication_c__networkToplogyStrategy-ph
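Replica placement in the ring can be sketched for the SimpleStrategy case, where the first replica goes to the node that owns the key's token and the remaining replicas go to the next nodes clockwise. The `token` function below is a stand-in based on MD5 (an assumption for illustration, not Cassandra's default partitioner), and the node names are hypothetical.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Hypothetical token function: first 8 bytes of an MD5 digest as an integer."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")


def replicas_for(key: str, ring: list[tuple[int, str]], replication_factor: int) -> list[str]:
    """Return the nodes holding `key`; `ring` is a token-sorted list of (token, node)."""
    if replication_factor > len(ring):
        raise ValueError("replication_factor cannot exceed the number of nodes")
    tokens = [t for t, _ in ring]
    # First node whose token is >= the key's token, wrapping around the ring.
    start = bisect.bisect_left(tokens, token(key)) % len(ring)
    # Walk clockwise to collect the remaining replicas.
    return [ring[(start + i) % len(ring)][1] for i in range(replication_factor)]


ring = sorted((token(name), name) for name in ["node-a", "node-b", "node-c", "node-d"])
print(replicas_for("user:42", ring, 3))
```

With a replication factor of 3 on four nodes, any single node can fail and two copies of every key remain reachable.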
33. CQL (Cassandra Query Language)
• CQL offers a syntax close to SQL for creating schemas
and manipulating data.
Some of the features CQL has are:
• Data types
• Data definition
• Data manipulation
• Secondary indexes
• Materialized views
• Security
• Functions
• Arithmetic operations
• JSON support
• Triggers
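To give a feel for how close CQL is to SQL, the sketch below holds a few CQL statements for data definition and data manipulation as Python strings. The keyspace `social`, the table `user_profiles`, and its columns are hypothetical, and the statements are only printed here; executing them would require a Cassandra driver and a running cluster.

```python
# Data definition: a table keyed by user_id, with a multi-valued set column.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS social.user_profiles (
    user_id uuid PRIMARY KEY,
    name text,
    email text,
    interests set<text>
);
"""

# Data manipulation: uuid() is a built-in CQL function that generates an id.
INSERT_ROW = (
    "INSERT INTO social.user_profiles (user_id, name, email, interests) "
    "VALUES (uuid(), 'Ana', 'ana@example.com', {'nosql', 'travel'});"
)

# Query by primary key; the ? is a placeholder bound by a prepared statement.
SELECT_ROW = "SELECT name, interests FROM social.user_profiles WHERE user_id = ?;"

STATEMENTS = [CREATE_TABLE, INSERT_ROW, SELECT_ROW]

for statement in STATEMENTS:
    print(statement.strip())
```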