1. NOSQL KEY-VALUE DATABASE
Lecture 2
Dr. Shaimaa Galal
Review Question
• What is the main challenge for traditional databases?
Managing semi-structured and unstructured data.
Managing large amounts of structured data.
Question
Key-value database
• A key-value database is a system that stores values indexed by keys. It can store structured and unstructured data.
• Focus: scaling to huge amounts of data; designed to handle massive data loads.
• Data model: a (global) collection of key-value pairs.
• Example (DynamoDB):
  • Items have one or more attributes (name, value).
  • An attribute can be single-valued or multi-valued (e.g., a set).
  • Items are combined into a table.
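The data model above can be pictured with a small in-memory sketch. This is only an illustration using plain Python dictionaries, not the DynamoDB API; the key names, attribute names, and the `get_attribute` helper are hypothetical.

```python
# Hypothetical in-memory sketch of a DynamoDB-style table (not the real API).
# A table maps each key to an item; an item is a set of (name, value) attributes.
table = {}

# Single-valued attributes are plain values; multi-valued attributes are sets.
table["user:1"] = {
    "name": "Alice",
    "emails": {"alice@example.com", "a@example.org"},
}
table["user:2"] = {"name": "Bob"}  # items need not share the same attributes


def get_attribute(table, key, attr_name):
    """Return one attribute of the item stored under key, or None if absent."""
    item = table.get(key, {})
    return item.get(attr_name)
```

Note that the two items carry different attributes: nothing in the model forces a fixed schema on the values.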
Key-value
Pros:
• very fast
• very scalable (horizontally distributed to nodes based on key)
• simple data model
• eventual consistency
• fault-tolerance
Cons:
- Cannot model more complex data structures, such as objects
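To make "horizontally distributed to nodes based on key" concrete, here is a minimal sketch that assigns each key to a node by hashing it. The node names and the `node_for_key` helper are assumptions for illustration; simple modulo hashing is used for brevity, whereas production systems such as Dynamo and Cassandra rely on consistent hashing so that adding a node moves only a fraction of the keys.

```python
import hashlib


def node_for_key(key: str, nodes: list[str]) -> str:
    """Pick the node responsible for a key by hashing the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


nodes = ["node-a", "node-b", "node-c"]  # hypothetical node names
keys = ["user:1", "user:2", "order:17"]
# The same key always maps to the same node, so any client can locate it.
placement = {key: node_for_key(key, nodes) for key in keys}
```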
Big Data: Google
1. Google Stack Software
• Google developed three major software layers as the foundation for the
Google platform:
1. Google File System (GFS): a distributed cluster file
system that allows all of the disks within a Google
data center to be accessed as one massive, distributed,
redundant file system.
2. MapReduce: a distributed processing framework for
parallelizing algorithms across large numbers of
potentially unreliable servers, capable of
dealing with massive datasets.
3. BigTable: a non-relational database system that uses
the Google File System for storage.
Google Software Architecture
Simple MapReduce Example: WordCount
Map Function
Reduce Function
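The original slide figures for the map and reduce functions are not available in this transcript. As a stand-in, the following is a single-process sketch of WordCount in the MapReduce style; it is not Hadoop or Google's implementation, and the function names and sample documents are assumptions.

```python
from collections import defaultdict


def map_function(document: str):
    """Map: emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def reduce_function(word: str, counts: list[int]):
    """Reduce: sum all counts emitted for one word."""
    return word, sum(counts)


def word_count(documents: list[str]) -> dict[str, int]:
    # Shuffle step: group the mapped values by key (word).
    grouped = defaultdict(list)
    for document in documents:
        for word, count in map_function(document):
            grouped[word].append(count)
    # Reduce step: one call per distinct word.
    return dict(reduce_function(word, counts) for word, counts in grouped.items())


result = word_count(["the quick brown fox", "the lazy dog", "the fox"])
```

In a real cluster the map calls run in parallel on different servers, and the framework performs the grouping step between the two phases.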
MultiStage MapReduce Example
2. Hadoop and Hive
Key-value Database API Functions:
• Basic API access:
• Get(key): extract the value given a key
• Put(key, value): create or update the value given its key
• Delete(key): remove the key and its associated value
• Update(key, value): update the value associated with an existing key
• Execute(key, operation, parameters): invoke an operation on the
value (given its key), where the value is a special data structure
(e.g., List, Set, Map, etc.)
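The API functions listed above can be sketched as a small in-memory class. This is a hypothetical illustration, not the interface of any specific product; in particular, treating `update` as valid only for existing keys and implementing `execute` by calling a method of the stored value by name are assumptions.

```python
class KeyValueStore:
    """Hypothetical in-memory store exposing the five operations listed above."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        # Returns None when the key is absent.
        return self._data.get(key)

    def put(self, key, value):
        # Creates the key or overwrites its current value.
        self._data[key] = value

    def delete(self, key):
        # Removing a missing key is treated as a no-op.
        self._data.pop(key, None)

    def update(self, key, value):
        # Assumption: update only succeeds for an existing key.
        if key not in self._data:
            raise KeyError(key)
        self._data[key] = value

    def execute(self, key, operation, *parameters):
        # Invoke a method of the stored data structure by name.
        return getattr(self._data[key], operation)(*parameters)


store = KeyValueStore()
store.put("cart:42", ["apple"])
store.execute("cart:42", "append", "pear")  # list operation on the stored value
```

The `execute` call shows why this operation is useful: the list is modified in place inside the store, without reading the whole value back to the client and writing it again.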
Key-value Platforms
| Name | Producer | Data model | Querying |
|---|---|---|---|
| SimpleDB | Amazon | set of pairs (key, {attribute}), where an attribute is a pair (name, value) | restricted SQL; select, delete, GetAttributes, and PutAttributes operations |
| Redis | Salvatore Sanfilippo | set of pairs (key, value), where the value is a simple typed value, a list, an ordered (according to ranking) or unordered set, or a hash value | primitive operations for each value type |
| Dynamo | Amazon | like SimpleDB | simple get operation, and put in a context |
| Voldemort | LinkedIn | like SimpleDB | similar to Dynamo |
Apache Cassandra
• A free and open-source distributed NoSQL database
management system.
• Handles large amounts of data across many commodity
servers, providing high availability with no single point
of failure.
• It was started at Facebook and is now an open-source
Apache project written in Java.
DataStax Astra
Apache Cassandra - Advantages
1. Cassandra is designed as a distributed server, but it
can also run as a single node.
2. Horizontal scalability (distributed storage).
3. Fast responses even as demand grows.
4. High write speeds to manage incremental data volumes.
5. Ability to change the data structure.
6. A simple API for your favorite programming language.
7. Automatic fault detection and fault tolerance.
8. No single point of failure: each
node knows about the others.
9. Decentralized.
10. Allows the use of Hadoop to run MapReduce jobs.
Apache Cassandra - Disadvantages
1. Limited ad-hoc queries: you must model your data
around the queries, rather than around the
structure of the data.
2. No aggregations: because Cassandra is a key-value
store, functions like Sum, Min, Max,
and Average are incredibly resource-intensive, if
they are possible at all.
3. Unpredictable performance: Cassandra runs
many different asynchronous jobs
in the background.
Comparing Alternatives
Cassandra Gossip Protocol
• What is the Gossip protocol?
Gossip is the messaging system that Cassandra nodes and
virtual nodes use to exchange state information and stay
consistent with each other.
Each node holds a replica of the data. If something goes
wrong on one node, another replica can respond. The
replication_factor parameter, set when a keyspace
(database) is created, indicates how many machines in the
cluster will receive copies of the same data.
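The idea of gossip can be illustrated with a toy simulation in which each node repeatedly exchanges what it knows with one random peer until every node knows about all the others. This is a simplified sketch, not Cassandra's actual implementation; the `Node` class, the version numbers, and the round structure are assumptions.

```python
import random


class Node:
    def __init__(self, name):
        self.name = name
        # What this node believes about each node: name -> (version, status).
        self.state = {name: (1, "UP")}

    def gossip_with(self, peer):
        """Exchange state with one peer; both keep the newest version of each entry."""
        merged = dict(self.state)
        for name, entry in peer.state.items():
            if name not in merged or entry[0] > merged[name][0]:
                merged[name] = entry
        self.state = dict(merged)
        peer.state = dict(merged)


def gossip_round(nodes, rng):
    """Each node gossips with one randomly chosen other node."""
    for node in nodes:
        peer = rng.choice([n for n in nodes if n is not node])
        node.gossip_with(peer)


rng = random.Random(0)  # fixed seed so runs are repeatable
cluster = [Node(f"node-{i}") for i in range(5)]
rounds = 0
# Stop once every node knows about every other node (bounded for safety).
while rounds < 1000 and any(len(n.state) < len(cluster) for n in cluster):
    gossip_round(cluster, rng)
    rounds += 1
```

No node acts as a coordinator here: information spreads peer to peer, which is why this style of protocol fits a decentralized cluster.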
Key-Value Concepts
• Cassandra manages columns and column families.
• A column family is a container of rows, each of which contains columns.
• A keyspace is analogous to a database in the relational
model, but without relationships between its contents; it is
the container that stores the data.
• Keyspaces require some attributes to be defined,
such as a user-defined name, a replication strategy, and
others.
Key-Value Concepts
• Keyspaces require the following consistency-related
configuration:
1. The replication factor, which indicates how much
performance you are willing to trade for consistency.
2. The replica placement strategy, which indicates how
the replicas are placed in the ring, such as
SimpleStrategy, OldNetworkTopologyStrategy, and
NetworkTopologyStrategy.
• Read more: https://docs.datastax.com/en/cassandra-oss/2.1/cassandra/architecture/architectureDataDistributeReplication_c.html#architectureDataDistributeReplication_c__networkToplogyStrategy-ph
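The interaction between the replication factor and a ring-based placement strategy can be sketched as follows. The sketch follows the general idea of SimpleStrategy (the first replica goes to the node that owns the key's position on the ring, and further replicas go to the next nodes clockwise). The `token` function, the 32-bit token space, the MD5 hash, and the node names are assumptions for illustration and do not reproduce Cassandra's partitioner.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (hypothetical 32-bit token space)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16) % (2 ** 32)


def replicas_for_key(key: str, nodes: list[str], replication_factor: int) -> list[str]:
    """Return the nodes that store a key: the owner plus the next nodes clockwise."""
    ring = sorted((token(node), node) for node in nodes)
    tokens = [t for t, _ in ring]
    # First node whose token is >= the key's token, wrapping around the ring.
    start = bisect.bisect_left(tokens, token(key)) % len(ring)
    # A key cannot have more replicas than there are nodes.
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


nodes = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical node names
owners = replicas_for_key("user:1", nodes, replication_factor=3)
```

With a replication factor of 3, any two of the three owners can fail and the key is still readable from the remaining one, at the cost of writing every value three times.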
CQL (Cassandra Query Language)
• CQL offers a syntax close to SQL for creating schemas
and manipulating data.
Some of the features CQL has are:
• Data types
• Data definition
• Data manipulation
• Secondary indexes
• Materialized views
• Security
• Functions
• Arithmetic operations
• JSON support
• Triggers
CQL Example
Use Case

2. Lecture2_NOSQL_KeyValue.ppt
