Introduction To NOSQL
Databases
1
Lecture 1
Dr. Shaimaa Galal
Course Resources
• Text book: Part I (Chapters 1-7) - copyright 2015
Next Generation Databases
NoSQL, NewSQL, and Big Data
• Lab tool: Apache Cassandra on
https://www.datastax.com/
2
Additional Resources
• YouTube: NoSQL Database Tutorial – Full Course for
Beginners
https://www.coursera.org/learn/nosql-databases
• Coursera : NOSQL Systems
https://www.coursera.org/learn/nosql-databases
3
Background
• Relational databases → mainstay of business
• Web-based applications caused spikes
• explosion of social media sites (Facebook, Twitter) with large data
needs and various data types such as images, videos, and
documents.
• rise of cloud-based solutions such as Amazon S3 (simple storage
solution)
• Hooking RDBMS to web-based application becomes
trouble due to big data.
4
Current day trends
• Big Data
• Cloud Computing
• Solid State Disk
5
The solution is to scale up
Issues with scaling up
• Limits to scaling up (or vertical scaling: make a “single”
machine more powerful) → dataset is just too big!
6
Issues with scaling out
• Scaling out (or horizontal
scaling: adding more
smaller/cheaper servers) is a
better choice
• Different horizontal scaling
approaches (multi-node
database):
• Master/Slave.
• Sharding (partitioning)
7
Scaling out RDBMS: Master/Slave
• Master/Slave
• All writes are written to the master
• All reads performed against the replicated slave databases
• Critical reads may be incorrect as writes may not have been
propagated down
• Large datasets can pose problems as master needs to
duplicate data to slaves
8
Writes
Reads
Multi-Master replication
9
• INSERT only, not UPDATES/DELETES
• No JOINs, thereby reducing query time (this involves de-
normalizing data)
Scaling out RDBMS: Sharding
• Sharding (Partitioning) involves partitioning the data
across multiple databases based on a key-attribute.
• Scales well for both reads and writes
• Not transparent, application needs to be partition-aware
• Can no longer have relationships/joins across partitions
(loss of referential integrity across shards).
10
Scaling out RDBMS: Sharding
11
Other ways to scale out RDBMS
• In-memory databases
• Primarily relies on main memory for data storage.
• Faster than disk-optimized databases because disk
access is slower than memory access and the internal
optimization algorithms are simpler and execute fewer
CPU instructions.
• Accessing data in memory eliminates querying seek time.
12
Other ways to scale out RDBMS
• NewSQL
• Is a database that retain main key characteristics of
RDBMS but different from the common architecture
exhibited by Oracle and SQL.
• There are two designs:
1. H-Store: pure distributed InMemory database.
2. C-Store: Involves columnar database design
13
14
Three waves of database technology
What is NOSQL?
• The Name:
• Stands for Not Only SQL
• The term NOSQL was introduced by Carl Strozzi in 1998 to name
his file-based database
• It was again re-introduced by Eric Evans when an event was
organized to discuss open source distributed databases
• Eric states that “… but the whole point of seeking alternatives is
that you need to solve a problem that relational databases are a
bad fit for. …”
15
Who is using them?
16
What is NOSQL?
• Key features
(advantages):
• Non-relational and doesn’t
require schema.
• Data are replicated to multiple
nodes (fault-tolerant):
• down nodes easily replaced
• no single point of failure
• Horizontal scalable
• cheap, easy to implement
• Massive write performance
• Fast key-value access
17
18
Recap ACID Properties
19
What is NOSQL?
• Disadvantages:
• Don’t fully support relational features
• no join, group by, order by operations (except within partitions)
• no referential integrity constraints across partitions
• No declarative query language (e.g., SQL) → more programming
• Relaxed ACID (see CAP theorem) → fewer guarantees
• No easy integration with other applications that support SQL
20
Discussion
Are NoSQL databases better than
relational databases?
21
CAP Theorem
23
All client always have the
same view of the data
Consistency
Partition
tolerance
Availability
CAP Theorem
24
Each client always can
read and write.
Consistency
Partition
tolerance
Availability
CAP Theorem
25
A system can continue to
operate in the presence of
a network partitions
Consistency
Partition
tolerance
Availability
CAP Theorem (Consistency)
26
Write X row Read X row Is it the last
value of X?
Client A Client B
Answer is: Maybe,
however eventual
consistency occurs.
CAP Theorem
• Brewer’s CAP Theorem:
• For any system sharing data, it is “impossible” to guarantee
simultaneously all of these three properties
• You can have at most two of these three properties for any
shared-data system
• Very large systems will “partition” at some point:
• That leaves either C or A to choose from (traditional DBMS
prefers C over A and P )
• In almost all cases, you would choose A over C (except in
specific applications such as order processing)
• Example: Apache Cassandra focus on AP, while MongoDB
and Kafka focus on CP.
27
CAP Theorem summary
• ACID
• A DBMS is expected to support “ACID transactions,” processes that
are:
• Atomicity: either the whole process is done or none is
• Consistency: only valid data are written
• Isolation: one operation at a time
• Durability: once committed, it stays that way
• CAP
• Consistency: all data on cluster has the same copies
• Availability: cluster always accepts reads and writes
• Partition tolerance: guaranteed properties are maintained even
when network failures prevent some machines from communicating
with others
28
NO-SQL categories
1. Key-value
• Example: DynamoDB, Voldermort, Scalaris
2. Document-based
• Example: MongoDB, CouchDB
3. Column-based
• Example: BigTable, Cassandra, Hbased
4. Graph-based
• Example: Neo4J, InfoGrid
• “No-schema” is a common characteristics of most NOSQL
storage systems
• Provide “flexible” data types
29
Next Lecture
Key-value Database
30

1. Lecture1_NOSQL_Introduction.pdf

  • 1.
  • 2.
    Course Resources • Textbook: Part I (Chapters 1-7) - copyright 2015 Next Generation Databases NoSQL, NewSQL, and Big Data • Lab tool: Apache Cassandra on https://www.datastax.com/ 2
  • 3.
    Additional Resources • YouTube:NoSQL Database Tutorial – Full Course for Beginners https://www.coursera.org/learn/nosql-databases • Coursera : NOSQL Systems https://www.coursera.org/learn/nosql-databases 3
  • 4.
    Background • Relational databases→ mainstay of business • Web-based applications caused spikes • explosion of social media sites (Facebook, Twitter) with large data needs and various data types such as images, videos, and documents. • rise of cloud-based solutions such as Amazon S3 (simple storage solution) • Hooking RDBMS to web-based application becomes trouble due to big data. 4
  • 5.
    Current day trends •Big Data • Cloud Computing • Solid State Disk 5 The solution is to scale up
  • 6.
    Issues with scalingup • Limits to scaling up (or vertical scaling: make a “single” machine more powerful) → dataset is just too big! 6
  • 7.
    Issues with scalingout • Scaling out (or horizontal scaling: adding more smaller/cheaper servers) is a better choice • Different horizontal scaling approaches (multi-node database): • Master/Slave. • Sharding (partitioning) 7
  • 8.
    Scaling out RDBMS:Master/Slave • Master/Slave • All writes are written to the master • All reads performed against the replicated slave databases • Critical reads may be incorrect as writes may not have been propagated down • Large datasets can pose problems as master needs to duplicate data to slaves 8 Writes Reads
  • 9.
    Multi-Master replication 9 • INSERTonly, not UPDATES/DELETES • No JOINs, thereby reducing query time (this involves de- normalizing data)
  • 10.
    Scaling out RDBMS:Sharding • Sharding (Partitioning) involves partitioning the data across multiple databases based on a key-attribute. • Scales well for both reads and writes • Not transparent, application needs to be partition-aware • Can no longer have relationships/joins across partitions (loss of referential integrity across shards). 10
  • 11.
    Scaling out RDBMS:Sharding 11
  • 12.
    Other ways toscale out RDBMS • In-memory databases • Primarily relies on main memory for data storage. • Faster than disk-optimized databases because disk access is slower than memory access and the internal optimization algorithms are simpler and execute fewer CPU instructions. • Accessing data in memory eliminates querying seek time. 12
  • 13.
    Other ways toscale out RDBMS • NewSQL • Is a database that retain main key characteristics of RDBMS but different from the common architecture exhibited by Oracle and SQL. • There are two designs: 1. H-Store: pure distributed InMemory database. 2. C-Store: Involves columnar database design 13
  • 14.
    14 Three waves ofdatabase technology
  • 15.
    What is NOSQL? •The Name: • Stands for Not Only SQL • The term NOSQL was introduced by Carl Strozzi in 1998 to name his file-based database • It was again re-introduced by Eric Evans when an event was organized to discuss open source distributed databases • Eric states that “… but the whole point of seeking alternatives is that you need to solve a problem that relational databases are a bad fit for. …” 15
  • 16.
    Who is usingthem? 16
  • 17.
    What is NOSQL? •Key features (advantages): • Non-relational and doesn’t require schema. • Data are replicated to multiple nodes (fault-tolerant): • down nodes easily replaced • no single point of failure • Horizontal scalable • cheap, easy to implement • Massive write performance • Fast key-value access 17
  • 18.
  • 19.
  • 20.
    What is NOSQL? •Disadvantages: • Don’t fully support relational features • no join, group by, order by operations (except within partitions) • no referential integrity constraints across partitions • No declarative query language (e.g., SQL) → more programming • Relaxed ACID (see CAP theorem) → fewer guarantees • No easy integration with other applications that support SQL 20
  • 21.
    Discussion Are NoSQL databasesbetter than relational databases? 21
  • 22.
    CAP Theorem 23 All clientalways have the same view of the data Consistency Partition tolerance Availability
  • 23.
    CAP Theorem 24 Each clientalways can read and write. Consistency Partition tolerance Availability
  • 24.
    CAP Theorem 25 A systemcan continue to operate in the presence of a network partitions Consistency Partition tolerance Availability
  • 25.
    CAP Theorem (Consistency) 26 WriteX row Read X row Is it the last value of X? Client A Client B Answer is: Maybe, however eventual consistency occurs.
  • 26.
    CAP Theorem • Brewer’sCAP Theorem: • For any system sharing data, it is “impossible” to guarantee simultaneously all of these three properties • You can have at most two of these three properties for any shared-data system • Very large systems will “partition” at some point: • That leaves either C or A to choose from (traditional DBMS prefers C over A and P ) • In almost all cases, you would choose A over C (except in specific applications such as order processing) • Example: Apache Cassandra focus on AP, while MongoDB and Kafka focus on CP. 27
  • 27.
    CAP Theorem summary •ACID • A DBMS is expected to support “ACID transactions,” processes that are: • Atomicity: either the whole process is done or none is • Consistency: only valid data are written • Isolation: one operation at a time • Durability: once committed, it stays that way • CAP • Consistency: all data on cluster has the same copies • Availability: cluster always accepts reads and writes • Partition tolerance: guaranteed properties are maintained even when network failures prevent some machines from communicating with others 28
  • 28.
    NO-SQL categories 1. Key-value •Example: DynamoDB, Voldermort, Scalaris 2. Document-based • Example: MongoDB, CouchDB 3. Column-based • Example: BigTable, Cassandra, Hbased 4. Graph-based • Example: Neo4J, InfoGrid • “No-schema” is a common characteristics of most NOSQL storage systems • Provide “flexible” data types 29
  • 29.