Introduction to NoSQL
Chapter 2
What is a NoSQL Database?
 NoSQL is a type of database management system
(DBMS) that is designed to handle and store large
volumes of unstructured and semi-structured data.
 Unlike traditional relational databases that use
tables with pre-defined schemas to store data,
NoSQL databases use flexible data models that can
adapt to changes in data structures and are capable
of scaling horizontally to handle growing amounts
of data.
 A NoSQL Database, also known as a non-SQL or non-relational Database, is a non-tabular Database that stores data differently from the tabular relations used in relational databases.
 Companies widely use NoSQL Databases, generally for big data and real-time web applications.
 NoSQL Databases offer a simple design, horizontal scaling across clusters of machines, and reduce the object-relational impedance mismatch.
 They use different data structures from those used by relational Databases, making some operations faster.
 NoSQL Databases are designed to be flexible,
scalable, and capable of rapidly responding to the
data management demands of modern businesses.
Why NoSQL
 Dynamic schema: NoSQL databases do not have a fixed schema
and can accommodate changing data structures without the need
for migrations or schema alterations.
 Horizontal scalability: NoSQL databases are designed to scale
out by adding more nodes to a database cluster, making them well-
suited for handling large amounts of data and high levels of traffic.
 Document-based: Some NoSQL databases, such as MongoDB, use a document-based data model, where data is stored in a semi-structured format, such as JSON or BSON.
 Key-value-based: Other NoSQL databases, such as Redis, use a
key-value data model, where data is stored as a collection of key-
value pairs.
 Column-based: Some NoSQL databases, such as Cassandra, use a
column-based data model, where data is organized into columns
instead of rows.
 Distributed and high availability: NoSQL databases
are often designed to be highly available and to
automatically handle node failures and data replication
across multiple nodes in a database cluster.
 Flexibility: NoSQL databases allow developers to store
and retrieve data in a flexible and dynamic manner, with
support for multiple data types and changing data
structures.
 Performance: NoSQL databases are optimized for high
performance and can handle a high volume of reads and
writes, making them suitable for big data and real-time
applications.
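To make the dynamic-schema point concrete, here is a small plain-Python sketch (no particular database assumed; the product records and field names are invented for illustration): two records can sit in the same collection even though they do not share the same fields.

# Two records in the same logical collection; neither is forced to match a
# predeclared schema. The second record adds fields the first never had.
products = [
    {"_id": 1, "name": "Laptop", "price": 999.00},
    {"_id": 2, "name": "Phone", "price": 499.00,
     "colors": ["black", "silver"],                 # new array field
     "specs": {"ram_gb": 8, "storage_gb": 128}},    # new nested object
]

# Application code simply tolerates missing fields instead of altering a schema.
for p in products:
    colors = p.get("colors", [])
    print(p["name"], colors)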
Aggregate Data Models
 An aggregate is a collection of objects that are treated as a unit. In NoSQL Databases, an aggregate is a collection of data that interacts as a unit, and these units of data form the boundaries for ACID operations.
 Aggregate Data Models in NoSQL make it easier for the Database to manage data storage over a cluster, since an aggregate can reside as a unit on any of the machines. Whenever an aggregate is retrieved from the Database, all of its data comes along with it.
 Aggregate Data Models in NoSQL generally don't support ACID transactions that span multiple aggregates; atomicity is guaranteed only within a single aggregate, so one of the ACID properties is sacrificed across aggregates. With Aggregate Data Models in NoSQL, you can still perform analytical (OLAP-style) operations on the Database.
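As a sketch of what an aggregate looks like, the order below is a plain Python dict whose field names are purely illustrative; it bundles the line items and payment details into one unit that can live together on a single node.

# A single "order" aggregate: everything needed to process the order travels
# together, which is what makes it a natural unit for distribution.
order = {
    "order_id": "o-1001",
    "customer_id": "c-42",
    "line_items": [
        {"product_id": "p-1", "name": "Database Refactoring", "quantity": 2, "price": 30.00},
        {"product_id": "p-7", "name": "NoSQL Distilled", "quantity": 1, "price": 25.00},
    ],
    "payment": {"type": "card", "billing_address": "..."},
}

order_total = sum(li["quantity"] * li["price"] for li in order["line_items"])
print(order_total)  # 85.0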
Types of Aggregate Data Models
 Key-Value Model
 Document Model
 Column Family Model
 Graph-Based Model
Key-Value Model
 The Key-Value Data Model contains the key or an ID used
to access or fetch the data of the aggregates
corresponding to the key.
 In this Aggregate Data Model, the value of an aggregate is opaque to the database: it can only be looked up and retrieved using its Key.
 They are a simpler type of database where each item
contains keys and values.
 A value can typically only be retrieved by referencing
its key, so learning how to query for a specific key-
value pair is typically simple.
 Key-value databases are great for use cases where
you need to store large amounts of data but you don’t
need to perform complex queries to retrieve it.
 Common use cases include storing user preferences or caching. Redis and DynamoDB are popular key-value databases.
 examples of key-value databases:
 Couchbase: It permits SQL-style querying and
searching for text.
 Amazon DynamoDB: The key-value database which is
mostly used is Amazon DynamoDB as it is a trusted
database used by a large number of users. It can easily
handle a large number of requests every day and it also
provides various security options.
 Riak: It is a distributed key-value database used to build highly available applications.
 Aerospike: It is an open-source, real-time database that can handle billions of transactions.
 Berkeley DB: It is a high-performance and open-source
database providing scalability.
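A hedged sketch of key-value access using the redis-py client; it assumes a Redis server running locally on the default port, and the key names and stored values are invented for illustration.

import json
import redis

# Assumes a Redis server on localhost:6379; adjust connection details as needed.
r = redis.Redis(host="localhost", port=6379, db=0)

# Store a user's preferences as an opaque value under a key we choose.
r.set("user:42:prefs", json.dumps({"theme": "dark", "language": "en"}))

# The only way to get the value back is by its key.
prefs = json.loads(r.get("user:42:prefs"))
print(prefs["theme"])

# Typical caching pattern: set a value with a time-to-live of 60 seconds.
r.setex("session:abc123", 60, "cached-session-payload")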
Document Model
 Document databases store data in documents similar to JSON (JavaScript Object Notation) objects.
 Each document contains pairs of fields and values.
 The values can typically be a variety of types including things like
strings, numbers, booleans, arrays, or objects, and their structures
typically align with objects developers are working with in code.
 Because of their variety of field value types and powerful query
languages, document databases are great for a wide variety of use cases
and can be used as a general purpose database.
 They can horizontally scale-out to accommodate large data volumes.
 MongoDB is consistently ranked as the world’s most popular NoSQL
database according to DB-engines and is an example of a document
database.
 Examples of Document Data Models :
 Amazon DocumentDB
 MongoDB
 Cosmos DB
 ArangoDB
 Couchbase Server
 CouchDB
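A hedged sketch of storing and querying documents with MongoDB through the PyMongo driver; the connection string, database, collection, and document fields are assumptions made for illustration.

from pymongo import MongoClient

# Assumes a MongoDB instance on localhost; adjust the URI for your deployment.
client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]

# Documents with different shapes can live in the same collection.
orders.insert_one({
    "customer": "Ann",
    "items": [{"product": "Database Refactoring", "qty": 2, "price": 30.00}],
    "status": "shipped",
})

# Query on nested fields without joins; the whole aggregate comes back.
for doc in orders.find({"items.product": "Database Refactoring"}):
    print(doc["customer"], doc["status"])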
Column Family Model
 Column-family (wide-column) databases store data in tables, rows, and dynamic columns.
 Wide-column stores provide a lot of flexibility over
relational databases because each row is not required to
have the same columns.
 Many consider wide-column stores to be two-
dimensional key-value databases.
 Wide-column stores are great for when you need to store
large amounts of data and you can predict what your
query patterns will be.
 Wide-column stores are commonly used for storing
Internet of Things data and user profile data.
 Cassandra and HBase are two of the most popular wide-
column stores.
 The first level of a column-family store contains keys that act as row identifiers, used to select the aggregate data, whereas the second-level values are referred to as columns.
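A hedged sketch of a column-family table using the DataStax Python driver for Cassandra; it assumes a local node and an existing keyspace named iot, and the table and column names are invented.

from cassandra.cluster import Cluster

# Assumes a local Cassandra node and an existing keyspace named "iot".
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("iot")

# Rows are grouped by partition key (sensor_id); columns can vary per row.
session.execute("""
    CREATE TABLE IF NOT EXISTS readings (
        sensor_id text,
        reading_time timestamp,
        temperature double,
        PRIMARY KEY (sensor_id, reading_time)
    )
""")

session.execute(
    "INSERT INTO readings (sensor_id, reading_time, temperature) "
    "VALUES (%s, toTimestamp(now()), %s)",
    ("sensor-1", 21.5),
)

# The query pattern is known up front: all readings for one sensor.
for row in session.execute("SELECT * FROM readings WHERE sensor_id = %s", ("sensor-1",)):
    print(row.reading_time, row.temperature)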
Graph-Based Model
 Graph or network data models consider the
relationship between two pieces of information to be
as meaningful as the information itself.
 As such, this data model is really made for any information you would typically represent in a graph.
 It uses relationships and nodes, where the data is the
information itself, and the connection is created
between the nodes.
 Graph databases store data in nodes and edges.
 Nodes typically store information about people,
places, and things while edges store information
about the relationships between the nodes.
 Graph databases excel in use cases where you need
to traverse relationships to look for patterns such as
social networks, fraud detection, and
recommendation engines.
 Neo4j and JanusGraph are examples of graph
databases.
 Graph-based Data Models are used in social
networking sites to store interconnections.
 It is used in fraud detection systems.
 This Data Model is also widely used in Networks and
IT operations.
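A hedged sketch of creating nodes and relationships and traversing them with the Neo4j Python driver; the connection URI, credentials, labels, and relationship type are assumptions for illustration.

from neo4j import GraphDatabase

# Assumes a local Neo4j instance; URI and credentials are placeholders.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Nodes hold the data; the FRIENDS_WITH relationship is data in its own right.
    session.run(
        "MERGE (a:Person {name: $a}) MERGE (b:Person {name: $b}) "
        "MERGE (a)-[:FRIENDS_WITH]->(b)",
        a="Alice", b="Bob",
    )
    session.run(
        "MERGE (a:Person {name: $a}) MERGE (b:Person {name: $b}) "
        "MERGE (a)-[:FRIENDS_WITH]->(b)",
        a="Bob", b="Carol",
    )

    # Traverse relationships (friends and friends-of-friends) in one query.
    result = session.run(
        "MATCH (:Person {name: $name})-[:FRIENDS_WITH*1..2]->(p) "
        "RETURN DISTINCT p.name AS name",
        name="Alice",
    )
    print([record["name"] for record in result])  # e.g. ['Bob', 'Carol']

driver.close()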
Data modeling details
Distribution Models
 The primary driver of interest in NoSQL has been its
ability to run databases on a large cluster.
 As data volumes increase, it becomes more difficult
and expensive to scale up—buy a bigger server to run
the database on.
 A more appealing option is to scale out—run the
database on a cluster of servers.
 Aggregate orientation fits well with scaling out
because the aggregate is a natural unit to use for
distribution.
Distribution techniques
 Replication (master-slave or peer-to-peer)
 Sharding
Single Server
 NO distribution at all
 Run the database on a single machine that handles
all the reads and writes to the data store
 it eliminates all the complexities
 it’s easy for operations people to manage and easy
for application developers to reason about
 Replication
 Replication takes the same data and copies it over
multiple nodes.
 Two Types:
 master-slave
 peer-to-peer
Master-Slave Replication
 Replicate data across multiple nodes.
 One node is designated as the master, or primary.
 This master is the authoritative source for the data
and is usually responsible for processing any updates
to that data.
 The other nodes are slaves, or secondaries.
 A replication process synchronizes the slaves with
the master
 Advantages
 read-intensive dataset.
 You can scale horizontally to handle more read requests
by adding more slave nodes and ensuring that all read
requests are routed to the slaves
 read resilience
 Should the master fail, the slaves can still handle
read requests. Again, this is useful if most of
your data access is reads. The failure of the
master does eliminate the ability to handle
writes until either the master is restored or a
new master is appointed.
 appoint a slave to replace a failed master
 The ability to appoint a slave to replace a failed master means that master-slave replication is useful even if you don't need to scale out. All read and write traffic can go to the master while the slave acts as a hot backup.
 Masters can be appointed manually or
automatically
 Manual appointing typically means that when you configure
your cluster, you configure one node as the master.
 With automatic appointment, you create a cluster of nodes
and they elect one of themselves to be the master.
 Disadvantages:
 Inconsistency
 You have the danger that different clients, reading different
slaves, will see different values because the changes haven’t all
propagated to the slaves.
 if the master fails, any updates not passed on to the
backup are lost.
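To make the read-scaling advantage concrete: with a replica-set-aware client such as PyMongo, reads can be routed to the slaves (secondaries) while writes still go to the master (primary). The host names, replica set name, and collection below are placeholders, so treat this as a sketch rather than a prescribed setup.

from pymongo import MongoClient

# Placeholders: host names and replica set name depend on your deployment.
client = MongoClient(
    "mongodb://node1:27017,node2:27017,node3:27017/?replicaSet=rs0",
    readPreference="secondaryPreferred",  # reads go to secondaries when available
)

db = client["shop"]

# Writes are always routed to the master/primary.
db.orders.insert_one({"customer": "Ann", "total": 85.0})

# Reads may be served by a secondary, which can return slightly stale data.
print(db.orders.count_documents({"customer": "Ann"}))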
Peer-to-Peer Replication
 Peer-to-peer replication attacks these problems by
not having a master.
 All the replicas have equal weight, they can all accept
writes, and the loss of any of them doesn’t prevent
access to the data store.
 With a peer-to-peer replication cluster, you can ride
over node failures without losing access to data.
 Furthermore, you can easily add nodes to improve
your performance.
 Problem: Inconsistency:
 When you can write to two different places, you run
the risk that two people will attempt to update the
same record at the same time—a write-write conflict.
 Inconsistencies on read lead to problems but at least
they are relatively transient.
 Inconsistent writes are forever.
 Solutions:
 1. Whenever we write data, the replicas coordinate to ensure they avoid a conflict. We don't need all the replicas to agree on the write, just a majority, so we can still survive losing a minority of the replica nodes.
 2. We can decide to cope with an inconsistent write. There are contexts in which we can come up with a policy to merge inconsistent writes. In this case we get the full performance benefit of writing to any replica.
Sharding
 Often, a busy data store is busy because different people are accessing different parts of the dataset.
 In these circumstances we can support horizontal
scalability by putting different parts of the data onto
different servers—a technique that’s called
sharding
 In the ideal case, we have different users all talking
to different server nodes. Each user only has to talk
to one server, so gets rapid responses from that
server.
 The load is balanced out nicely between servers
 for example, if we have ten servers, each one only has to
handle 10% of the load.
 In order to get close to this ideal case, we have to ensure that data that's accessed together is clumped together on the same node, and that these clumps are arranged on the nodes to provide the best data access.
 Aggregates are designed to combine data that's commonly accessed together, so aggregates leap out as an obvious unit of distribution.
 If most accesses of certain aggregates are based on a physical location, you can place the data close to where it's being accessed.
 Another factor is trying to keep the load even. This
means that you should try to arrange aggregates so
they are evenly distributed across the nodes which all
get equal amounts of the load.
 Many NoSQL databases offer auto-sharding,
where the database takes on the responsibility
of allocating data to shards and ensuring that data
access goes to the right shard.
 Sharding improves both read and write performance; in particular, it provides a way to horizontally scale writes.
 However, a node failure makes that shard's data unavailable, and it's not good to have a database with part of its data missing.
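A minimal sketch of how a client or router might map an aggregate's key to a shard; real auto-sharding databases do this (plus rebalancing) internally, and the shard names here are invented.

import hashlib

# Invented node list; real systems also rebalance shards when nodes join or leave.
SHARDS = ["shard-a", "shard-b", "shard-c", "shard-d"]

def shard_for(key: str) -> str:
    """Pick a shard by hashing the aggregate's key, so the same key always
    lands on the same node and the load spreads roughly evenly."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# All data for one customer aggregate is kept together on one shard.
print(shard_for("customer:42"))
print(shard_for("customer:43"))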
Combining Sharding and Replication
 If we use both master-slave replication and sharding
this means that we have multiple masters, but each
data item only has a single master.
 Depending on your configuration, you may choose a
node to be a master for some data and slaves for
others, or you may dedicate nodes for master or slave
duties.
Using master-slave replication together with sharding
 Using peer-to-peer replication and sharding is a
common strategy for column-family databases.
 In a scenario like this you might have tens or
hundreds of nodes in a cluster with data sharded
over them.
 A good starting point for peer-to-peer replication is
to have a replication factor of 3, so each shard is
present on three nodes.
 Should a node fail, then the shards on that node will be rebuilt on the other nodes.
Using peer-to-peer replication together with sharding
Consistency
 Biggest change from a centralized relational database
to a cluster oriented NoSQL
 Relational databases: strong consistency
 NoSQL systems: mostly eventual consistency
 Update Consistency
 Two people update the same data item at about the same time, each using a slightly different format.
 This is called a write-write conflict: two people updating the same data item at the same time.
 When the writes reach the server, the server will serialize them: decide to apply one, then the other.
 If the server simply serializes the two updates, one is applied and then immediately overwritten by the other (a lost update).
 Read-read (or simply read) conflict:
 Different people see different data at the same time
 Stale data: out of date
 Replication is a source of inconsistency
 Read-write conflict
 A read in the middle of two logically-related writes
 Solutions
 Pessimistic approach
 Prevent conflicts from occurring.
 Usually implemented with write locks managed by the system
 Optimistic approach
 Lets conflicts occur, but detects them and takes action to sort them
out
 Approaches (for write-write conflicts):
 conditional updates: test the value just before updating
 save both updates: record that they are in conflict and then
merge them
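A small sketch of the conditional-update approach using a version number and an in-memory dictionary standing in for the data store; the key and value names are invented.

# In-memory stand-in for a data store that keeps a version alongside each value.
store = {"room-101": {"version": 1, "value": "available"}}

class ConflictError(Exception):
    pass

def conditional_update(key, expected_version, new_value):
    """Apply the write only if nobody else updated the item since we read it."""
    record = store[key]
    if record["version"] != expected_version:
        # Someone else won the write-write race; the caller must re-read and retry.
        raise ConflictError(f"{key} changed (now at version {record['version']})")
    store[key] = {"version": expected_version + 1, "value": new_value}

# Two clients read version 1; only the first conditional update succeeds.
conditional_update("room-101", 1, "booked-by-martin")
try:
    conditional_update("room-101", 1, "booked-by-pramod")
except ConflictError as e:
    print("detected conflict:", e)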
 Pessimistic v/s optimistic approach
 Concurrency involves a fundamental tradeoff between:
 consistency (avoiding errors such as update conflicts) and
 availability (responding quickly to clients).
 Pessimistic approaches often:
 severely degrade the responsiveness of a system
 lead to deadlocks, which are hard to prevent and debug.
 Forms of consistency
 Strong (or immediate) consistency :
 ACID transaction
 Logical consistency :
 No read-write conflicts (atomic transactions)
 Sequential consistency :
 Updates are serialized
 Session (or read-your-writes) consistency
 Within a user’s session
 Eventual consistency
 You may have replication inconsistencies but eventually all nodes
will be updated to the same value
 Relaxing Consistency
 Consistency is a Good Thing—but, sadly, sometimes
we have to sacrifice it.
 It is always possible to design a system to avoid
inconsistencies, but often impossible to do so
without making unbearable sacrifices in other
characteristics of the system.
 As a result, we often have to tradeoff consistency for
something else.
 Trading off consistency is a familiar concept even in
single-server relational database systems
 The tool for enforcing consistency there is the transaction, and transactions can provide strong consistency guarantees.
 Transaction systems usually come with the ability to relax isolation levels, allowing queries to read data that hasn't been committed yet, and in practice we see most applications relax consistency down from the highest isolation level (serializable) in order to get effective performance.
 We most commonly see people using the read-committed
transaction level, which eliminates some read-write
conflicts but allows others
The CAP Theorem
 Proposed by Eric Brewer in 2000 and given a formal proof by Seth Gilbert and Nancy Lynch [Lynch and Gilbert] a couple of years later.
 “Given the three properties of Consistency,
Availability, and Partition tolerance, you can only get
two.”
 CAP states that in case of failures you can have at
most two of these three properties for any shared-
data system
 To scale out, you have to distribute resources.
 P is not really an option but rather a need
 The real choice is between Consistency and Availability
 In almost all cases, you would choose availability over
consistency.
 Consistency:
 all people see the same data at the same time
 Availability:
 guarantee that every request receives a response about
whether it was successful or failed.
 However, it does not guarantee that a read request returns the
most recent write.
 The more users a system can cater to, the better its availability.
 Partition tolerance:
 The system continues to operate despite communication
breakages that separate the cluster into partitions unable to
communicate with each other
 Out of these three guarantees, no system can provide
more than 2 guarantees.
 Since in the case of distributed systems, the
partitioning of the network is a must, the tradeoff is
always between consistency and availability.
With two breaks in the communication lines, the network
partitions into two groups.
 Martin and Pramod are both trying to book the last hotel
room on a system that uses peer-to-peer distribution
with two nodes (London for Martin and Mumbai for
Pramod).
 If we want to ensure consistency, then when Martin tries
to book his room on the London node, that node must
communicate with the Mumbai node before confirming
the booking.
 Essentially, both nodes must agree on the serialization of
their requests.
 This gives us consistency—but should the network link
break, then neither system can book any hotel room,
sacrificing availability.
 One way to improve availability is to designate one
node as the master for a particular hotel and ensure
all bookings are processed by that master. Should
that master be Mumbai, then Mumbai can still
process hotel bookings for that hotel and Pramod
will get the last room, whereas Martin can see the
inconsistent room information but cannot make a
booking (which would in this case cause an update
inconsistency). This is a lack of availability for
Martin.
 To gain more availability, we might allow both
systems to keep accepting hotel reservations even if a
link in the network breaks down.
 But this may cause both Martin and Pramod to book the same room, leading to inconsistency.
 In this domain, however, that might be tolerated: the travel company may tolerate some overbooking; some hotels might always keep a few rooms clear even when they are fully booked; some hotels might even cancel the booking with an apology once they detect the conflict.
 CA (Consistency and Availability)-
 The system provides consistency and availability, but only as long as there is no network partition; in practice this describes a non-distributed, single-server setup.
 Example databases: traditional single-server relational databases such as MySQL or PostgreSQL.
 AP (Availability and Partition Tolerance)-
 The system prioritizes availability over consistency and can respond with possibly stale data.
 The system can be distributed across multiple nodes and is designed to operate reliably even in the face of network partitions.
 Example databases: Amazon DynamoDB, Cassandra, CouchDB, Riak, Voldemort.
 CP(Consistency and Partition Tolerance)-
 The system prioritizes consistency over availability and
responds with the latest updated data.
 The system can be distributed across multiple nodes and is
designed to operate reliably even in the face of network
partitions.
 Example databases: Apache HBase, MongoDB, Redis, Google Cloud Spanner.
Version Stamps
 Provide a means of detecting concurrency conflicts
 Each data item has a version stamp which gets incremented
each time the item is updated
 Before updating a data item, a process can check its version
stamp to see if it has been updated since it was last read
 Implementation methods
 Counter – requires a single master to “own” the counter
 GUID (Globally Unique Identifier) – can be computed by any node, but GUIDs are large and cannot be compared directly for recentness
 Hash the contents of a resource
 Timestamp of last update – node clocks must be synchronized
 Counter – requires a single master to “own” the
counter
 Always incrementing it when you update the resource.
 Counters are useful since they make it easy to tell if one version is
more recent than another.
 On the other hand, they require the server to generate the counter
value, and also need a single master to ensure the counters aren’t
duplicated.
 For example:
 R1 to R6 are replicas and R7 is the master (M1). C is the counter variable, and its value is 3. The database item is a = 1, and all replicas hold this value. The following figure shows all replicas with a = 1 and C = 3.
 Now, replica R3 wants to update the database item value to 6, i.e., a = 6. So, the counter value must be incremented first.
 The server at the master increments the counter by 1, i.e., C = 4. The new values of the database item and the counter are reflected at replica R3 and at the master.
 Since C = 4 is now the highest counter value, this update is the most recent.
 The master then propagates this update to all replicas.
 If any other replica wants to update the database item, its counter value must first be updated and communicated to the master, and the master then communicates it to the other replicas.
 GUID (Globally Unique Identifier) – can be computed by any node, but GUIDs are large and cannot be compared directly.
 A GUID is a large random number that is guaranteed to be unique.
 This random number is created by combining things such as dates, hardware information, or other inputs.
 These GUIDs (random numbers) will never be the same.
 The disadvantage is that the random numbers are very large, so comparing them to determine which is more recent is difficult.
 Hashing -Hash the contents of a resource
 With a big enough hash key size, a content hash can be globally
unique like a GUID and can also be generated by anyone;
 Advantage is that they are deterministic—any node will
generate the same content hash for same resource data.
 However, like GUIDs they can’t be directly compared for
recentness, and they can be lengthy
 For example, consider database item a = 1, with replicas R1 to R6 and replica R7 as the master. Replica R1 wants to update database item a to the value 43. Using a simple hash function of mod 10 (10 is the hash key), it generates a bucket: 43 % 10 = 3. It is then easy to see that replica R1 holds the most recent update.
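A sketch of a content-hash version stamp using Python's hashlib; it shows that any node can compute the same stamp deterministically, but that two stamps cannot be ordered to tell which is more recent.

import hashlib
import json

def content_hash(resource: dict) -> str:
    """Deterministic hash of a resource's contents; identical data on any
    node yields the same stamp (sort_keys makes the serialization canonical)."""
    canonical = json.dumps(resource, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

v1 = content_hash({"a": 1})
v2 = content_hash({"a": 43})

print(v1 != v2)  # True: any change produces a different stamp
# Note: unlike counters, v1 and v2 cannot be compared to decide which is more recent.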
 Timestamp of last update – node clocks must be
synchronized
 Whenever an update is made, the timestamp is recorded and checked.
 Timestamps work the same way as counters, and they can be compared for recentness.
 In this scheme, many replicas (machines) can generate timestamps.
 Keep in mind that all machines' clocks must be synchronized with each other.
 If any replica has a bad clock (or its clock is not kept properly in synchronization), data corruption problems will arise.
 Database administrators must also check the granularity of timestamps, otherwise timestamps can end up duplicated. Timestamps work well when millisecond precision is used.
 Now, replica R2 wants to update the database item value to 4, i.e., a = 4. R2's timestamp is recorded with millisecond precision, e.g. TS = 3-25-22 02:31:29,571. It is compared with the other replicas' timestamps, and since 571 > 570 (at millisecond precision), R2 contains the most up-to-date value.
Map-Reduce
 When you have a cluster, you have lots of machines
to spread the computation over.
 However, you also still need to try to reduce the
amount of data that needs to be transferred across
the network.
 The map-reduce pattern is a way to organize processing so as to take advantage of multiple machines on a cluster while keeping as much of the processing and the data it needs together on the same machine.
 This programming model gained prominence with
Google’s MapReduce framework [Dean and
Ghemawat, OSDI-04].
 A widely used open-source implementation is part of
the Apache Hadoop project.
 The name “map-reduce” reveals its inspiration from
the map and reduce operations on collections in
functional programming languages
 Map-Reduce benefits
 Complex details are abstracted away from the developer:
– No file I/O
– No networking code
– No synchronization
 It's scalable because you process one record at a time
 A record consists of a key and a corresponding value
 Example:
 Let us consider the usual scenario of customers and orders.
 We have chosen order as our aggregate, with each order having line items.
 Each line item has a product ID, a quantity, and the price charged.
 Sales analysis people want to see a product and its total revenue for the last seven days.
 In order to get the product revenue report, you’ll
have to visit every machine in the cluster and
examine many records on each machine.
 This is exactly the kind of situation that calls for
map-reduce. Again, the first stage in a map-reduce
job is the map.
 A map is a function whose input is a single aggregate
and whose output is a bunch of key-value pairs.
 In this case, the input would be an order, and the
output would be key-value pairs corresponding to
the line items
 For this example, we are just selecting a value out of
the record, but there’s no reason why we can’t carry
out some arbitrarily complex function as part of the
map—providing it only depends on one aggregate’s
worth of data.
 Each such pair would have the product ID as the key and an embedded
map with the quantity and price as the value
 The reduce function takes multiple map outputs with the same key and
combines their values
 A map function might yield 1000 line items from
orders for “Database Refactoring”; the reduce
function would reduce down to one, with the totals
for the quantity and revenue.
 While the map function is limited to working only on
data from a single aggregate, the reduce function can
use all values emitted for a single key
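A minimal single-process sketch of the map and reduce steps for this product-revenue example; the order and line-item field names are illustrative, and a real framework would run the mappers and reducers in parallel across the cluster.

from collections import defaultdict

# Each input to the map function is one order aggregate.
orders = [
    {"order_id": 1, "line_items": [
        {"product_id": "db-refactoring", "quantity": 2, "price": 30.00},
        {"product_id": "nosql-distilled", "quantity": 1, "price": 25.00}]},
    {"order_id": 2, "line_items": [
        {"product_id": "db-refactoring", "quantity": 1, "price": 30.00}]},
]

def map_order(order):
    """Emit (product_id, {quantity, revenue}) pairs from a single aggregate."""
    for li in order["line_items"]:
        yield li["product_id"], {"quantity": li["quantity"],
                                 "revenue": li["quantity"] * li["price"]}

def reduce_product(product_id, values):
    """Combine all values emitted for one key into a single total."""
    total = {"quantity": 0, "revenue": 0.0}
    for v in values:
        total["quantity"] += v["quantity"]
        total["revenue"] += v["revenue"]
    return product_id, total

# Group the mapper output by key, then reduce each group.
grouped = defaultdict(list)
for order in orders:
    for key, value in map_order(order):
        grouped[key].append(value)

print([reduce_product(k, vs) for k, vs in grouped.items()])
# [('db-refactoring', {'quantity': 3, 'revenue': 90.0}),
#  ('nosql-distilled', {'quantity': 1, 'revenue': 25.0})]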
Partitioning and Combining
 To increase parallelism, we can also partition the output of the
mappers and send each partition to a different reducer
(“shuffling”)
 To take advantage of this, the results of the mapper are divided up based on the key on each processing node.
 Typically, multiple keys are grouped together into partitions.
 The framework then takes the data from all the nodes for one
partition, combines it into a single group for that partition,
and sends it off to a reducer.
 Multiple reducers can then operate on the partitions in
parallel, with the final results merged together.
 (This step is also called “shuffling,” and the partitions are
sometimes referred to as “buckets” or “regions.”)
 The next problem we can deal with is the amount of data
being moved from node to node between the map and
reduce stages.
 Much of this data is repetitive, consisting of multiple key-
value pairs for the same key.
 A combiner function cuts this data down by combining all the data for the same key into a single value. A combiner function is, in essence, a reducer function; indeed, in many cases the same function can be used for combining as the final reduction.
 The reduce function needs a special shape for this to
work: Its output must match its input. We call such a
function a combinable reducer.
 When you have combining reducers, the map-reduce
framework can safely run not only in parallel (to reduce
different partitions), but also in series to reduce the same
partition at different times and places.
 In addition to allowing combining to occur on a node before
data transmission, you can also start combining before
mappers have finished.
 This provides a good bit of extra flexibility to the map-reduce
processing. Some map-reduce frameworks require all
reducers to be combining reducers, which maximizes this
flexibility.
 If you need to do a noncombining reducer with one of these
frameworks, you’ll need to separate the processing into
pipelined map-reduce steps.
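Continuing the earlier sketch, a combinable reducer's output has the same shape as its input, so it can be applied on each node before transmission and then again for the final reduction; the quantity/revenue shape is carried over from the product-revenue example above.

def combine(values):
    """Combinable reducer: input and output have the same shape, so it can be
    applied on each node before transmission and again at the final reduce."""
    total = {"quantity": 0, "revenue": 0.0}
    for v in values:
        total["quantity"] += v["quantity"]
        total["revenue"] += v["revenue"]
    return total

# Partial totals computed on two different nodes for the same key...
node_a = combine([{"quantity": 2, "revenue": 60.0}, {"quantity": 1, "revenue": 30.0}])
node_b = combine([{"quantity": 1, "revenue": 30.0}])

# ...can be combined again because the output shape matches the input shape.
print(combine([node_a, node_b]))  # {'quantity': 4, 'revenue': 120.0}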

More Related Content

Similar to 2.Introduction to NOSQL (Core concepts).pptx

Vskills Apache Cassandra sample material
Vskills Apache Cassandra sample materialVskills Apache Cassandra sample material
Vskills Apache Cassandra sample materialVskills
 
NOSQL Databases types and Uses
NOSQL Databases types and UsesNOSQL Databases types and Uses
NOSQL Databases types and UsesSuvradeep Rudra
 
Unit II -BIG DATA ANALYTICS.docx
Unit II -BIG DATA ANALYTICS.docxUnit II -BIG DATA ANALYTICS.docx
Unit II -BIG DATA ANALYTICS.docxvvpadhu
 
Non relational databases-no sql
Non relational databases-no sqlNon relational databases-no sql
Non relational databases-no sqlRam kumar
 
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMINGEVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMINGijiert bestjournal
 
data base system to new data science lerne
data base system to new data science lernedata base system to new data science lerne
data base system to new data science lernetarunprajapati0t
 
NoSQL_Databases
NoSQL_DatabasesNoSQL_Databases
NoSQL_DatabasesRick Perry
 
Nosql Presentation.pdf for DBMS understanding
Nosql Presentation.pdf for DBMS understandingNosql Presentation.pdf for DBMS understanding
Nosql Presentation.pdf for DBMS understandingHUSNAINAHMAD39
 

Similar to 2.Introduction to NOSQL (Core concepts).pptx (20)

Vskills Apache Cassandra sample material
Vskills Apache Cassandra sample materialVskills Apache Cassandra sample material
Vskills Apache Cassandra sample material
 
No sq lv2
No sq lv2No sq lv2
No sq lv2
 
NOSQL Databases types and Uses
NOSQL Databases types and UsesNOSQL Databases types and Uses
NOSQL Databases types and Uses
 
No sql
No sqlNo sql
No sql
 
Unit II -BIG DATA ANALYTICS.docx
Unit II -BIG DATA ANALYTICS.docxUnit II -BIG DATA ANALYTICS.docx
Unit II -BIG DATA ANALYTICS.docx
 
Non relational databases-no sql
Non relational databases-no sqlNon relational databases-no sql
Non relational databases-no sql
 
NoSQL Basics and MongDB
NoSQL Basics and  MongDBNoSQL Basics and  MongDB
NoSQL Basics and MongDB
 
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMINGEVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING
 
the rising no sql technology
the rising no sql technologythe rising no sql technology
the rising no sql technology
 
No sql database
No sql databaseNo sql database
No sql database
 
Report 1.0.docx
Report 1.0.docxReport 1.0.docx
Report 1.0.docx
 
Unit 3 MongDB
Unit 3 MongDBUnit 3 MongDB
Unit 3 MongDB
 
data base system to new data science lerne
data base system to new data science lernedata base system to new data science lerne
data base system to new data science lerne
 
NoSQL_Databases
NoSQL_DatabasesNoSQL_Databases
NoSQL_Databases
 
Nosql Presentation.pdf for DBMS understanding
Nosql Presentation.pdf for DBMS understandingNosql Presentation.pdf for DBMS understanding
Nosql Presentation.pdf for DBMS understanding
 
Nosql
NosqlNosql
Nosql
 
Nosql
NosqlNosql
Nosql
 
All About Database v1.1
All About Database  v1.1All About Database  v1.1
All About Database v1.1
 
nosql.pptx
nosql.pptxnosql.pptx
nosql.pptx
 
NoSql
NoSqlNoSql
NoSql
 

More from RushikeshChikane2

3.Implementation with NOSQL databases Document Databases (Mongodb).pptx
3.Implementation with NOSQL databases Document Databases (Mongodb).pptx3.Implementation with NOSQL databases Document Databases (Mongodb).pptx
3.Implementation with NOSQL databases Document Databases (Mongodb).pptxRushikeshChikane2
 
Chapter 2 System Security.pptx
Chapter 2 System Security.pptxChapter 2 System Security.pptx
Chapter 2 System Security.pptxRushikeshChikane2
 
Security Architectures and Models.pptx
Security Architectures and Models.pptxSecurity Architectures and Models.pptx
Security Architectures and Models.pptxRushikeshChikane2
 
Social Media and Text Analytics
Social Media and Text AnalyticsSocial Media and Text Analytics
Social Media and Text AnalyticsRushikeshChikane2
 
Mining Frequent Patterns, Associations, and.pptx
 Mining Frequent Patterns, Associations, and.pptx Mining Frequent Patterns, Associations, and.pptx
Mining Frequent Patterns, Associations, and.pptxRushikeshChikane2
 
Machine Learning Overview.pptx
Machine Learning Overview.pptxMachine Learning Overview.pptx
Machine Learning Overview.pptxRushikeshChikane2
 
Chapter 4_Introduction to Patterns.ppt
Chapter 4_Introduction to Patterns.pptChapter 4_Introduction to Patterns.ppt
Chapter 4_Introduction to Patterns.pptRushikeshChikane2
 
Chapter 3_Architectural Styles.pptx
Chapter 3_Architectural Styles.pptxChapter 3_Architectural Styles.pptx
Chapter 3_Architectural Styles.pptxRushikeshChikane2
 
Chapter 2_Software Architecture.ppt
Chapter 2_Software Architecture.pptChapter 2_Software Architecture.ppt
Chapter 2_Software Architecture.pptRushikeshChikane2
 
Chapter 1_UML Introduction.ppt
Chapter 1_UML Introduction.pptChapter 1_UML Introduction.ppt
Chapter 1_UML Introduction.pptRushikeshChikane2
 

More from RushikeshChikane2 (10)

3.Implementation with NOSQL databases Document Databases (Mongodb).pptx
3.Implementation with NOSQL databases Document Databases (Mongodb).pptx3.Implementation with NOSQL databases Document Databases (Mongodb).pptx
3.Implementation with NOSQL databases Document Databases (Mongodb).pptx
 
Chapter 2 System Security.pptx
Chapter 2 System Security.pptxChapter 2 System Security.pptx
Chapter 2 System Security.pptx
 
Security Architectures and Models.pptx
Security Architectures and Models.pptxSecurity Architectures and Models.pptx
Security Architectures and Models.pptx
 
Social Media and Text Analytics
Social Media and Text AnalyticsSocial Media and Text Analytics
Social Media and Text Analytics
 
Mining Frequent Patterns, Associations, and.pptx
 Mining Frequent Patterns, Associations, and.pptx Mining Frequent Patterns, Associations, and.pptx
Mining Frequent Patterns, Associations, and.pptx
 
Machine Learning Overview.pptx
Machine Learning Overview.pptxMachine Learning Overview.pptx
Machine Learning Overview.pptx
 
Chapter 4_Introduction to Patterns.ppt
Chapter 4_Introduction to Patterns.pptChapter 4_Introduction to Patterns.ppt
Chapter 4_Introduction to Patterns.ppt
 
Chapter 3_Architectural Styles.pptx
Chapter 3_Architectural Styles.pptxChapter 3_Architectural Styles.pptx
Chapter 3_Architectural Styles.pptx
 
Chapter 2_Software Architecture.ppt
Chapter 2_Software Architecture.pptChapter 2_Software Architecture.ppt
Chapter 2_Software Architecture.ppt
 
Chapter 1_UML Introduction.ppt
Chapter 1_UML Introduction.pptChapter 1_UML Introduction.ppt
Chapter 1_UML Introduction.ppt
 

Recently uploaded

Design and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data ScienceDesign and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data SciencePaolo Missier
 
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....rightmanforbloodline
 
Simplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptxSimplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptxMarkSteadman7
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...TrustArc
 
Navigating Identity and Access Management in the Modern Enterprise
Navigating Identity and Access Management in the Modern EnterpriseNavigating Identity and Access Management in the Modern Enterprise
Navigating Identity and Access Management in the Modern EnterpriseWSO2
 
Modernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using BallerinaModernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using BallerinaWSO2
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxRemote DBA Services
 
Decarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational PerformanceDecarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational PerformanceIES VE
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
The Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightThe Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightSafe Software
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
ChatGPT and Beyond - Elevating DevOps Productivity
ChatGPT and Beyond - Elevating DevOps ProductivityChatGPT and Beyond - Elevating DevOps Productivity
ChatGPT and Beyond - Elevating DevOps ProductivityVictorSzoltysek
 
How to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cfHow to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cfdanishmna97
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 

Recently uploaded (20)

Design and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data ScienceDesign and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data Science
 
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
 
Simplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptxSimplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptx
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...
 
Navigating Identity and Access Management in the Modern Enterprise
Navigating Identity and Access Management in the Modern EnterpriseNavigating Identity and Access Management in the Modern Enterprise
Navigating Identity and Access Management in the Modern Enterprise
 
Modernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using BallerinaModernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using Ballerina
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Decarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational PerformanceDecarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational Performance
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
The Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightThe Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and Insight
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
ChatGPT and Beyond - Elevating DevOps Productivity
ChatGPT and Beyond - Elevating DevOps ProductivityChatGPT and Beyond - Elevating DevOps Productivity
ChatGPT and Beyond - Elevating DevOps Productivity
 
How to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cfHow to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cf
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 

2.Introduction to NOSQL (Core concepts).pptx

  • 1. I N T R O D U C T I O N T O N O S Q L Chapter 2
  • 2. What is a NoSQL Database?  NoSQL is a type of database management system (DBMS) that is designed to handle and store large volumes of unstructured and semi-structured data.  Unlike traditional relational databases that use tables with pre-defined schemas to store data, NoSQL databases use flexible data models that can adapt to changes in data structures and are capable of scaling horizontally to handle growing amounts of data.
  • 3.  A NoSQL Database, also known as a non SQL or non-relational Database is a non-tabular Database that stores data differently than the tabular relations used in relational databases.  Companies widely used NoSQL Database generally for big Data and real-time web applications.
  • 4.  NoSQL Databases offer a simple design, horizontal scaling for clustering machines, and limit the object-relational impedance mismatch.  It uses different data structures from those used by relational Databases making some operations faster.  NoSQL Databases are designed to be flexible, scalable, and capable of rapidly responding to the data management demands of modern businesses.
  • 5. Why NoSQL  Dynamic schema: NoSQL databases do not have a fixed schema and can accommodate changing data structures without the need for migrations or schema alterations.  Horizontal scalability: NoSQL databases are designed to scale out by adding more nodes to a database cluster, making them well- suited for handling large amounts of data and high levels of traffic.  Document-based: Some NoSQL databases, such as MongoDB, use a document-based data model, where data is stored in a scalessemi-structured format, such as JSON or BSON.  Key-value-based: Other NoSQL databases, such as Redis, use a key-value data model, where data is stored as a collection of key- value pairs.  Column-based: Some NoSQL databases, such as Cassandra, use a column-based data model, where data is organized into columns instead of rows.
  • 6.  Distributed and high availability: NoSQL databases are often designed to be highly available and to automatically handle node failures and data replication across multiple nodes in a database cluster.  Flexibility: NoSQL databases allow developers to store and retrieve data in a flexible and dynamic manner, with support for multiple data types and changing data structures.  Performance: NoSQL databases are optimized for high performance and can handle a high volume of reads and writes, making them suitable for big data and real-time applications.
  • 7. Aggregate Data Models  Aggregate means a collection of objects that are treated as a unit. In NoSQL Databases, an aggregate is a collection of data that interact as a unit. Moreover, these units of data or aggregates of data form the boundaries for the ACID operations.  Aggregate Data Models in NoSQL make it easier for the Databases to manage data storage over the clusters as the aggregate data or unit can now reside on any of the machines. Whenever data is retrieved from the Database all the data comes along with the Aggregate Data Models in NoSQL.  Aggregate Data Models in NoSQL don’t support ACID transactions and sacrifice one of the ACID properties. With the help of Aggregate Data Models in NoSQL, you can easily perform OLAP operations on the Database
  • 8. Types of Aggregate Data Models Key-Value Model Document Model Column Family Model Graph-Based Model Types
  • 9. Key-Value Model  The Key-Value Data Model contains the key or an ID used to access or fetch the data of the aggregates corresponding to the key.  In this Aggregate Data Models in NoSQL, the data of the aggregates are secure and encrypted and can be decrypted with a Key
  • 10.  They are a simpler type of database where each item contains keys and values.  A value can typically only be retrieved by referencing its key, so learning how to query for a specific key- value pair is typically simple.  Key-value databases are great for use cases where you need to store large amounts of data but you don’t need to perform complex queries to retrieve it.  Common use cases include storing user preferences or caching. Redis and DynanoDB are popular key- value databases.
  • 11.  examples of key-value databases:  Couchbase: It permits SQL-style querying and searching for text.  Amazon DynamoDB: The key-value database which is mostly used is Amazon DynamoDB as it is a trusted database used by a large number of users. It can easily handle a large number of requests every day and it also provides various security options.  Riak: It is the database used to develop applications.  Aerospike: It is an open-source and real-time database working with billions of exchanges.  Berkeley DB: It is a high-performance and open-source database providing scalability.
  • 12. Document Model  It store data in documents similar to JSON (JavaScript Object Notation) objects.  Each document contains pairs of fields and values.  The values can typically be a variety of types including things like strings, numbers, booleans, arrays, or objects, and their structures typically align with objects developers are working with in code.  Because of their variety of field value types and powerful query languages, document databases are great for a wide variety of use cases and can be used as a general purpose database.  They can horizontally scale-out to accommodate large data volumes.  MongoDB is consistently ranked as the world’s most popular NoSQL database according to DB-engines and is an example of a document database.
  • 13.
  • 14.  Examples of Document Data Models :  Amazon DocumentDB  MongoDB  Cosmos DB  ArangoDB  Couchbase Server  CouchDB
  • 15. Column Family Model  It store data in tables, rows, and dynamic columns.  Wide-column stores provide a lot of flexibility over relational databases because each row is not required to have the same columns.  Many consider wide-column stores to be two- dimensional key-value databases.  Wide-column stores are great for when you need to store large amounts of data and you can predict what your query patterns will be.  Wide-column stores are commonly used for storing Internet of Things data and user profile data.  Cassandra and HBase are two of the most popular wide- column stores.
  • 16.  the first level of the Column family contains the keys that act as a row identifier that is used to select the aggregate data. Whereas the second level values are referred to as columns
  • 17.
  • 18.
  • 19.
  • 20. Graph-Based Model  Graph or network data models consider the relationship between two pieces of information to be as meaningful as the information itself.  As such, this data model is really made for any information you would typically represent in a chart.  It uses relationships and nodes, where the data is the information itself, and the connection is created between the nodes.
  • 21.  It store data in nodes and edges.  Nodes typically store information about people, places, and things while edges store information about the relationships between the nodes.  Graph databases excel in use cases where you need to traverse relationships to look for patterns such as social networks, fraud detection, and recommendation engines.  Neo4j and JanusGraph are examples of graph databases.
  • 22.
  • 23.
  • 24.  Graph-based Data Models are used in social networking sites to store interconnections.  It is used in fraud detection systems.  This Data Model is also widely used in Networks and IT operations. 
  • 26. Distribution Models  The primary driver of interest in NoSQL has been its ability to run databases on a large cluster.  As data volumes increase, it becomes more difficult and expensive to scale up—buy a bigger server to run the database on.  A more appealing option is to scale out—run the database on a cluster of servers.  Aggregate orientation fits well with scaling out because the aggregate is a natural unit to use for distribution.
  • 28. Single Server  NO distribution at all  Run the database on a single machine that handles all the reads and writes to the data store  it eliminates all the complexities  it’s easy for operations people to manage and easy for application developers to reason about
  • 29. Single Server NO distribution at all Run the database on a single machine that handles all the reads and writes to the data store it eliminates all the complexities it’s easy for operations people to manage and easy for application developers Graph databases
  • 30.  Replication  Replication takes the same data and copies it over multiple nodes.  Two Types:  master-slave  peer-to-peer
  • 31. Master-Slave Replication  Replicate data across multiple nodes.  One node is designated as the master, or primary.  This master is the authoritative source for the data and is usually responsible for processing any updates to that data.  The other nodes are slaves, or secondaries.  A replication process synchronizes the slaves with the master
  • 32.
  • 33.  Advantages  read-intensive dataset.  You can scale horizontally to handle more read requests by adding more slave nodes and ensuring that all read requests are routed to the slaves  read resilience  Should the master fail, the slaves can still handle read requests. Again, this is useful if most of your data access is reads. The failure of the master does eliminate the ability to handle writes until either the master is restored or a new master is appointed.
  • 34.  appoint a slave  replace a failed master means that master-slave replication is useful even if you don’t need to scale out. All read and write traffic can go to the master while the slave acts as a hot backup.  Masters can be appointed manually or automatically  Manual appointing typically means that when you configure your cluster, you configure one node as the master.  With automatic appointment, you create a cluster of nodes and they elect one of themselves to be the master.
  • 35.  Disadvantages:  Inconsistency  You have the danger that different clients, reading different slaves, will see different values because the changes haven’t all propagated to the slaves.  if the master fails, any updates not passed on to the backup are lost.
  • 36. Peer-to-Peer Replication  Peer-to-peer replication attacks these problems by not having a master.  All the replicas have equal weight, they can all accept writes, and the loss of any of them doesn’t prevent access to the data store.  With a peer-to-peer replication cluster, you can ride over node failures without losing access to data.  Furthermore, you can easily add nodes to improve your performance.
  • 37.
  • 38.  Problem: Inconsistency:  When you can write to two different places, you run the risk that two people will attempt to update the same record at the same time—a write-write conflict.  Inconsistencies on read lead to problems but at least they are relatively transient.  Inconsistent writes are forever.
  • 39.  Solutions  1. We can ensure that whenever we write data, the replicas coordinate to avoid a conflict.  We don't need all the replicas to agree on the write, just a majority, so we can still survive losing a minority of the replica nodes.  2. We can decide to cope with an inconsistent write.  In some contexts we can come up with a policy to merge inconsistent writes; in that case we get the full performance benefit of writing to any replica.
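A small sketch of the majority-coordination idea from option 1, using a toy Replica class (hypothetical, in-memory); a real peer-to-peer store would coordinate over the network.

```python
class Replica:
    """Toy replica: stores data locally; a real one would be a remote node."""
    def __init__(self, up=True):
        self.up, self.data = up, {}

    def apply(self, key, value):
        if not self.up:
            raise ConnectionError("replica unreachable")
        self.data[key] = value

def quorum_write(replicas, key, value):
    """Attempt the write on every replica; succeed only if a majority acknowledge it."""
    acks = 0
    for replica in replicas:
        try:
            replica.apply(key, value)
            acks += 1
        except ConnectionError:
            pass                     # no acknowledgment from this replica
    majority = len(replicas) // 2 + 1
    return acks >= majority          # the write survives losing a minority of nodes

replicas = [Replica(), Replica(), Replica(up=False)]   # one of three nodes is down
print(quorum_write(replicas, "room:101", "booked"))    # True: 2 of 3 acks is a majority
```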
  • 40. Sharding  Often, a busy data store is busy because different people are accessing different parts of the dataset.  In these circumstances we can support horizontal scalability by putting different parts of the data onto different servers, a technique called sharding.
  • 41.
  • 42.  In the ideal case, we have different users all talking to different server nodes. Each user only has to talk to one server, so gets rapid responses from that server.  The load is balanced out nicely between servers  for example, if we have ten servers, each one only has to handle 10% of the load.
  • 43.  To get close to this ideal, we have to ensure that data that's accessed together is clumped together on the same node and that these clumps are arranged across the nodes to provide the best data access.  Aggregates are designed to combine data that's commonly accessed together, so aggregates leap out as the obvious unit of distribution.
  • 44.  If most accesses of certain aggregates are based on a physical location, you can place the data close to where it's being accessed.  Another factor is trying to keep the load even: you should try to arrange aggregates so they are evenly distributed across the nodes and each node gets an equal share of the load.
  • 45.  Many NoSQL databases offer auto-sharding, where the database takes on the responsibility of allocating data to shards and ensuring that data access goes to the right shard.  Sharding can improve both read and write performance; in particular, it provides a way to horizontally scale writes.  On its own, however, sharding does little for resilience: a node failure makes that shard's data unavailable, and it's not good to have a database with part of its data missing.
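One common way auto-sharding can allocate aggregates is to hash the shard key. The sketch below is a simplified hash-mod placement; it ignores rebalancing schemes such as consistent hashing, and the key names are made up.

```python
import hashlib

def shard_for(key, num_shards):
    """Map a shard key to a shard index using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# With ten shards, each server ends up handling roughly a tenth of the keys.
print(shard_for("customer:42", 10))
print(shard_for("customer:43", 10))
```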
  • 46. Combining Sharding and Replication  If we use both master-slave replication and sharding this means that we have multiple masters, but each data item only has a single master.  Depending on your configuration, you may choose a node to be a master for some data and slaves for others, or you may dedicate nodes for master or slave duties.
  • 47. Using master-slave replication together with sharding
  • 48.  Using peer-to-peer replication and sharding is a common strategy for column-family databases.  In a scenario like this you might have tens or hundreds of nodes in a cluster with data sharded over them.  A good starting point for peer-to-peer replication is a replication factor of 3, so each shard is present on three nodes.  Should a node fail, the shards on that node will be rebuilt on the other nodes.
  • 49. Using peer-to-peer replication together with sharding
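A toy sketch of a replication factor of 3: each shard is placed on its "home" node plus the next two in a ring of assumed node names. Real column-family stores use more elaborate token rings, but the idea is the same.

```python
def placement(shard_id, nodes, replication_factor=3):
    """Pick the nodes that hold a shard: the home node plus the next ones in the ring."""
    start = shard_id % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication_factor)]

nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
print(placement(7, nodes))   # every shard lives on three different nodes
```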
  • 50. Consistency  Consistency is the biggest change when moving from a centralized relational database to a cluster-oriented NoSQL system.  Relational databases: strong consistency.  NoSQL systems: mostly eventual consistency.
  • 51.  Update Consistency  Two users update the same data item at the same time, each in a slightly different way; this is called a write-write conflict.  When the writes reach the server, the server serializes them, deciding to apply one and then the other.  Without concurrency control, one update is applied and then immediately overwritten by the other: a lost update.
  • 52.  Read-read (or simply read) conflict:  Different people see different data at the same time  Stale data: out of date  Replication is a source of inconsistency
  • 53.  Read-write conflict  A read in the middle of two logically-related writes
  • 54.  Solutions  Pessimistic approach  Prevent conflicts from occurring.  Usually implemented with write locks managed by the system  Optimistic approach  Lets conflicts occur, but detects them and takes action to sort them out  Approaches (for write-write conflicts):  conditional updates: test the value just before updating  save both updates: record that they are in conflict and then merge them
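A minimal sketch of the conditional-update flavour of the optimistic approach, using a hypothetical in-memory store with per-item version numbers; real databases expose the same idea as compare-and-set operations or an update qualified by the expected version.

```python
class OptimisticStore:
    """In-memory store illustrating a conditional update (compare-and-set)."""
    def __init__(self):
        self.values = {}    # key -> (version, value)

    def read(self, key):
        return self.values.get(key, (0, None))

    def conditional_update(self, key, expected_version, new_value):
        version, _ = self.values.get(key, (0, None))
        if version != expected_version:
            return False            # someone else updated it first: conflict detected
        self.values[key] = (version + 1, new_value)
        return True

store = OptimisticStore()
v, _ = store.read("phone")
assert store.conditional_update("phone", v, "123-456")        # first writer wins
assert not store.conditional_update("phone", v, "123 456")    # second writer must retry
```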
  • 55.  Pessimistic vs. optimistic approaches  Concurrency involves a fundamental tradeoff between consistency (avoiding errors such as update conflicts) and availability (responding quickly to clients).  Pessimistic approaches often severely degrade the responsiveness of a system and can lead to deadlocks, which are hard to prevent and debug.
  • 56.  Forms of consistency  Strong (or immediate) consistency :  ACID transaction  Logical consistency :  No read-write conflicts (atomic transactions)  Sequential consistency :  Updates are serialized  Session (or read-your-writes) consistency  Within a user’s session  Eventual consistency  You may have replication inconsistencies but eventually all nodes will be updated to the same value
  • 57.  Relaxing Consistency  Consistency is a Good Thing, but sadly we sometimes have to sacrifice it.  It is always possible to design a system to avoid inconsistencies, but often impossible to do so without making unbearable sacrifices in other characteristics of the system.  As a result, we often have to trade off consistency for something else.  Trading off consistency is a familiar concept even in single-server relational database systems.
  • 58.  The tool relational databases use to enforce consistency is the transaction, and transactions can provide strong consistency guarantees.  Transaction systems usually come with the ability to relax isolation levels, allowing queries to read data that hasn't been committed yet, and in practice most applications relax consistency down from the highest isolation level (serializable) in order to get acceptable performance.  We most commonly see people using the read-committed isolation level, which eliminates some read-write conflicts but allows others.
  • 59. The CAP Theorem  Proposed by Eric Brewer in 2000 and given a formal  proof by Seth Gilbert and Nancy Lynch [Lynch and Gilbert] a couple of years later.  “Given the three properties of Consistency, Availability, and Partition tolerance, you can only get two.”
  • 60.  CAP states that, in the presence of failures, a shared-data system can have at most two of these three properties.  To scale out, you have to distribute resources, so partition tolerance is not really an option but a necessity.  The real choice is therefore between consistency and availability, and in almost all cases you would choose availability over consistency.
  • 61.  Consistency: all users see the same data at the same time.  Availability: a guarantee that every request receives a response indicating whether it succeeded or failed; it does not, however, guarantee that a read returns the most recent write.  The more users a system can cater to, the better its availability.
  • 62.  Partition tolerance: the system continues to operate despite communication breakages that separate the cluster into partitions unable to communicate with each other.  Of these three guarantees, no system can provide more than two.  Since network partitions are unavoidable in distributed systems, the tradeoff is in practice between consistency and availability.
  • 63. With two breaks in the communication lines, the network partitions into two groups.
  • 64.  Martin and Pramod are both trying to book the last hotel room on a system that uses peer-to-peer distribution with two nodes (London for Martin and Mumbai for Pramod).  If we want to ensure consistency, then when Martin tries to book his room on the London node, that node must communicate with the Mumbai node before confirming the booking.  Essentially, both nodes must agree on the serialization of their requests.  This gives us consistency—but should the network link break, then neither system can book any hotel room, sacrificing availability.
  • 65.  One way to improve availability is to designate one node as the master for a particular hotel and ensure all bookings are processed by that master. Should that master be Mumbai, then Mumbai can still process hotel bookings for that hotel and Pramod will get the last room, whereas Martin can see the inconsistent room information but cannot make a booking (which would in this case cause an update inconsistency). This is a lack of availability for Martin.
  • 66.  To gain more availability, we might allow both systems to keep accepting hotel reservations even if a link in the network breaks down.  This may cause both Martin and Pramod to book the same room, an inconsistency.  In this domain, however, that might be tolerable: the travel company may accept some overbooking, some hotels always keep a few rooms clear even when they are fully booked, and some hotels might even cancel the booking with an apology once they detect the conflict.
  • 67.
  • 68.  CA (Consistency and Availability)  The system provides both consistency and availability but cannot tolerate a network partition, which in practice limits it to a single node or a network assumed never to fail.  Example databases: traditional single-server relational databases such as MySQL or PostgreSQL.  AP (Availability and Partition Tolerance)  The system prioritizes availability over consistency and can respond with possibly stale data.  The system can be distributed across multiple nodes and is designed to keep operating even in the face of network partitions.  Example databases: Cassandra, CouchDB, Riak, Voldemort, Amazon DynamoDB.
  • 69.  CP (Consistency and Partition Tolerance)  The system prioritizes consistency over availability and responds with the latest updated data, refusing requests it cannot serve consistently.  The system can be distributed across multiple nodes and is designed to operate reliably even in the face of network partitions.  Example databases: Apache HBase, MongoDB, Redis, Google Cloud Spanner.
  • 70. Version Stamps  Provide a means of detecting concurrency conflicts.  Each data item has a version stamp that gets incremented each time the item is updated.  Before updating a data item, a process can check its version stamp to see if it has been updated since it was last read.  Implementation methods:  Counter – requires a single master to “own” the counter.  GUID (Globally Unique Identifier) – can be computed by any node, but GUIDs are large and cannot be compared directly.  Hash of the contents of the resource.  Timestamp of last update – node clocks must be synchronized.
  • 71.  Counter – requires a single master to “own” the counter  Always incrementing it when you update the resource.  Counters are useful since they make it easy to tell if one version is more recent than another.  On the other hand, they require the server to generate the counter value, and also need a single master to ensure the counters aren’t duplicated.
  • 72.  For example, suppose R1 to R6 are replicas and R7 is the master (M1).  C is the counter variable and its current value is 3; a is a database item whose value is 1.  All replicas hold the item value a = 1 and the counter value C = 3, as shown in the following figure.
  • 73.  Now replica R3 wants to update the database item to a = 6, so the counter value must be incremented first.  The server on the master side increments the counter by 1, to C = 4, and the new values of the item and the counter are reflected at replica R3 and at the master.  Since C = 4 is now the highest counter value, this update is the most recent one, and the master propagates it to all replicas.  If any other replica later wants to update the item, its counter value must again be updated via the master, which then communicates the change to the other replicas.
  • 74.
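The counter example above can be sketched in code as follows, using toy Master and Replica classes with the same starting values (C = 3, a = 1); a real system would exchange these messages over the network.

```python
class Replica:
    def __init__(self):
        self.value, self.stamp = 1, 3          # a = 1, C = 3 initially

    def apply(self, value, stamp):
        if stamp > self.stamp:                 # a higher counter means a more recent version
            self.value, self.stamp = value, stamp

class Master:
    """The master owns the counter and hands out version stamps for updates."""
    def __init__(self, replicas):
        self.counter, self.value = 3, 1
        self.replicas = replicas

    def update(self, new_value):
        self.counter += 1                      # C becomes 4 for this update
        self.value = new_value
        for r in self.replicas:                # propagate (value, stamp) to every replica
            r.apply(new_value, self.counter)
        return self.counter

replicas = [Replica() for _ in range(6)]       # R1..R6
master = Master(replicas)                      # R7
master.update(6)                               # afterwards a = 6 and C = 4 everywhere
print(replicas[2].value, replicas[2].stamp)    # 6 4
```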
  • 75.  GUID (Globally Unique Identifier) – can be computed by any node, but GUIDs are large and cannot be compared directly.  A GUID is a large random number that can be treated as unique for practical purposes; it is typically generated from a combination of the date, hardware information, and other sources of randomness, so two GUIDs will essentially never be the same.  The disadvantage is that GUIDs are very large and, being random, cannot be compared to determine which version is more recent.
  • 76.  Hashing -Hash the contents of a resource  With a big enough hash key size, a content hash can be globally unique like a GUID and can also be generated by anyone;  Advantage is that they are deterministic—any node will generate the same content hash for same resource data.  However, like GUIDs they can’t be directly compared for recentness, and they can be lengthy
  • 77.  For example, consider a database item a = 1 with replicas R1 to R6 and R7 as the master.  Replica R1 wants to update the database item to the value 43.  With a simple (toy) hash function of mod 10, the new content hashes to 43 % 10 = 3, and any node that hashes this version of the data gets the same stamp.  Matching hashes tell the nodes which replicas hold the same version of the item (here, that R1 holds the new update), although, as noted above, content hashes cannot by themselves be ordered for recentness.
  • 78.
  • 79.  Timestamp of last update – node clocks must be synchronized.  Whenever an update is made, its timestamp is recorded.  Timestamps work much like counters and can be compared for recentness, but unlike counters they can be generated by many replicas (machines) without a single master.  All machine clocks must be kept synchronized with each other; if any replica has a bad clock that drifts out of synchronization, data corruption can result.  Database administrators must also check the granularity of the timestamps, otherwise timestamps can be duplicated; millisecond precision is usually adequate.
  • 80.
  • 81.  Now replica R2 wants to update the database item to a = 4.  R2's timestamp is recorded with millisecond precision and is now TS = 3-25-22 02:31:29,571.  Comparing it with the other replicas' timestamps, 571 > 570 at the millisecond level, so R2 contains the most recent value.
  • 82.
  • 83. Map-Reduce  When you have a cluster, you have lots of machines to spread the computation over.  However, you also still need to reduce the amount of data that has to be transferred across the network.  The map-reduce pattern is a way to organize processing so as to take advantage of multiple machines on a cluster while keeping as much of the processing, and the data it needs, together on the same machine.
  • 84.  This programming model gained prominence with Google’s MapReduce framework [Dean and Ghemawat, OSDI-04].  A widely used open-source implementation is part of the Apache Hadoop project.  The name “map-reduce” reveals its inspiration from the map and reduce operations on collections in functional programming languages
  • 85.  Map Reduce – benefits  Complex details are abstracted away from the  developer – No file I/O – No networking code – No synchronization  It’s scalable because you process one record at a time  A record consists of a key and corresponding value
  • 86.  Example:  Let us consider the usual scenario of customers and orders.  We have chosen the order as our aggregate, with each order having line items; each line item has a product ID, a quantity, and the price charged.  Sales analysis people want to see a product and its total revenue for the last seven days.
  • 87.  In order to get the product revenue report, you’ll have to visit every machine in the cluster and examine many records on each machine.  This is exactly the kind of situation that calls for map-reduce. Again, the first stage in a map-reduce job is the map.  A map is a function whose input is a single aggregate and whose output is a bunch of key-value pairs.  In this case, the input would be an order, and the output would be key-value pairs corresponding to the line items
  • 88.  For this example we are just selecting a value out of the record, but there's no reason why we can't carry out some arbitrarily complex function as part of the map, provided it only depends on one aggregate's worth of data.
  • 89.  Each such pair would have the product ID as the key and an embedded map with the quantity and price as the value
  • 90.  The reduce function takes multiple map outputs with the same key and combines their values
  • 91.
  • 92.  A map function might yield 1000 line items from orders for “Database Refactoring”; the reduce function would reduce down to one, with the totals for the quantity and revenue.  While the map function is limited to working only on data from a single aggregate, the reduce function can use all values emitted for a single key
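The order example can be sketched as follows. The field names (productId, lineItems, and so on) are assumptions, and the in-memory grouping stands in for the shuffle a real framework performs across nodes.

```python
from collections import defaultdict

def map_order(order):
    """Map: one order aggregate in, one (productId, {quantity, revenue}) pair per line item out."""
    for item in order["lineItems"]:
        yield item["productId"], {"quantity": item["quantity"],
                                  "revenue": item["quantity"] * item["price"]}

def reduce_product(product_id, values):
    """Reduce: combine every value emitted for one product into a single total."""
    total = {"quantity": 0, "revenue": 0}
    for v in values:
        total["quantity"] += v["quantity"]
        total["revenue"] += v["revenue"]
    return product_id, total

orders = [
    {"lineItems": [{"productId": "refactoring-db", "quantity": 2, "price": 40}]},
    {"lineItems": [{"productId": "refactoring-db", "quantity": 1, "price": 40},
                   {"productId": "nosql-distilled", "quantity": 3, "price": 25}]},
]

grouped = defaultdict(list)               # the "shuffle": group map output by key
for order in orders:
    for key, value in map_order(order):
        grouped[key].append(value)

print([reduce_product(k, vs) for k, vs in grouped.items()])
```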
  • 93. Partitioning and Combining  To increase parallelism, we can partition the output of the mappers and send each partition to a different reducer.  To take advantage of this, the results of the mapper are divided up based on the key on each processing node.  Typically, multiple keys are grouped together into partitions.  The framework then takes the data from all the nodes for one partition, combines it into a single group for that partition, and sends it off to a reducer.  Multiple reducers can then operate on the partitions in parallel, with the final results merged together.  (This step is also called “shuffling,” and the partitions are sometimes referred to as “buckets” or “regions.”)
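A minimal sketch of the partition assignment: a stable hash of the key decides which reducer a pair is shuffled to, similar in spirit to Hadoop's default hash partitioner.

```python
import zlib

def partition_for(key, num_reducers):
    """Assign a map-output key to a partition; a stable hash ensures every node agrees."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

# Every occurrence of the same key, on every node, lands in the same partition.
print(partition_for("refactoring-db", 4))
print(partition_for("nosql-distilled", 4))
```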
  • 94.  The next problem we can deal with is the amount of data being moved from node to node between the map and reduce stages.  Much of this data is repetitive, consisting of multiple key-value pairs for the same key.  A combiner function cuts this data down by combining all the data for the same key into a single value.  A combiner function is, in essence, a reducer function; indeed, in many cases the same function can be used for combining and for the final reduction.  For this to work, the reduce function needs a special shape: its output must match its input.  We call such a function a combinable reducer.
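A sketch of a combinable reducer for the revenue example: because its output values have the same shape as its input values, the same function can pre-combine data on each node before transmission and also perform the final reduction. The data shapes are reused from the earlier sketch, not taken from any particular framework.

```python
def combine_revenue(values):
    """Combinable reducer: output values have the same shape as input values,
    so it can run on each node before shuffling and again for the final reduce."""
    total = {"quantity": 0, "revenue": 0}
    for v in values:
        total["quantity"] += v["quantity"]
        total["revenue"] += v["revenue"]
    return [total]          # a list of values, same shape as the input list

# Pre-combine on one node, then feed the combined output into the final reduction:
node_local = combine_revenue([{"quantity": 2, "revenue": 80},
                              {"quantity": 1, "revenue": 40}])
final = combine_revenue(node_local + [{"quantity": 3, "revenue": 120}])
print(final)                # [{'quantity': 6, 'revenue': 240}]
```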
  • 95.
  • 96.  When you have combining reducers, the map-reduce framework can safely run not only in parallel (to reduce different partitions), but also in series to reduce the same partition at different times and places.  In addition to allowing combining to occur on a node before data transmission, you can also start combining before mappers have finished.  This provides a good bit of extra flexibility to the map-reduce processing. Some map-reduce frameworks require all reducers to be combining reducers, which maximizes this flexibility.  If you need to do a noncombining reducer with one of these frameworks, you’ll need to separate the processing into pipelined map-reduce steps.