Introduction to NoSQL
Chapter 2
What is a NoSQL Database?
 NoSQL is a type of database management system
(DBMS) that is designed to handle and store large
volumes of unstructured and semi-structured data.
 Unlike traditional relational databases that use
tables with pre-defined schemas to store data,
NoSQL databases use flexible data models that can
adapt to changes in data structures and are capable
of scaling horizontally to handle growing amounts
of data.
 A NoSQL Database, also known as a non-SQL or non-relational Database, is a non-tabular Database that stores data differently from the tabular relations used in relational databases.
 Companies widely use NoSQL Databases, generally for big data and real-time web applications.
 NoSQL Databases offer a simple design, horizontal scaling across clusters of machines, and reduce the object-relational impedance mismatch.
 They use different data structures from those used by relational Databases, making some operations faster.
 NoSQL Databases are designed to be flexible,
scalable, and capable of rapidly responding to the
data management demands of modern businesses.
Why NoSQL
 Dynamic schema: NoSQL databases do not have a fixed schema
and can accommodate changing data structures without the need
for migrations or schema alterations.
 Horizontal scalability: NoSQL databases are designed to scale
out by adding more nodes to a database cluster, making them well-
suited for handling large amounts of data and high levels of traffic.
 Document-based: Some NoSQL databases, such as MongoDB, use a document-based data model, where data is stored in a semi-structured format, such as JSON or BSON.
 Key-value-based: Other NoSQL databases, such as Redis, use a
key-value data model, where data is stored as a collection of key-
value pairs.
 Column-based: Some NoSQL databases, such as Cassandra, use a
column-based data model, where data is organized into columns
instead of rows.
 Distributed and high availability: NoSQL databases
are often designed to be highly available and to
automatically handle node failures and data replication
across multiple nodes in a database cluster.
 Flexibility: NoSQL databases allow developers to store
and retrieve data in a flexible and dynamic manner, with
support for multiple data types and changing data
structures.
 Performance: NoSQL databases are optimized for high
performance and can handle a high volume of reads and
writes, making them suitable for big data and real-time
applications.
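To make the dynamic-schema point concrete, here is a small plain-Python sketch (no particular database assumed; the product records and field names are invented for illustration): two records can sit in the same collection even though they do not share the same fields.

# Two records in the same logical collection; neither is forced to match a
# predeclared schema. The second record adds fields the first never had.
products = [
    {"_id": 1, "name": "Laptop", "price": 999.00},
    {"_id": 2, "name": "Phone", "price": 499.00,
     "colors": ["black", "silver"],                 # new array field
     "specs": {"ram_gb": 8, "storage_gb": 128}},    # new nested object
]

# Application code simply tolerates missing fields instead of altering a schema.
for p in products:
    colors = p.get("colors", [])
    print(p["name"], colors)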
Aggregate Data Models
 An aggregate is a collection of objects that are treated as a unit. In NoSQL Databases, an aggregate is a collection of data that interacts as a unit, and these units of data form the boundaries for ACID operations.
 Aggregate Data Models in NoSQL make it easier for the Database to manage data storage over a cluster, since an aggregate can reside as a unit on any of the machines. Whenever an aggregate is retrieved from the Database, all of its data comes along with it.
 Aggregate Data Models in NoSQL generally don't support ACID transactions that span multiple aggregates; atomicity is guaranteed only within a single aggregate, so one of the ACID properties is sacrificed across aggregates. With Aggregate Data Models in NoSQL, you can still perform analytical (OLAP-style) operations on the Database.
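As a sketch of what an aggregate looks like, the order below is a plain Python dict whose field names are purely illustrative; it bundles the line items and payment details into one unit that can live together on a single node.

# A single "order" aggregate: everything needed to process the order travels
# together, which is what makes it a natural unit for distribution.
order = {
    "order_id": "o-1001",
    "customer_id": "c-42",
    "line_items": [
        {"product_id": "p-1", "name": "Database Refactoring", "quantity": 2, "price": 30.00},
        {"product_id": "p-7", "name": "NoSQL Distilled", "quantity": 1, "price": 25.00},
    ],
    "payment": {"type": "card", "billing_address": "..."},
}

order_total = sum(li["quantity"] * li["price"] for li in order["line_items"])
print(order_total)  # 85.0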
Types of Aggregate Data Models
 Key-Value Model
 Document Model
 Column Family Model
 Graph-Based Model
Key-Value Model
 The Key-Value Data Model contains the key or an ID used
to access or fetch the data of the aggregates
corresponding to the key.
 In this Aggregate Data Model, the value of an aggregate is opaque to the database: it can only be looked up and retrieved using its Key.
 They are a simpler type of database where each item
contains keys and values.
 A value can typically only be retrieved by referencing
its key, so learning how to query for a specific key-
value pair is typically simple.
 Key-value databases are great for use cases where
you need to store large amounts of data but you don’t
need to perform complex queries to retrieve it.
 Common use cases include storing user preferences or caching. Redis and DynamoDB are popular key-value databases.
 examples of key-value databases:
 Couchbase: It permits SQL-style querying and
searching for text.
 Amazon DynamoDB: The key-value database which is
mostly used is Amazon DynamoDB as it is a trusted
database used by a large number of users. It can easily
handle a large number of requests every day and it also
provides various security options.
 Riak: It is a distributed key-value database used to build highly available applications.
 Aerospike: It is an open-source, real-time database that can handle billions of transactions.
 Berkeley DB: It is a high-performance and open-source
database providing scalability.
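A hedged sketch of key-value access using the redis-py client; it assumes a Redis server running locally on the default port, and the key names and stored values are invented for illustration.

import json
import redis

# Assumes a Redis server on localhost:6379; adjust connection details as needed.
r = redis.Redis(host="localhost", port=6379, db=0)

# Store a user's preferences as an opaque value under a key we choose.
r.set("user:42:prefs", json.dumps({"theme": "dark", "language": "en"}))

# The only way to get the value back is by its key.
prefs = json.loads(r.get("user:42:prefs"))
print(prefs["theme"])

# Typical caching pattern: set a value with a time-to-live of 60 seconds.
r.setex("session:abc123", 60, "cached-session-payload")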
Document Model
 Document databases store data in documents similar to JSON (JavaScript Object Notation) objects.
 Each document contains pairs of fields and values.
 The values can typically be a variety of types including things like
strings, numbers, booleans, arrays, or objects, and their structures
typically align with objects developers are working with in code.
 Because of their variety of field value types and powerful query
languages, document databases are great for a wide variety of use cases
and can be used as a general purpose database.
 They can horizontally scale-out to accommodate large data volumes.
 MongoDB is consistently ranked as the world’s most popular NoSQL
database according to DB-engines and is an example of a document
database.
 Examples of Document Data Models :
 Amazon DocumentDB
 MongoDB
 Cosmos DB
 ArangoDB
 Couchbase Server
 CouchDB
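A hedged sketch of storing and querying documents with MongoDB through the PyMongo driver; the connection string, database, collection, and document fields are assumptions made for illustration.

from pymongo import MongoClient

# Assumes a MongoDB instance on localhost; adjust the URI for your deployment.
client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]

# Documents with different shapes can live in the same collection.
orders.insert_one({
    "customer": "Ann",
    "items": [{"product": "Database Refactoring", "qty": 2, "price": 30.00}],
    "status": "shipped",
})

# Query on nested fields without joins; the whole aggregate comes back.
for doc in orders.find({"items.product": "Database Refactoring"}):
    print(doc["customer"], doc["status"])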
Column Family Model
 Column-family (wide-column) databases store data in tables, rows, and dynamic columns.
 Wide-column stores provide a lot of flexibility over
relational databases because each row is not required to
have the same columns.
 Many consider wide-column stores to be two-
dimensional key-value databases.
 Wide-column stores are great for when you need to store
large amounts of data and you can predict what your
query patterns will be.
 Wide-column stores are commonly used for storing
Internet of Things data and user profile data.
 Cassandra and HBase are two of the most popular wide-
column stores.
 The first level of a column-family store contains keys that act as row identifiers, used to select the aggregate data, whereas the second-level values are referred to as columns.
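A hedged sketch of a column-family table using the DataStax Python driver for Cassandra; it assumes a local node and an existing keyspace named iot, and the table and column names are invented.

from cassandra.cluster import Cluster

# Assumes a local Cassandra node and an existing keyspace named "iot".
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("iot")

# Rows are grouped by partition key (sensor_id); columns can vary per row.
session.execute("""
    CREATE TABLE IF NOT EXISTS readings (
        sensor_id text,
        reading_time timestamp,
        temperature double,
        PRIMARY KEY (sensor_id, reading_time)
    )
""")

session.execute(
    "INSERT INTO readings (sensor_id, reading_time, temperature) "
    "VALUES (%s, toTimestamp(now()), %s)",
    ("sensor-1", 21.5),
)

# The query pattern is known up front: all readings for one sensor.
for row in session.execute("SELECT * FROM readings WHERE sensor_id = %s", ("sensor-1",)):
    print(row.reading_time, row.temperature)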
Graph-Based Model
 Graph or network data models consider the
relationship between two pieces of information to be
as meaningful as the information itself.
 As such, this data model is really made for any information you would typically represent in a graph.
 It uses relationships and nodes, where the data is the
information itself, and the connection is created
between the nodes.
 Graph databases store data in nodes and edges.
 Nodes typically store information about people,
places, and things while edges store information
about the relationships between the nodes.
 Graph databases excel in use cases where you need
to traverse relationships to look for patterns such as
social networks, fraud detection, and
recommendation engines.
 Neo4j and JanusGraph are examples of graph
databases.
 Graph-based Data Models are used in social
networking sites to store interconnections.
 It is used in fraud detection systems.
 This Data Model is also widely used in Networks and
IT operations.
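A hedged sketch of creating nodes and relationships and traversing them with the Neo4j Python driver; the connection URI, credentials, labels, and relationship type are assumptions for illustration.

from neo4j import GraphDatabase

# Assumes a local Neo4j instance; URI and credentials are placeholders.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Nodes hold the data; the FRIENDS_WITH relationship is data in its own right.
    session.run(
        "MERGE (a:Person {name: $a}) MERGE (b:Person {name: $b}) "
        "MERGE (a)-[:FRIENDS_WITH]->(b)",
        a="Alice", b="Bob",
    )
    session.run(
        "MERGE (a:Person {name: $a}) MERGE (b:Person {name: $b}) "
        "MERGE (a)-[:FRIENDS_WITH]->(b)",
        a="Bob", b="Carol",
    )

    # Traverse relationships (friends and friends-of-friends) in one query.
    result = session.run(
        "MATCH (:Person {name: $name})-[:FRIENDS_WITH*1..2]->(p) "
        "RETURN DISTINCT p.name AS name",
        name="Alice",
    )
    print([record["name"] for record in result])  # e.g. ['Bob', 'Carol']

driver.close()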
Data modeling details
Distribution Models
 The primary driver of interest in NoSQL has been its
ability to run databases on a large cluster.
 As data volumes increase, it becomes more difficult
and expensive to scale up—buy a bigger server to run
the database on.
 A more appealing option is to scale out—run the
database on a cluster of servers.
 Aggregate orientation fits well with scaling out
because the aggregate is a natural unit to use for
distribution.
Distribution techniques
 Replication (master-slave or peer-to-peer)
 Sharding
Single Server
 NO distribution at all
 Run the database on a single machine that handles
all the reads and writes to the data store
 it eliminates all the complexities
 it’s easy for operations people to manage and easy
for application developers to reason about
 Replication
 Replication takes the same data and copies it over
multiple nodes.
 Two Types:
 master-slave
 peer-to-peer
Master-Slave Replication
 Replicate data across multiple nodes.
 One node is designated as the master, or primary.
 This master is the authoritative source for the data
and is usually responsible for processing any updates
to that data.
 The other nodes are slaves, or secondaries.
 A replication process synchronizes the slaves with
the master
 Advantages
 read-intensive dataset.
 You can scale horizontally to handle more read requests
by adding more slave nodes and ensuring that all read
requests are routed to the slaves
 read resilience
 Should the master fail, the slaves can still handle
read requests. Again, this is useful if most of
your data access is reads. The failure of the
master does eliminate the ability to handle
writes until either the master is restored or a
new master is appointed.
 appoint a slave to replace a failed master
 The ability to appoint a slave to replace a failed master means that master-slave replication is useful even if you don't need to scale out. All read and write traffic can go to the master while the slave acts as a hot backup.
 Masters can be appointed manually or
automatically
 Manual appointing typically means that when you configure
your cluster, you configure one node as the master.
 With automatic appointment, you create a cluster of nodes
and they elect one of themselves to be the master.
 Disadvantages:
 Inconsistency
 You have the danger that different clients, reading different
slaves, will see different values because the changes haven’t all
propagated to the slaves.
 if the master fails, any updates not passed on to the
backup are lost.
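To make the read-scaling advantage concrete: with a replica-set-aware client such as PyMongo, reads can be routed to the slaves (secondaries) while writes still go to the master (primary). The host names, replica set name, and collection below are placeholders, so treat this as a sketch rather than a prescribed setup.

from pymongo import MongoClient

# Placeholders: host names and replica set name depend on your deployment.
client = MongoClient(
    "mongodb://node1:27017,node2:27017,node3:27017/?replicaSet=rs0",
    readPreference="secondaryPreferred",  # reads go to secondaries when available
)

db = client["shop"]

# Writes are always routed to the master/primary.
db.orders.insert_one({"customer": "Ann", "total": 85.0})

# Reads may be served by a secondary, which can return slightly stale data.
print(db.orders.count_documents({"customer": "Ann"}))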
Peer-to-Peer Replication
 Peer-to-peer replication attacks these problems by
not having a master.
 All the replicas have equal weight, they can all accept
writes, and the loss of any of them doesn’t prevent
access to the data store.
 With a peer-to-peer replication cluster, you can ride
over node failures without losing access to data.
 Furthermore, you can easily add nodes to improve
your performance.
 Problem: Inconsistency:
 When you can write to two different places, you run
the risk that two people will attempt to update the
same record at the same time—a write-write conflict.
 Inconsistencies on read lead to problems but at least
they are relatively transient.
 Inconsistent writes are forever.
 Solutions:
 1. Whenever we write data, the replicas coordinate to ensure they avoid a conflict. We don't need all the replicas to agree on the write, just a majority, so we can still survive losing a minority of the replica nodes.
 2. We can decide to cope with an inconsistent write. There are contexts in which we can come up with a policy to merge inconsistent writes. In this case we get the full performance benefit of writing to any replica.
Sharding
 Often, a busy data store is busy because different people are accessing different parts of the dataset.
 In these circumstances we can support horizontal
scalability by putting different parts of the data onto
different servers—a technique that’s called
sharding
 In the ideal case, we have different users all talking
to different server nodes. Each user only has to talk
to one server, so gets rapid responses from that
server.
 The load is balanced out nicely between servers
 for example, if we have ten servers, each one only has to
handle 10% of the load.
 In order to get close to this ideal case, we have to ensure that data that's accessed together is clumped together on the same node, and that these clumps are arranged on the nodes to provide the best data access.
 Aggregates are designed to combine data that's commonly accessed together, so aggregates leap out as an obvious unit of distribution.
 If most accesses of certain aggregates are based on a physical location, you can place the data close to where it's being accessed.
 Another factor is trying to keep the load even. This
means that you should try to arrange aggregates so
they are evenly distributed across the nodes which all
get equal amounts of the load.
 Many NoSQL databases offer auto-sharding,
where the database takes on the responsibility
of allocating data to shards and ensuring that data
access goes to the right shard.
 Sharding improves both read and write performance; in particular, it provides a way to horizontally scale writes.
 However, a node failure makes that shard's data unavailable, and it's not good to have a database with part of its data missing.
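A minimal sketch of how a client or router might map an aggregate's key to a shard; real auto-sharding databases do this (plus rebalancing) internally, and the shard names here are invented.

import hashlib

# Invented node list; real systems also rebalance shards when nodes join or leave.
SHARDS = ["shard-a", "shard-b", "shard-c", "shard-d"]

def shard_for(key: str) -> str:
    """Pick a shard by hashing the aggregate's key, so the same key always
    lands on the same node and the load spreads roughly evenly."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# All data for one customer aggregate is kept together on one shard.
print(shard_for("customer:42"))
print(shard_for("customer:43"))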
Combining Sharding and Replication
 If we use both master-slave replication and sharding
this means that we have multiple masters, but each
data item only has a single master.
 Depending on your configuration, you may choose a
node to be a master for some data and slaves for
others, or you may dedicate nodes for master or slave
duties.
Using master-slave replication together with sharding
 Using peer-to-peer replication and sharding is a
common strategy for column-family databases.
 In a scenario like this you might have tens or
hundreds of nodes in a cluster with data sharded
over them.
 A good starting point for peer-to-peer replication is
to have a replication factor of 3, so each shard is
present on three nodes.
 Should a node fail, then the shards on that node will be rebuilt on the other nodes.
Using peer-to-peer replication together with sharding
Consistency
 Biggest change from a centralized relational database
to a cluster oriented NoSQL
 Relational databases: strong consistency
 NoSQL systems: mostly eventual consistency
 Update Consistency
 Two people update the same data item at about the same time, each using a slightly different format.
 This is called a write-write conflict: two people updating the same data item at the same time.
 When the writes reach the server, the server will serialize them: decide to apply one, then the other.
 If the server simply serializes the two updates, one is applied and then immediately overwritten by the other (a lost update).
 Read-read (or simply read) conflict:
 Different people see different data at the same time
 Stale data: out of date
 Replication is a source of inconsistency
 Read-write conflict
 A read in the middle of two logically-related writes
 Solutions
 Pessimistic approach
 Prevent conflicts from occurring.
 Usually implemented with write locks managed by the system
 Optimistic approach
 Lets conflicts occur, but detects them and takes action to sort them
out
 Approaches (for write-write conflicts):
 conditional updates: test the value just before updating
 save both updates: record that they are in conflict and then
merge them
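A small sketch of the conditional-update approach using a version number and an in-memory dictionary standing in for the data store; the key and value names are invented.

# In-memory stand-in for a data store that keeps a version alongside each value.
store = {"room-101": {"version": 1, "value": "available"}}

class ConflictError(Exception):
    pass

def conditional_update(key, expected_version, new_value):
    """Apply the write only if nobody else updated the item since we read it."""
    record = store[key]
    if record["version"] != expected_version:
        # Someone else won the write-write race; the caller must re-read and retry.
        raise ConflictError(f"{key} changed (now at version {record['version']})")
    store[key] = {"version": expected_version + 1, "value": new_value}

# Two clients read version 1; only the first conditional update succeeds.
conditional_update("room-101", 1, "booked-by-martin")
try:
    conditional_update("room-101", 1, "booked-by-pramod")
except ConflictError as e:
    print("detected conflict:", e)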
 Pessimistic v/s optimistic approach
 Concurrency involves a fundamental tradeoff between:
 consistency (avoiding errors such as update conflicts) and
 availability (responding quickly to clients).
 Pessimistic approaches often:
 severely degrade the responsiveness of a system
 lead to deadlocks, which are hard to prevent and debug.
 Forms of consistency
 Strong (or immediate) consistency :
 ACID transaction
 Logical consistency :
 No read-write conflicts (atomic transactions)
 Sequential consistency :
 Updates are serialized
 Session (or read-your-writes) consistency
 Within a user’s session
 Eventual consistency
 You may have replication inconsistencies but eventually all nodes
will be updated to the same value
 Relaxing Consistency
 Consistency is a Good Thing—but, sadly, sometimes
we have to sacrifice it.
 It is always possible to design a system to avoid
inconsistencies, but often impossible to do so
without making unbearable sacrifices in other
characteristics of the system.
 As a result, we often have to tradeoff consistency for
something else.
 Trading off consistency is a familiar concept even in
single-server relational database systems
 The tool for enforcing consistency there is the transaction, and transactions can provide strong consistency guarantees.
 Transaction systems usually come with the ability to relax isolation levels, allowing queries to read data that hasn't been committed yet, and in practice we see most applications relax consistency down from the highest isolation level (serializable) in order to get effective performance.
 We most commonly see people using the read-committed
transaction level, which eliminates some read-write
conflicts but allows others
The CAP Theorem
 Proposed by Eric Brewer in 2000 and given a formal proof by Seth Gilbert and Nancy Lynch [Lynch and Gilbert] a couple of years later.
 “Given the three properties of Consistency,
Availability, and Partition tolerance, you can only get
two.”
 CAP states that in case of failures you can have at
most two of these three properties for any shared-
data system
 To scale out, you have to distribute resources.
 P is not really an option but rather a need
 The real choice is between Consistency and Availability
 In almost all cases, you would choose availability over
consistency.
 Consistency:
 all people see the same data at the same time
 Availability:
 guarantee that every request receives a response about
whether it was successful or failed.
 However, it does not guarantee that a read request returns the
most recent write.
 The more users a system can cater to, the better its availability.
 Partition tolerance:
 The system continues to operate despite communication
breakages that separate the cluster into partitions unable to
communicate with each other
 Out of these three guarantees, no system can provide
more than 2 guarantees.
 Since in the case of distributed systems, the
partitioning of the network is a must, the tradeoff is
always between consistency and availability.
With two breaks in the communication lines, the network
partitions into two groups.
 Martin and Pramod are both trying to book the last hotel
room on a system that uses peer-to-peer distribution
with two nodes (London for Martin and Mumbai for
Pramod).
 If we want to ensure consistency, then when Martin tries
to book his room on the London node, that node must
communicate with the Mumbai node before confirming
the booking.
 Essentially, both nodes must agree on the serialization of
their requests.
 This gives us consistency—but should the network link
break, then neither system can book any hotel room,
sacrificing availability.
 One way to improve availability is to designate one
node as the master for a particular hotel and ensure
all bookings are processed by that master. Should
that master be Mumbai, then Mumbai can still
process hotel bookings for that hotel and Pramod
will get the last room, whereas Martin can see the
inconsistent room information but cannot make a
booking (which would in this case cause an update
inconsistency). This is a lack of availability for
Martin.
 To gain more availability, we might allow both
systems to keep accepting hotel reservations even if a
link in the network breaks down.
 But this may cause both Martin and Pramod to book the same room, leading to inconsistency.
 In this domain, however, that might be tolerated: the travel company may tolerate some overbooking; some hotels might always keep a few rooms clear even when they are fully booked; some hotels might even cancel the booking with an apology once they detect the conflict.
 CA (Consistency and Availability)-
 The system provides consistency and availability, but only as long as there is no network partition; in practice this describes a non-distributed, single-server setup.
 Example databases: traditional single-server relational databases such as MySQL or PostgreSQL.
 AP (Availability and Partition Tolerance)-
 The system prioritizes availability over consistency and can respond with possibly stale data.
 The system can be distributed across multiple nodes and is designed to operate reliably even in the face of network partitions.
 Example databases: Amazon DynamoDB, Cassandra, CouchDB, Riak, Voldemort.
 CP(Consistency and Partition Tolerance)-
 The system prioritizes consistency over availability and
responds with the latest updated data.
 The system can be distributed across multiple nodes and is
designed to operate reliably even in the face of network
partitions.
 Example databases: Apache HBase, MongoDB, Redis, Google Cloud Spanner.
Version Stamps
 Provide a means of detecting concurrency conflicts
 Each data item has a version stamp which gets incremented
each time the item is updated
 Before updating a data item, a process can check its version
stamp to see if it has been updated since it was last read
 Implementation methods
 Counter – requires a single master to “own” the counter
 GUID (Globally Unique Identifier) – can be computed by any node, but GUIDs are large and cannot be compared directly for recentness
 Hash the contents of a resource
 Timestamp of last update – node clocks must be synchronized
 Counter – requires a single master to “own” the
counter
 Always incrementing it when you update the resource.
 Counters are useful since they make it easy to tell if one version is
more recent than another.
 On the other hand, they require the server to generate the counter
value, and also need a single master to ensure the counters aren’t
duplicated.
 For example:
 R1 to R6 are replicas and R7 is the master (M1). C is the counter variable, and its value is 3. The database item is a = 1, and all replicas hold this value. The following figure shows all replicas with a = 1 and C = 3.
 Now, replica R3 wants to update the database item value to 6, i.e., a = 6. So, the counter value must be incremented first.
 The server at the master increments the counter by 1, i.e., C = 4. The new values of the database item and the counter are reflected at replica R3 and at the master.
 Since C = 4 is now the highest counter value, this update is the most recent.
 The master then propagates this update to all replicas.
 If any other replica wants to update the database item, its counter value must first be updated and communicated to the master, and the master then communicates it to the other replicas.
 GUID (Globally Unique Identifier) – can be computed by any node, but GUIDs are large and cannot be compared directly.
 A GUID is a large random number that is guaranteed to be unique.
 This random number is created by combining things such as dates, hardware information, or other inputs.
 These GUIDs (random numbers) will never be the same.
 The disadvantage is that the random numbers are very large, so comparing them to determine which is more recent is difficult.
 Hashing -Hash the contents of a resource
 With a big enough hash key size, a content hash can be globally
unique like a GUID and can also be generated by anyone;
 Advantage is that they are deterministic—any node will
generate the same content hash for same resource data.
 However, like GUIDs they can’t be directly compared for
recentness, and they can be lengthy
 For example, consider database item a = 1, with replicas R1 to R6 and replica R7 as the master. Replica R1 wants to update database item a to the value 43. Using a simple hash function of mod 10 (10 is the hash key), it generates a bucket: 43 % 10 = 3. It is then easy to see that replica R1 holds the most recent update.
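A sketch of a content-hash version stamp using Python's hashlib; it shows that any node can compute the same stamp deterministically, but that two stamps cannot be ordered to tell which is more recent.

import hashlib
import json

def content_hash(resource: dict) -> str:
    """Deterministic hash of a resource's contents; identical data on any
    node yields the same stamp (sort_keys makes the serialization canonical)."""
    canonical = json.dumps(resource, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

v1 = content_hash({"a": 1})
v2 = content_hash({"a": 43})

print(v1 != v2)  # True: any change produces a different stamp
# Note: unlike counters, v1 and v2 cannot be compared to decide which is more recent.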
 Timestamp of last update – node clocks must be
synchronized
 Whenever an update is made, the timestamp is recorded and checked.
 Timestamps work the same way as counters, and they can be compared for recentness.
 In this scheme, many replicas (machines) can generate timestamps.
 Keep in mind that all machines' clocks must be synchronized with each other.
 If any replica has a bad clock (or its clock is not kept properly in synchronization), data corruption problems will arise.
 Database administrators must also check the granularity of timestamps, otherwise timestamps can end up duplicated. Timestamps work well when millisecond precision is used.
 Now, replica R2 wants to update the database item value to 4, i.e., a = 4. R2's timestamp is recorded with millisecond precision, e.g. TS = 3-25-22 02:31:29,571. It is compared with the other replicas' timestamps, and since 571 > 570 (at millisecond precision), R2 contains the most up-to-date value.
Map-Reduce
 When you have a cluster, you have lots of machines
to spread the computation over.
 However, you also still need to try to reduce the
amount of data that needs to be transferred across
the network.
 The map-reduce pattern is a way to organize processing so as to take advantage of multiple machines on a cluster while keeping as much of the processing and the data it needs together on the same machine.
 This programming model gained prominence with
Google’s MapReduce framework [Dean and
Ghemawat, OSDI-04].
 A widely used open-source implementation is part of
the Apache Hadoop project.
 The name “map-reduce” reveals its inspiration from
the map and reduce operations on collections in
functional programming languages
 Map-Reduce benefits
 Complex details are abstracted away from the developer:
– No file I/O
– No networking code
– No synchronization
 It's scalable because you process one record at a time
 A record consists of a key and a corresponding value
 Example:
 Let us consider the usual scenario of customers and orders.
 We have chosen order as our aggregate, with each order having line items.
 Each line item has a product ID, a quantity, and the price charged.
 Sales analysis people want to see a product and its total revenue for the last seven days.
 In order to get the product revenue report, you’ll
have to visit every machine in the cluster and
examine many records on each machine.
 This is exactly the kind of situation that calls for
map-reduce. Again, the first stage in a map-reduce
job is the map.
 A map is a function whose input is a single aggregate
and whose output is a bunch of key-value pairs.
 In this case, the input would be an order, and the
output would be key-value pairs corresponding to
the line items
 For this example, we are just selecting a value out of
the record, but there’s no reason why we can’t carry
out some arbitrarily complex function as part of the
map—providing it only depends on one aggregate’s
worth of data.
 Each such pair would have the product ID as the key and an embedded
map with the quantity and price as the value
 The reduce function takes multiple map outputs with the same key and
combines their values
 A map function might yield 1000 line items from
orders for “Database Refactoring”; the reduce
function would reduce down to one, with the totals
for the quantity and revenue.
 While the map function is limited to working only on
data from a single aggregate, the reduce function can
use all values emitted for a single key
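A minimal single-process sketch of the map and reduce steps for this product-revenue example; the order and line-item field names are illustrative, and a real framework would run the mappers and reducers in parallel across the cluster.

from collections import defaultdict

# Each input to the map function is one order aggregate.
orders = [
    {"order_id": 1, "line_items": [
        {"product_id": "db-refactoring", "quantity": 2, "price": 30.00},
        {"product_id": "nosql-distilled", "quantity": 1, "price": 25.00}]},
    {"order_id": 2, "line_items": [
        {"product_id": "db-refactoring", "quantity": 1, "price": 30.00}]},
]

def map_order(order):
    """Emit (product_id, {quantity, revenue}) pairs from a single aggregate."""
    for li in order["line_items"]:
        yield li["product_id"], {"quantity": li["quantity"],
                                 "revenue": li["quantity"] * li["price"]}

def reduce_product(product_id, values):
    """Combine all values emitted for one key into a single total."""
    total = {"quantity": 0, "revenue": 0.0}
    for v in values:
        total["quantity"] += v["quantity"]
        total["revenue"] += v["revenue"]
    return product_id, total

# Group the mapper output by key, then reduce each group.
grouped = defaultdict(list)
for order in orders:
    for key, value in map_order(order):
        grouped[key].append(value)

print([reduce_product(k, vs) for k, vs in grouped.items()])
# [('db-refactoring', {'quantity': 3, 'revenue': 90.0}),
#  ('nosql-distilled', {'quantity': 1, 'revenue': 25.0})]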
Partitioning and Combining
 To increase parallelism, we can also partition the output of the
mappers and send each partition to a different reducer
(“shuffling”)
 To take advantage of this, the results of the mapper are divided up based on the key on each processing node.
 Typically, multiple keys are grouped together into partitions.
 The framework then takes the data from all the nodes for one
partition, combines it into a single group for that partition,
and sends it off to a reducer.
 Multiple reducers can then operate on the partitions in
parallel, with the final results merged together.
 (This step is also called “shuffling,” and the partitions are
sometimes referred to as “buckets” or “regions.”)
 The next problem we can deal with is the amount of data
being moved from node to node between the map and
reduce stages.
 Much of this data is repetitive, consisting of multiple key-
value pairs for the same key.
 A combiner function cuts this data down by combining all the data for the same key into a single value. A combiner function is, in essence, a reducer function; indeed, in many cases the same function can be used for combining as the final reduction.
 The reduce function needs a special shape for this to
work: Its output must match its input. We call such a
function a combinable reducer.
 When you have combining reducers, the map-reduce
framework can safely run not only in parallel (to reduce
different partitions), but also in series to reduce the same
partition at different times and places.
 In addition to allowing combining to occur on a node before
data transmission, you can also start combining before
mappers have finished.
 This provides a good bit of extra flexibility to the map-reduce
processing. Some map-reduce frameworks require all
reducers to be combining reducers, which maximizes this
flexibility.
 If you need to do a noncombining reducer with one of these
frameworks, you’ll need to separate the processing into
pipelined map-reduce steps.
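Continuing the earlier sketch, a combinable reducer's output has the same shape as its input, so it can be applied on each node before transmission and then again for the final reduction; the quantity/revenue shape is carried over from the product-revenue example above.

def combine(values):
    """Combinable reducer: input and output have the same shape, so it can be
    applied on each node before transmission and again at the final reduce."""
    total = {"quantity": 0, "revenue": 0.0}
    for v in values:
        total["quantity"] += v["quantity"]
        total["revenue"] += v["revenue"]
    return total

# Partial totals computed on two different nodes for the same key...
node_a = combine([{"quantity": 2, "revenue": 60.0}, {"quantity": 1, "revenue": 30.0}])
node_b = combine([{"quantity": 1, "revenue": 30.0}])

# ...can be combined again because the output shape matches the input shape.
print(combine([node_a, node_b]))  # {'quantity': 4, 'revenue': 120.0}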

More Related Content

Similar to 2.Introduction to NOSQL (Core concepts).pptx

Vskills Apache Cassandra sample material
Vskills Apache Cassandra sample materialVskills Apache Cassandra sample material
Vskills Apache Cassandra sample materialVskills
 
NOSQL Databases types and Uses
NOSQL Databases types and UsesNOSQL Databases types and Uses
NOSQL Databases types and UsesSuvradeep Rudra
 
Unit II -BIG DATA ANALYTICS.docx
Unit II -BIG DATA ANALYTICS.docxUnit II -BIG DATA ANALYTICS.docx
Unit II -BIG DATA ANALYTICS.docxvvpadhu
 
Non relational databases-no sql
Non relational databases-no sqlNon relational databases-no sql
Non relational databases-no sqlRam kumar
 
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMINGEVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMINGijiert bestjournal
 
data base system to new data science lerne
data base system to new data science lernedata base system to new data science lerne
data base system to new data science lernetarunprajapati0t
 
NoSQL_Databases
NoSQL_DatabasesNoSQL_Databases
NoSQL_DatabasesRick Perry
 
Nosql Presentation.pdf for DBMS understanding
Nosql Presentation.pdf for DBMS understandingNosql Presentation.pdf for DBMS understanding
Nosql Presentation.pdf for DBMS understandingHUSNAINAHMAD39
 

Similar to 2.Introduction to NOSQL (Core concepts).pptx (20)

Vskills Apache Cassandra sample material
Vskills Apache Cassandra sample materialVskills Apache Cassandra sample material
Vskills Apache Cassandra sample material
 
No sq lv2
No sq lv2No sq lv2
No sq lv2
 
NOSQL Databases types and Uses
NOSQL Databases types and UsesNOSQL Databases types and Uses
NOSQL Databases types and Uses
 
No sql
No sqlNo sql
No sql
 
Unit II -BIG DATA ANALYTICS.docx
Unit II -BIG DATA ANALYTICS.docxUnit II -BIG DATA ANALYTICS.docx
Unit II -BIG DATA ANALYTICS.docx
 
Non relational databases-no sql
Non relational databases-no sqlNon relational databases-no sql
Non relational databases-no sql
 
NoSQL Basics and MongDB
NoSQL Basics and  MongDBNoSQL Basics and  MongDB
NoSQL Basics and MongDB
 
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMINGEVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING
 
the rising no sql technology
the rising no sql technologythe rising no sql technology
the rising no sql technology
 
No sql database
No sql databaseNo sql database
No sql database
 
Report 1.0.docx
Report 1.0.docxReport 1.0.docx
Report 1.0.docx
 
Unit 3 MongDB
Unit 3 MongDBUnit 3 MongDB
Unit 3 MongDB
 
data base system to new data science lerne
data base system to new data science lernedata base system to new data science lerne
data base system to new data science lerne
 
NoSQL_Databases
NoSQL_DatabasesNoSQL_Databases
NoSQL_Databases
 
Nosql Presentation.pdf for DBMS understanding
Nosql Presentation.pdf for DBMS understandingNosql Presentation.pdf for DBMS understanding
Nosql Presentation.pdf for DBMS understanding
 
Nosql
NosqlNosql
Nosql
 
Nosql
NosqlNosql
Nosql
 
All About Database v1.1
All About Database  v1.1All About Database  v1.1
All About Database v1.1
 
nosql.pptx
nosql.pptxnosql.pptx
nosql.pptx
 
NoSql
NoSqlNoSql
NoSql
 

More from RushikeshChikane2

3.Implementation with NOSQL databases Document Databases (Mongodb).pptx
3.Implementation with NOSQL databases Document Databases (Mongodb).pptx3.Implementation with NOSQL databases Document Databases (Mongodb).pptx
3.Implementation with NOSQL databases Document Databases (Mongodb).pptxRushikeshChikane2
 
Chapter 2 System Security.pptx
Chapter 2 System Security.pptxChapter 2 System Security.pptx
Chapter 2 System Security.pptxRushikeshChikane2
 
Security Architectures and Models.pptx
Security Architectures and Models.pptxSecurity Architectures and Models.pptx
Security Architectures and Models.pptxRushikeshChikane2
 
Social Media and Text Analytics
Social Media and Text AnalyticsSocial Media and Text Analytics
Social Media and Text AnalyticsRushikeshChikane2
 
Mining Frequent Patterns, Associations, and.pptx
 Mining Frequent Patterns, Associations, and.pptx Mining Frequent Patterns, Associations, and.pptx
Mining Frequent Patterns, Associations, and.pptxRushikeshChikane2
 
Machine Learning Overview.pptx
Machine Learning Overview.pptxMachine Learning Overview.pptx
Machine Learning Overview.pptxRushikeshChikane2
 
Chapter 4_Introduction to Patterns.ppt
Chapter 4_Introduction to Patterns.pptChapter 4_Introduction to Patterns.ppt
Chapter 4_Introduction to Patterns.pptRushikeshChikane2
 
Chapter 3_Architectural Styles.pptx
Chapter 3_Architectural Styles.pptxChapter 3_Architectural Styles.pptx
Chapter 3_Architectural Styles.pptxRushikeshChikane2
 
Chapter 2_Software Architecture.ppt
Chapter 2_Software Architecture.pptChapter 2_Software Architecture.ppt
Chapter 2_Software Architecture.pptRushikeshChikane2
 
Chapter 1_UML Introduction.ppt
Chapter 1_UML Introduction.pptChapter 1_UML Introduction.ppt
Chapter 1_UML Introduction.pptRushikeshChikane2
 

More from RushikeshChikane2 (10)

3.Implementation with NOSQL databases Document Databases (Mongodb).pptx
3.Implementation with NOSQL databases Document Databases (Mongodb).pptx3.Implementation with NOSQL databases Document Databases (Mongodb).pptx
3.Implementation with NOSQL databases Document Databases (Mongodb).pptx
 
Chapter 2 System Security.pptx
Chapter 2 System Security.pptxChapter 2 System Security.pptx
Chapter 2 System Security.pptx
 
Security Architectures and Models.pptx
Security Architectures and Models.pptxSecurity Architectures and Models.pptx
Security Architectures and Models.pptx
 
Social Media and Text Analytics
Social Media and Text AnalyticsSocial Media and Text Analytics
Social Media and Text Analytics
 
Mining Frequent Patterns, Associations, and.pptx
 Mining Frequent Patterns, Associations, and.pptx Mining Frequent Patterns, Associations, and.pptx
Mining Frequent Patterns, Associations, and.pptx
 
Machine Learning Overview.pptx
Machine Learning Overview.pptxMachine Learning Overview.pptx
Machine Learning Overview.pptx
 
Chapter 4_Introduction to Patterns.ppt
Chapter 4_Introduction to Patterns.pptChapter 4_Introduction to Patterns.ppt
Chapter 4_Introduction to Patterns.ppt
 
Chapter 3_Architectural Styles.pptx
Chapter 3_Architectural Styles.pptxChapter 3_Architectural Styles.pptx
Chapter 3_Architectural Styles.pptx
 
Chapter 2_Software Architecture.ppt
Chapter 2_Software Architecture.pptChapter 2_Software Architecture.ppt
Chapter 2_Software Architecture.ppt
 
Chapter 1_UML Introduction.ppt
Chapter 1_UML Introduction.pptChapter 1_UML Introduction.ppt
Chapter 1_UML Introduction.ppt
 

Recently uploaded

Design and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data ScienceDesign and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data SciencePaolo Missier
 
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....rightmanforbloodline
 
Simplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptxSimplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptxMarkSteadman7
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...TrustArc
 
Navigating Identity and Access Management in the Modern Enterprise
Navigating Identity and Access Management in the Modern EnterpriseNavigating Identity and Access Management in the Modern Enterprise
Navigating Identity and Access Management in the Modern EnterpriseWSO2
 
Modernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using BallerinaModernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using BallerinaWSO2
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxRemote DBA Services
 
Decarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational PerformanceDecarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational PerformanceIES VE
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
The Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightThe Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightSafe Software
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
ChatGPT and Beyond - Elevating DevOps Productivity
ChatGPT and Beyond - Elevating DevOps ProductivityChatGPT and Beyond - Elevating DevOps Productivity
ChatGPT and Beyond - Elevating DevOps ProductivityVictorSzoltysek
 
How to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cfHow to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cfdanishmna97
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 

Recently uploaded (20)

Design and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data ScienceDesign and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data Science
 
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
 
Simplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptxSimplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptx
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...
 
Navigating Identity and Access Management in the Modern Enterprise
Navigating Identity and Access Management in the Modern EnterpriseNavigating Identity and Access Management in the Modern Enterprise
Navigating Identity and Access Management in the Modern Enterprise
 
Modernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using BallerinaModernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using Ballerina
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Decarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational PerformanceDecarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational Performance
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
The Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightThe Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and Insight
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
ChatGPT and Beyond - Elevating DevOps Productivity
ChatGPT and Beyond - Elevating DevOps ProductivityChatGPT and Beyond - Elevating DevOps Productivity
ChatGPT and Beyond - Elevating DevOps Productivity
 
How to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cfHow to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cf
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 

2.Introduction to NOSQL (Core concepts).pptx

  • 1. I N T R O D U C T I O N T O N O S Q L Chapter 2
  • 2. What is a NoSQL Database?  NoSQL is a type of database management system (DBMS) that is designed to handle and store large volumes of unstructured and semi-structured data.  Unlike traditional relational databases that use tables with pre-defined schemas to store data, NoSQL databases use flexible data models that can adapt to changes in data structures and are capable of scaling horizontally to handle growing amounts of data.
  • 3.  A NoSQL Database, also known as a non SQL or non-relational Database is a non-tabular Database that stores data differently than the tabular relations used in relational databases.  Companies widely used NoSQL Database generally for big Data and real-time web applications.
  • 4.  NoSQL Databases offer a simple design, horizontal scaling for clustering machines, and limit the object-relational impedance mismatch.  It uses different data structures from those used by relational Databases making some operations faster.  NoSQL Databases are designed to be flexible, scalable, and capable of rapidly responding to the data management demands of modern businesses.
  • 5. Why NoSQL  Dynamic schema: NoSQL databases do not have a fixed schema and can accommodate changing data structures without the need for migrations or schema alterations.  Horizontal scalability: NoSQL databases are designed to scale out by adding more nodes to a database cluster, making them well- suited for handling large amounts of data and high levels of traffic.  Document-based: Some NoSQL databases, such as MongoDB, use a document-based data model, where data is stored in a scalessemi-structured format, such as JSON or BSON.  Key-value-based: Other NoSQL databases, such as Redis, use a key-value data model, where data is stored as a collection of key- value pairs.  Column-based: Some NoSQL databases, such as Cassandra, use a column-based data model, where data is organized into columns instead of rows.
  • 6.  Distributed and high availability: NoSQL databases are often designed to be highly available and to automatically handle node failures and data replication across multiple nodes in a database cluster.  Flexibility: NoSQL databases allow developers to store and retrieve data in a flexible and dynamic manner, with support for multiple data types and changing data structures.  Performance: NoSQL databases are optimized for high performance and can handle a high volume of reads and writes, making them suitable for big data and real-time applications.
  • 7. Aggregate Data Models  Aggregate means a collection of objects that are treated as a unit. In NoSQL Databases, an aggregate is a collection of data that interact as a unit. Moreover, these units of data or aggregates of data form the boundaries for the ACID operations.  Aggregate Data Models in NoSQL make it easier for the Databases to manage data storage over the clusters as the aggregate data or unit can now reside on any of the machines. Whenever data is retrieved from the Database all the data comes along with the Aggregate Data Models in NoSQL.  Aggregate Data Models in NoSQL don’t support ACID transactions and sacrifice one of the ACID properties. With the help of Aggregate Data Models in NoSQL, you can easily perform OLAP operations on the Database
  • 8. Types of Aggregate Data Models Key-Value Model Document Model Column Family Model Graph-Based Model Types
  • 9. Key-Value Model  The Key-Value Data Model contains the key or an ID used to access or fetch the data of the aggregates corresponding to the key.  In this Aggregate Data Models in NoSQL, the data of the aggregates are secure and encrypted and can be decrypted with a Key
  • 10.  They are a simpler type of database where each item contains keys and values.  A value can typically only be retrieved by referencing its key, so learning how to query for a specific key- value pair is typically simple.  Key-value databases are great for use cases where you need to store large amounts of data but you don’t need to perform complex queries to retrieve it.  Common use cases include storing user preferences or caching. Redis and DynanoDB are popular key- value databases.
  • 11.  examples of key-value databases:  Couchbase: It permits SQL-style querying and searching for text.  Amazon DynamoDB: The key-value database which is mostly used is Amazon DynamoDB as it is a trusted database used by a large number of users. It can easily handle a large number of requests every day and it also provides various security options.  Riak: It is the database used to develop applications.  Aerospike: It is an open-source and real-time database working with billions of exchanges.  Berkeley DB: It is a high-performance and open-source database providing scalability.
  • 12. Document Model  It store data in documents similar to JSON (JavaScript Object Notation) objects.  Each document contains pairs of fields and values.  The values can typically be a variety of types including things like strings, numbers, booleans, arrays, or objects, and their structures typically align with objects developers are working with in code.  Because of their variety of field value types and powerful query languages, document databases are great for a wide variety of use cases and can be used as a general purpose database.  They can horizontally scale-out to accommodate large data volumes.  MongoDB is consistently ranked as the world’s most popular NoSQL database according to DB-engines and is an example of a document database.
  • 13.
  • 14.  Examples of Document Data Models :  Amazon DocumentDB  MongoDB  Cosmos DB  ArangoDB  Couchbase Server  CouchDB
  • 15. Column Family Model  It store data in tables, rows, and dynamic columns.  Wide-column stores provide a lot of flexibility over relational databases because each row is not required to have the same columns.  Many consider wide-column stores to be two- dimensional key-value databases.  Wide-column stores are great for when you need to store large amounts of data and you can predict what your query patterns will be.  Wide-column stores are commonly used for storing Internet of Things data and user profile data.  Cassandra and HBase are two of the most popular wide- column stores.
  • 16.  the first level of the Column family contains the keys that act as a row identifier that is used to select the aggregate data. Whereas the second level values are referred to as columns
  • 17.
  • 18.
  • 19.
  • 20. Graph-Based Model  Graph or network data models consider the relationship between two pieces of information to be as meaningful as the information itself.  As such, this data model is really made for any information you would typically represent in a chart.  It uses relationships and nodes, where the data is the information itself, and the connection is created between the nodes.
  • 21.  It store data in nodes and edges.  Nodes typically store information about people, places, and things while edges store information about the relationships between the nodes.  Graph databases excel in use cases where you need to traverse relationships to look for patterns such as social networks, fraud detection, and recommendation engines.  Neo4j and JanusGraph are examples of graph databases.
  • 22.
  • 23.
  • 24.  Graph-based Data Models are used in social networking sites to store interconnections.  It is used in fraud detection systems.  This Data Model is also widely used in Networks and IT operations. 
  • 26. Distribution Models  The primary driver of interest in NoSQL has been its ability to run databases on a large cluster.  As data volumes increase, it becomes more difficult and expensive to scale up—buy a bigger server to run the database on.  A more appealing option is to scale out—run the database on a cluster of servers.  Aggregate orientation fits well with scaling out because the aggregate is a natural unit to use for distribution.
  • 28. Single Server  NO distribution at all  Run the database on a single machine that handles all the reads and writes to the data store  it eliminates all the complexities  it’s easy for operations people to manage and easy for application developers to reason about
  • 29. Single Server NO distribution at all Run the database on a single machine that handles all the reads and writes to the data store it eliminates all the complexities it’s easy for operations people to manage and easy for application developers Graph databases
  • 30.  Replication  Replication takes the same data and copies it over multiple nodes.  Two Types:  master-slave  peer-to-peer
  • 31. Master-Slave Replication  Replicate data across multiple nodes.  One node is designated as the master, or primary.  This master is the authoritative source for the data and is usually responsible for processing any updates to that data.  The other nodes are slaves, or secondaries.  A replication process synchronizes the slaves with the master
  • 32.
  • 33.  Advantages  read-intensive dataset.  You can scale horizontally to handle more read requests by adding more slave nodes and ensuring that all read requests are routed to the slaves  read resilience  Should the master fail, the slaves can still handle read requests. Again, this is useful if most of your data access is reads. The failure of the master does eliminate the ability to handle writes until either the master is restored or a new master is appointed.
  • 34.  appoint a slave  replace a failed master means that master-slave replication is useful even if you don’t need to scale out. All read and write traffic can go to the master while the slave acts as a hot backup.  Masters can be appointed manually or automatically  Manual appointing typically means that when you configure your cluster, you configure one node as the master.  With automatic appointment, you create a cluster of nodes and they elect one of themselves to be the master.
  • 35.  Disadvantages:  Inconsistency  You have the danger that different clients, reading different slaves, will see different values because the changes haven’t all propagated to the slaves.  if the master fails, any updates not passed on to the backup are lost.
  • 36. Peer-to-Peer Replication  Peer-to-peer replication attacks these problems by not having a master.  All the replicas have equal weight, they can all accept writes, and the loss of any of them doesn’t prevent access to the data store.  With a peer-to-peer replication cluster, you can ride over node failures without losing access to data.  Furthermore, you can easily add nodes to improve your performance.
  • 37.
  • 38.  Problem: Inconsistency:  When you can write to two different places, you run the risk that two people will attempt to update the same record at the same time—a write-write conflict.  Inconsistencies on read lead to problems but at least they are relatively transient.  Inconsistent writes are forever.
  • 39.  Solutions  1. We can ensure that whenever we write data, the replicas coordinate to avoid a conflict.  We don't need all the replicas to agree on the write, just a majority, so we can still survive losing a minority of the replica nodes.  2. We can decide to cope with an inconsistent write.  In some contexts we can come up with a policy to merge inconsistent writes; in that case we get the full performance benefit of writing to any replica.
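A small sketch of the majority-coordination idea from option 1, using a toy Replica class (hypothetical, in-memory); a real peer-to-peer store would coordinate over the network.

```python
class Replica:
    """Toy replica: stores data locally; a real one would be a remote node."""
    def __init__(self, up=True):
        self.up, self.data = up, {}

    def apply(self, key, value):
        if not self.up:
            raise ConnectionError("replica unreachable")
        self.data[key] = value

def quorum_write(replicas, key, value):
    """Attempt the write on every replica; succeed only if a majority acknowledge it."""
    acks = 0
    for replica in replicas:
        try:
            replica.apply(key, value)
            acks += 1
        except ConnectionError:
            pass                     # no acknowledgment from this replica
    majority = len(replicas) // 2 + 1
    return acks >= majority          # the write survives losing a minority of nodes

replicas = [Replica(), Replica(), Replica(up=False)]   # one of three nodes is down
print(quorum_write(replicas, "room:101", "booked"))    # True: 2 of 3 acks is a majority
```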
  • 40. Sharding  Often, a busy data store is busy because different people are accessing different parts of the dataset.  In these circumstances we can support horizontal scalability by putting different parts of the data onto different servers, a technique called sharding.
  • 41.
  • 42.  In the ideal case, we have different users all talking to different server nodes. Each user only has to talk to one server, so gets rapid responses from that server.  The load is balanced out nicely between servers  for example, if we have ten servers, each one only has to handle 10% of the load.
  • 43.  To get close to this ideal, we have to ensure that data that's accessed together is clumped together on the same node and that these clumps are arranged across the nodes to provide the best data access.  Aggregates are designed to combine data that's commonly accessed together, so aggregates leap out as the obvious unit of distribution.
  • 44.  If most accesses of certain aggregates are based on a physical location, you can place the data close to where it's being accessed.  Another factor is trying to keep the load even: you should try to arrange aggregates so they are evenly distributed across the nodes and each node gets an equal share of the load.
  • 45.  Many NoSQL databases offer auto-sharding, where the database takes on the responsibility of allocating data to shards and ensuring that data access goes to the right shard.  Sharding can improve both read and write performance; in particular, it provides a way to horizontally scale writes.  On its own, however, sharding does little for resilience: a node failure makes that shard's data unavailable, and it's not good to have a database with part of its data missing.
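One common way auto-sharding can allocate aggregates is to hash the shard key. The sketch below is a simplified hash-mod placement; it ignores rebalancing schemes such as consistent hashing, and the key names are made up.

```python
import hashlib

def shard_for(key, num_shards):
    """Map a shard key to a shard index using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# With ten shards, each server ends up handling roughly a tenth of the keys.
print(shard_for("customer:42", 10))
print(shard_for("customer:43", 10))
```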
  • 46. Combining Sharding and Replication  If we use both master-slave replication and sharding this means that we have multiple masters, but each data item only has a single master.  Depending on your configuration, you may choose a node to be a master for some data and slaves for others, or you may dedicate nodes for master or slave duties.
  • 47. Using master-slave replication together with sharding
  • 48.  Using peer-to-peer replication and sharding is a common strategy for column-family databases.  In a scenario like this you might have tens or hundreds of nodes in a cluster with data sharded over them.  A good starting point for peer-to-peer replication is a replication factor of 3, so each shard is present on three nodes.  Should a node fail, the shards on that node will be rebuilt on the other nodes.
  • 49. Using peer-to-peer replication together with sharding
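A toy sketch of a replication factor of 3: each shard is placed on its "home" node plus the next two in a ring of assumed node names. Real column-family stores use more elaborate token rings, but the idea is the same.

```python
def placement(shard_id, nodes, replication_factor=3):
    """Pick the nodes that hold a shard: the home node plus the next ones in the ring."""
    start = shard_id % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication_factor)]

nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
print(placement(7, nodes))   # every shard lives on three different nodes
```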
  • 50. Consistency  Consistency is the biggest change when moving from a centralized relational database to a cluster-oriented NoSQL system.  Relational databases: strong consistency.  NoSQL systems: mostly eventual consistency.
  • 51.  Update Consistency  Two users update the same data item at the same time, each in a slightly different way; this is called a write-write conflict.  When the writes reach the server, the server serializes them, deciding to apply one and then the other.  Without concurrency control, one update is applied and then immediately overwritten by the other: a lost update.
  • 52.  Read-read (or simply read) conflict:  Different people see different data at the same time  Stale data: out of date  Replication is a source of inconsistency
  • 53.  Read-write conflict  A read in the middle of two logically-related writes
  • 54.  Solutions  Pessimistic approach  Prevent conflicts from occurring.  Usually implemented with write locks managed by the system  Optimistic approach  Lets conflicts occur, but detects them and takes action to sort them out  Approaches (for write-write conflicts):  conditional updates: test the value just before updating  save both updates: record that they are in conflict and then merge them
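A minimal sketch of the conditional-update flavour of the optimistic approach, using a hypothetical in-memory store with per-item version numbers; real databases expose the same idea as compare-and-set operations or an update qualified by the expected version.

```python
class OptimisticStore:
    """In-memory store illustrating a conditional update (compare-and-set)."""
    def __init__(self):
        self.values = {}    # key -> (version, value)

    def read(self, key):
        return self.values.get(key, (0, None))

    def conditional_update(self, key, expected_version, new_value):
        version, _ = self.values.get(key, (0, None))
        if version != expected_version:
            return False            # someone else updated it first: conflict detected
        self.values[key] = (version + 1, new_value)
        return True

store = OptimisticStore()
v, _ = store.read("phone")
assert store.conditional_update("phone", v, "123-456")        # first writer wins
assert not store.conditional_update("phone", v, "123 456")    # second writer must retry
```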
  • 55.  Pessimistic vs. optimistic approaches  Concurrency involves a fundamental tradeoff between consistency (avoiding errors such as update conflicts) and availability (responding quickly to clients).  Pessimistic approaches often severely degrade the responsiveness of a system and can lead to deadlocks, which are hard to prevent and debug.
  • 56.  Forms of consistency  Strong (or immediate) consistency :  ACID transaction  Logical consistency :  No read-write conflicts (atomic transactions)  Sequential consistency :  Updates are serialized  Session (or read-your-writes) consistency  Within a user’s session  Eventual consistency  You may have replication inconsistencies but eventually all nodes will be updated to the same value
  • 57.  Relaxing Consistency  Consistency is a Good Thing, but sadly we sometimes have to sacrifice it.  It is always possible to design a system to avoid inconsistencies, but often impossible to do so without making unbearable sacrifices in other characteristics of the system.  As a result, we often have to trade off consistency for something else.  Trading off consistency is a familiar concept even in single-server relational database systems.
  • 58.  The tool relational databases use to enforce consistency is the transaction, and transactions can provide strong consistency guarantees.  Transaction systems usually come with the ability to relax isolation levels, allowing queries to read data that hasn't been committed yet, and in practice most applications relax consistency down from the highest isolation level (serializable) in order to get acceptable performance.  We most commonly see people using the read-committed isolation level, which eliminates some read-write conflicts but allows others.
  • 59. The CAP Theorem  Proposed by Eric Brewer in 2000 and given a formal  proof by Seth Gilbert and Nancy Lynch [Lynch and Gilbert] a couple of years later.  “Given the three properties of Consistency, Availability, and Partition tolerance, you can only get two.”
  • 60.  CAP states that, in the presence of failures, a shared-data system can have at most two of these three properties.  To scale out, you have to distribute resources, so partition tolerance is not really an option but a necessity.  The real choice is therefore between consistency and availability, and in almost all cases you would choose availability over consistency.
  • 61.  Consistency: all users see the same data at the same time.  Availability: a guarantee that every request receives a response indicating whether it succeeded or failed; it does not, however, guarantee that a read returns the most recent write.  The more users a system can cater to, the better its availability.
  • 62.  Partition tolerance: the system continues to operate despite communication breakages that separate the cluster into partitions unable to communicate with each other.  Of these three guarantees, no system can provide more than two.  Since network partitions are unavoidable in distributed systems, the tradeoff is in practice between consistency and availability.
  • 63. With two breaks in the communication lines, the network partitions into two groups.
  • 64.  Martin and Pramod are both trying to book the last hotel room on a system that uses peer-to-peer distribution with two nodes (London for Martin and Mumbai for Pramod).  If we want to ensure consistency, then when Martin tries to book his room on the London node, that node must communicate with the Mumbai node before confirming the booking.  Essentially, both nodes must agree on the serialization of their requests.  This gives us consistency—but should the network link break, then neither system can book any hotel room, sacrificing availability.
  • 65.  One way to improve availability is to designate one node as the master for a particular hotel and ensure all bookings are processed by that master. Should that master be Mumbai, then Mumbai can still process hotel bookings for that hotel and Pramod will get the last room, whereas Martin can see the inconsistent room information but cannot make a booking (which would in this case cause an update inconsistency). This is a lack of availability for Martin.
  • 66.  To gain more availability, we might allow both systems to keep accepting hotel reservations even if a link in the network breaks down.  This may cause both Martin and Pramod to book the same room, an inconsistency.  In this domain, however, that might be tolerable: the travel company may accept some overbooking, some hotels always keep a few rooms clear even when they are fully booked, and some hotels might even cancel the booking with an apology once they detect the conflict.
  • 67.
  • 68.  CA (Consistency and Availability)  The system provides both consistency and availability but cannot tolerate a network partition, which in practice limits it to a single node or a network assumed never to fail.  Example databases: traditional single-server relational databases such as MySQL or PostgreSQL.  AP (Availability and Partition Tolerance)  The system prioritizes availability over consistency and can respond with possibly stale data.  The system can be distributed across multiple nodes and is designed to keep operating even in the face of network partitions.  Example databases: Cassandra, CouchDB, Riak, Voldemort, Amazon DynamoDB.
  • 69.  CP (Consistency and Partition Tolerance)  The system prioritizes consistency over availability and responds with the latest updated data, refusing requests it cannot serve consistently.  The system can be distributed across multiple nodes and is designed to operate reliably even in the face of network partitions.  Example databases: Apache HBase, MongoDB, Redis, Google Cloud Spanner.
  • 70. Version Stamps  Provide a means of detecting concurrency conflicts.  Each data item has a version stamp that gets incremented each time the item is updated.  Before updating a data item, a process can check its version stamp to see if it has been updated since it was last read.  Implementation methods:  Counter – requires a single master to “own” the counter.  GUID (Globally Unique Identifier) – can be computed by any node, but GUIDs are large and cannot be compared directly.  Hash of the contents of the resource.  Timestamp of last update – node clocks must be synchronized.
  • 71.  Counter – requires a single master to “own” the counter  Always incrementing it when you update the resource.  Counters are useful since they make it easy to tell if one version is more recent than another.  On the other hand, they require the server to generate the counter value, and also need a single master to ensure the counters aren’t duplicated.
  • 72.  For example, suppose R1 to R6 are replicas and R7 is the master (M1).  C is the counter variable and its current value is 3; a is a database item whose value is 1.  All replicas hold the item value a = 1 and the counter value C = 3, as shown in the following figure.
  • 73.  Now replica R3 wants to update the database item to a = 6, so the counter value must be incremented first.  The server on the master side increments the counter by 1, to C = 4, and the new values of the item and the counter are reflected at replica R3 and at the master.  Since C = 4 is now the highest counter value, this update is the most recent one, and the master propagates it to all replicas.  If any other replica later wants to update the item, its counter value must again be updated via the master, which then communicates the change to the other replicas.
  • 74.
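The counter example above can be sketched in code as follows, using toy Master and Replica classes with the same starting values (C = 3, a = 1); a real system would exchange these messages over the network.

```python
class Replica:
    def __init__(self):
        self.value, self.stamp = 1, 3          # a = 1, C = 3 initially

    def apply(self, value, stamp):
        if stamp > self.stamp:                 # a higher counter means a more recent version
            self.value, self.stamp = value, stamp

class Master:
    """The master owns the counter and hands out version stamps for updates."""
    def __init__(self, replicas):
        self.counter, self.value = 3, 1
        self.replicas = replicas

    def update(self, new_value):
        self.counter += 1                      # C becomes 4 for this update
        self.value = new_value
        for r in self.replicas:                # propagate (value, stamp) to every replica
            r.apply(new_value, self.counter)
        return self.counter

replicas = [Replica() for _ in range(6)]       # R1..R6
master = Master(replicas)                      # R7
master.update(6)                               # afterwards a = 6 and C = 4 everywhere
print(replicas[2].value, replicas[2].stamp)    # 6 4
```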
  • 75.  GUID (Globally Unique Identifier) – can be computed by any node, but GUIDs are large and cannot be compared directly.  A GUID is a large random number that can be treated as unique for practical purposes; it is typically generated from a combination of the date, hardware information, and other sources of randomness, so two GUIDs will essentially never be the same.  The disadvantage is that GUIDs are very large and, being random, cannot be compared to determine which version is more recent.
  • 76.  Hashing -Hash the contents of a resource  With a big enough hash key size, a content hash can be globally unique like a GUID and can also be generated by anyone;  Advantage is that they are deterministic—any node will generate the same content hash for same resource data.  However, like GUIDs they can’t be directly compared for recentness, and they can be lengthy
  • 77.  For example, consider a database item a = 1 with replicas R1 to R6 and R7 as the master.  Replica R1 wants to update the database item to the value 43.  With a simple (toy) hash function of mod 10, the new content hashes to 43 % 10 = 3, and any node that hashes this version of the data gets the same stamp.  Matching hashes tell the nodes which replicas hold the same version of the item (here, that R1 holds the new update), although, as noted above, content hashes cannot by themselves be ordered for recentness.
  • 78.
  • 79.  Timestamp of last update – node clocks must be synchronized.  Whenever an update is made, its timestamp is recorded.  Timestamps work much like counters and can be compared for recentness, but unlike counters they can be generated by many replicas (machines) without a single master.  All machine clocks must be kept synchronized with each other; if any replica has a bad clock that drifts out of synchronization, data corruption can result.  Database administrators must also check the granularity of the timestamps, otherwise timestamps can be duplicated; millisecond precision is usually adequate.
  • 80.
  • 81.  Now replica R2 wants to update the database item to a = 4.  R2's timestamp is recorded with millisecond precision and is now TS = 3-25-22 02:31:29,571.  Comparing it with the other replicas' timestamps, 571 > 570 at the millisecond level, so R2 contains the most recent value.
  • 82.
  • 83. Map-Reduce  When you have a cluster, you have lots of machines to spread the computation over.  However, you also still need to reduce the amount of data that has to be transferred across the network.  The map-reduce pattern is a way to organize processing so as to take advantage of multiple machines on a cluster while keeping as much of the processing, and the data it needs, together on the same machine.
  • 84.  This programming model gained prominence with Google’s MapReduce framework [Dean and Ghemawat, OSDI-04].  A widely used open-source implementation is part of the Apache Hadoop project.  The name “map-reduce” reveals its inspiration from the map and reduce operations on collections in functional programming languages
  • 85.  Map Reduce – benefits  Complex details are abstracted away from the  developer – No file I/O – No networking code – No synchronization  It’s scalable because you process one record at a time  A record consists of a key and corresponding value
  • 86.  Example:  Let us consider the usual scenario of customers and orders.  We have chosen the order as our aggregate, with each order having line items; each line item has a product ID, a quantity, and the price charged.  Sales analysis people want to see a product and its total revenue for the last seven days.
  • 87.  In order to get the product revenue report, you’ll have to visit every machine in the cluster and examine many records on each machine.  This is exactly the kind of situation that calls for map-reduce. Again, the first stage in a map-reduce job is the map.  A map is a function whose input is a single aggregate and whose output is a bunch of key-value pairs.  In this case, the input would be an order, and the output would be key-value pairs corresponding to the line items
  • 88.  For this example we are just selecting a value out of the record, but there's no reason why we can't carry out some arbitrarily complex function as part of the map, provided it only depends on one aggregate's worth of data.
  • 89.  Each such pair would have the product ID as the key and an embedded map with the quantity and price as the value
  • 90.  The reduce function takes multiple map outputs with the same key and combines their values
  • 91.
  • 92.  A map function might yield 1000 line items from orders for “Database Refactoring”; the reduce function would reduce down to one, with the totals for the quantity and revenue.  While the map function is limited to working only on data from a single aggregate, the reduce function can use all values emitted for a single key
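The order example can be sketched as follows. The field names (productId, lineItems, and so on) are assumptions, and the in-memory grouping stands in for the shuffle a real framework performs across nodes.

```python
from collections import defaultdict

def map_order(order):
    """Map: one order aggregate in, one (productId, {quantity, revenue}) pair per line item out."""
    for item in order["lineItems"]:
        yield item["productId"], {"quantity": item["quantity"],
                                  "revenue": item["quantity"] * item["price"]}

def reduce_product(product_id, values):
    """Reduce: combine every value emitted for one product into a single total."""
    total = {"quantity": 0, "revenue": 0}
    for v in values:
        total["quantity"] += v["quantity"]
        total["revenue"] += v["revenue"]
    return product_id, total

orders = [
    {"lineItems": [{"productId": "refactoring-db", "quantity": 2, "price": 40}]},
    {"lineItems": [{"productId": "refactoring-db", "quantity": 1, "price": 40},
                   {"productId": "nosql-distilled", "quantity": 3, "price": 25}]},
]

grouped = defaultdict(list)               # the "shuffle": group map output by key
for order in orders:
    for key, value in map_order(order):
        grouped[key].append(value)

print([reduce_product(k, vs) for k, vs in grouped.items()])
```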
  • 93. Partitioning and Combining  To increase parallelism, we can partition the output of the mappers and send each partition to a different reducer.  To take advantage of this, the results of the mapper are divided up based on the key on each processing node.  Typically, multiple keys are grouped together into partitions.  The framework then takes the data from all the nodes for one partition, combines it into a single group for that partition, and sends it off to a reducer.  Multiple reducers can then operate on the partitions in parallel, with the final results merged together.  (This step is also called “shuffling,” and the partitions are sometimes referred to as “buckets” or “regions.”)
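A minimal sketch of the partition assignment: a stable hash of the key decides which reducer a pair is shuffled to, similar in spirit to Hadoop's default hash partitioner.

```python
import zlib

def partition_for(key, num_reducers):
    """Assign a map-output key to a partition; a stable hash ensures every node agrees."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

# Every occurrence of the same key, on every node, lands in the same partition.
print(partition_for("refactoring-db", 4))
print(partition_for("nosql-distilled", 4))
```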
  • 94.  The next problem we can deal with is the amount of data being moved from node to node between the map and reduce stages.  Much of this data is repetitive, consisting of multiple key-value pairs for the same key.  A combiner function cuts this data down by combining all the data for the same key into a single value.  A combiner function is, in essence, a reducer function; indeed, in many cases the same function can be used for combining and for the final reduction.  For this to work, the reduce function needs a special shape: its output must match its input.  We call such a function a combinable reducer.
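A sketch of a combinable reducer for the revenue example: because its output values have the same shape as its input values, the same function can pre-combine data on each node before transmission and also perform the final reduction. The data shapes are reused from the earlier sketch, not taken from any particular framework.

```python
def combine_revenue(values):
    """Combinable reducer: output values have the same shape as input values,
    so it can run on each node before shuffling and again for the final reduce."""
    total = {"quantity": 0, "revenue": 0}
    for v in values:
        total["quantity"] += v["quantity"]
        total["revenue"] += v["revenue"]
    return [total]          # a list of values, same shape as the input list

# Pre-combine on one node, then feed the combined output into the final reduction:
node_local = combine_revenue([{"quantity": 2, "revenue": 80},
                              {"quantity": 1, "revenue": 40}])
final = combine_revenue(node_local + [{"quantity": 3, "revenue": 120}])
print(final)                # [{'quantity': 6, 'revenue': 240}]
```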
  • 95.
  • 96.  When you have combining reducers, the map-reduce framework can safely run not only in parallel (to reduce different partitions), but also in series to reduce the same partition at different times and places.  In addition to allowing combining to occur on a node before data transmission, you can also start combining before mappers have finished.  This provides a good bit of extra flexibility to the map-reduce processing. Some map-reduce frameworks require all reducers to be combining reducers, which maximizes this flexibility.  If you need to do a noncombining reducer with one of these frameworks, you’ll need to separate the processing into pipelined map-reduce steps.