UNIT II - NoSQL Data Management
Topic Name (CO2)
• Introduction to NoSQL
• NoSQL Databases
07/14/2025 3
Topic Objective (CO2)
After completion of this topic, students will be able to understand:
• What is NoSQL?
• NoSQL Databases
What is NoSQL? (CO2)
• NoSQL, often read as "Not Only SQL", is an approach to data management and database design that is useful for very large sets of distributed data.
• A NoSQL database is not a relational database. A relational database model may not be the best solution for every situation.
• The easiest way to think of NoSQL is as a database that does not adhere to the traditional relational database management system (RDBMS) structure.
07/14/2025 Hirdesh Sharma RCA E45 Big Data
Unit: 2
5
• Traditional Relational Databases (RDBMS)
• Use tables with rows and columns.
• Data is organized in a fixed schema (structure).
• Relationships between data are defined using keys (primary
keys, foreign keys).
• Examples: MySQL, PostgreSQL, Oracle, SQL Server.
What is NOSQL?
• NoSQL Databases
• Do not follow this strict table-based, relational structure.
• They are more flexible in how data is stored and organized.
• Can store data in various formats like:
– Documents (e.g., JSON or BSON) — MongoDB
– Key-Value pairs — Redis, DynamoDB
– Wide-column stores — Cassandra, HBase
– Graphs — Neo4j
• No fixed schema required — data can evolve easily without
altering a strict structure.
• Designed for scalability, high performance, and handling large
volumes of diverse data types.
What is NOSQL?
Why Are NoSQL Databases Interesting? / Why should we use NoSQL? / When to use NoSQL?
There are several reasons why people consider using a NoSQL
database:
• Application development productivity
• Large data
• Analytics
Why Are NoSQL Databases Interesting? (CO2)
• Scalability: NoSQL databases are designed to scale; it’s one of the primary reasons that people choose a NoSQL database.
• Massive write performance: This is probably the canonical usage based on Google's influence, which implies key-value access, MapReduce, replication, fault tolerance, consistency issues, and all the rest. For faster writes, in-memory systems can be used.
• Fast key-value access: This is probably the second most cited virtue of NoSQL in the general mindset.
Why Are NoSQL Databases Interesting? (CO2)
• Flexible data model and flexible datatypes: NoSQL products
support a whole range of new data types. We have: column-oriented,
graph, advanced data structures, document-oriented, and key-value.
Complex objects can be easily stored without a lot of mapping.
• Schema migration
• Write availability
• Easier maintainability, administration and operations
• No single point of failure
• Generally available parallel computing
• Programmer ease of use: Accessing your data should be easy. Programmers grok keys, values, JSON, JavaScript stored procedures, HTTP, and so on. NoSQL is for programmers.
• Use the right data model for the right problem: Different data
models are used to solve different problems.
• Distributed systems and cloud computing support: Not everyone
is worried about scale or performance over and above that which
can be achieved by non-NoSQL systems.
Difference between SQL and NoSQL (CO2)
• SQL databases are primarily called relational databases (RDBMS), whereas NoSQL databases are primarily called non-relational or distributed databases.
• SQL databases are table-based, whereas NoSQL databases are document-based, key-value, graph, or wide-column stores.
• SQL databases are scaled by increasing the horsepower of the hardware (vertical scaling). NoSQL databases are scaled by adding database servers to the pool of resources to spread the load (horizontal scaling).
• SQL database examples: MySQL, Oracle, SQLite, PostgreSQL and MS SQL Server. NoSQL database examples: MongoDB, Bigtable, Redis, RavenDB, Cassandra, HBase, Neo4j and CouchDB.
• For the type of data to be stored: SQL databases are not the best fit for hierarchical data storage. NoSQL databases fit hierarchical data better because they store data as nested key-value pairs, similar to JSON. NoSQL databases are highly preferred for large data sets (i.e., big data).
• For properties: SQL databases emphasize the ACID properties (Atomicity, Consistency, Isolation and Durability), whereas NoSQL databases are usually described in terms of Brewer's CAP theorem (Consistency, Availability and Partition tolerance).
• For DB types: At a high level, we can classify SQL databases as either open-source or closed-source from commercial vendors. NoSQL databases can be classified by the way they store data: graph databases, key-value stores, document stores, column stores and XML databases.
Type of NoSQL Database (CO2)
There are four general types of NoSQL databases, each with their
own specific attributes:
1. Key-Value storage
2. Document Databases
3. Column Storage
4. Graph Storage
• Key-Value storage: This is the first category of NoSQL database. Key-value stores have a simple data model, which allows clients to put values into a map/dictionary and request a value per key. In key-value storage, each key has to be unique to provide unambiguous identification of values. For example:
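A minimal sketch of the key-value model, assuming an in-memory Python dictionary as the backing map. The `KeyValueStore` class and its `put`/`get` method names are illustrative assumptions, not the API of any particular product.

```python
class KeyValueStore:
    """Toy in-memory key-value store: unique keys, opaque values."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Writing to an existing key overwrites the old value,
        # so each key always identifies exactly one value.
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)


store = KeyValueStore()
store.put("user:1:name", "Martin")
store.put("user:1:city", "Chicago")
store.put("user:1:city", "Boston")  # same key: the value is replaced
print(store.get("user:1:city"))
```

The store never inspects the values; it only maps a unique key to whatever was stored under it.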
• Document databases: A document database stores documents, commonly in JSON format. Documents with completely different sets of attributes can be stored together. This suits highly unstructured data stored as named value pairs, and applications that look at user behavior, actions, and logs in real time.
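As a rough illustration of the document model, the sketch below keeps JSON-like documents with different attribute sets in one collection and queries them by field. The `events` data and the `find` helper are hypothetical; real document databases provide their own query interfaces.

```python
# Documents in one collection need not share the same attributes.
events = [
    {"_id": 1, "user": "sana123", "action": "login"},
    {"_id": 2, "user": "sana123", "action": "purchase", "amount": 32.45},
    {"_id": 3, "user": "martin", "action": "view", "page": "/home"},
]


def find(collection, **criteria):
    """Return the documents whose fields match every given criterion."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


purchases = find(events, user="sana123", action="purchase")
print(purchases)
```

Because the store can see inside each document, it can answer queries on individual fields, which a pure key-value store cannot.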
• Column storage: Columnar databases are almost like tabular databases. Keys in wide-column stores can have many dimensions, resulting in a structure similar to a multi-dimensional associative array. For instance, data can be stored in a wide-column system using a two-dimensional key.
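A two-dimensional key can be sketched as a nested associative array, where a row key and a column name together address one value. The `put` and `get` helpers and the sample keys are assumptions made for illustration only.

```python
from collections import defaultdict

# Outer key: row key. Inner key: column name.
# Together they form the two-dimensional key (row_key, column) -> value.
table = defaultdict(dict)


def put(row_key, column, value):
    table[row_key][column] = value


def get(row_key, column):
    # Rows may have different columns, so a missing column yields None.
    return table.get(row_key, {}).get(column)


put("customer:1", "name", "Martin")
put("customer:1", "city", "Chicago")
put("customer:2", "name", "Pramod")  # this row has no "city" column

print(get("customer:1", "city"))
print(get("customer:2", "city"))
```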
• Graph storage: Graph databases are best suited for representing data with a high, yet flexible, number of interconnections, especially when information about those interconnections is at least as important as the represented data. In graph databases, data is stored in graph-like structures so that connected data can be accessed easily. Graph databases are commonly used on social networking sites, as shown in the figure below.
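To make the graph model concrete, here is a small sketch that stores labelled relationships in an adjacency list and follows them two hops. The names, relationship labels, and the `relate`/`neighbours` helpers are hypothetical.

```python
from collections import defaultdict

# node -> list of (relationship, other node)
edges = defaultdict(list)


def relate(a, relationship, b):
    edges[a].append((relationship, b))


def neighbours(node, relationship):
    return [b for rel, b in edges.get(node, []) if rel == relationship]


relate("Anna", "FRIEND", "Barbara")
relate("Barbara", "FRIEND", "Carol")
relate("Anna", "LIKES", "NoSQL Distilled")

# Traverse two FRIEND hops starting from Anna.
friends_of_friends = [
    f2
    for f1 in neighbours("Anna", "FRIEND")
    for f2 in neighbours(f1, "FRIEND")
]
print(friends_of_friends)
```

The relationship label is stored with each edge, which is why queries about the connections themselves stay simple.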
NoSQL Database Examples (CO1)
Advantages
– Data persistence
– Concurrency – ACID, transactions, etc.
– Integration across multiple applications
– Standard Model – tables and SQL
Disadvantages
– Impedance mismatch
– Integration databases vs. application databases
– Not designed for clustering
Pros and Cons of Relational Databases (CO2)
CONS
Impedance Mismatch
In traditional relational databases, data is stored in tables with rows and
columns, while in most programming languages (especially object-oriented
ones), data is represented as objects.
Why it's a disadvantage:
– There’s a "mismatch" between how data is structured in the database (tables) and
how it's structured in the application (objects).
– This often requires extra work to map database tables to programming objects,
called Object-Relational Mapping (ORM).
– NoSQL tries to solve this by storing data in formats like JSON documents that
are closer to application objects, but sometimes this mismatch still exists,
especially when mixing NoSQL with relational data or legacy systems.
Integration Databases vs. Application Databases
– Some databases are designed as integration points to bring data together
from multiple sources (integration databases).
– Others are designed primarily to support a specific application's data
needs (application databases).
Why it's a disadvantage:
– NoSQL databases are usually application-specific, optimized for a
particular app’s data access patterns.
– This can make it hard to integrate data from multiple systems or do
complex queries spanning different data sources, unlike relational
databases which excel at integration.
– So if your use case requires combining or analyzing data from many
systems, NoSQL might be less ideal.
Not Designed for Clustering (in some cases)
– Clustering means running multiple database servers together to
work as a single system, providing high availability and scalability.
– Some early or simpler NoSQL databases were not built to support
clustering or distributed operation well.
Why it's a disadvantage:
– Without good clustering, databases can become a single point of
failure or struggle with scaling as data grows.
– This means they might not handle big loads or uptime requirements
as well as relational databases with mature clustering features.
– However, many modern NoSQL systems do support clustering and
distributed data, but it depends on the specific database.
Some common characteristics of NoSQL include:
• Does not use the relational model (mostly)
• Generally open source projects (currently)
• Driven by the need to run on clusters
• Built for the need to run 21st century web properties
• Schema-less
• Polyglot persistence
• Auto-sharding
Characteristics of NoSQL (CO2)
• The point of view of using different data stores in different
circumstances is known as Polyglot Persistence.
• Polyglot persistence is commonly used to define this hybrid
approach.
• The definition of polyglot is “someone who speaks or writes several
languages.” The term polyglot is redefined for big data as a set of
applications that use several core database technologies.
Polyglot Persistence (CO2)
• What license is Hadoop distributed under?
a) Apache License 2.0
b) Mozilla Public License
c) Shareware
d) Commercial
• Which of the following genres does Hadoop produce?
a) Distributed file system
b) JAX-RS
c) Java Message Service
d) Relational Database Management System
Daily Quiz
Recap
• NoSQL database, also called Not Only SQL, is an approach to data
management and database design that's useful for very large sets of
distributed data.
Topic Name (CO2)
• NoSQL Data Models
Topic Objective (CO2)
After completion of this topic, students will be able to understand:
• NoSQL Data Models
• Aggregate Data Models
• NoSQL databases have a very different model. For example, a
document-oriented NoSQL database takes the data you want to store
and aggregates it into documents using the JSON format.
• Each JSON document can be thought of as an object to be used by
your application.
• A JSON document might, for example, take all the data stored in a
row that spans 20 tables of a relational database and aggregate it into
a single document/object.
NoSQL Data Model (CO2)
• Another major difference is that relational technologies have rigid schemas while NoSQL models are schemaless.
• Rigid schemas are the exact opposite of the behavior desired in the Big Data era, where application developers need to constantly, and rapidly, incorporate new types of data to enrich their apps.
Aggregate Data Model in NoSQL (CO2)
• Data Model: A data model is the model through which we perceive
and manipulate our data.
• Relational Data Model: The relational model takes the information
that we want to store and divides it into tuples.
• Aggregate Model: Aggregate is a term that comes from Domain-Driven Design. An aggregate is a collection of related objects that we wish to treat as a unit.
• Atomicity holds within a single aggregate.
• Communication with data storage happens in units of aggregates.
Example of Relations and Aggregates
• Let’s assume we have to build an e-commerce website; we are going to be selling items directly to customers over the web.
• We can use this scenario to model the data using a relational data store as well as NoSQL data stores and talk about their pros and cons.
• The following figure presents some sample data for this model.
• In relational, everything is properly normalized. We also have
referential integrity. A realistic order system would naturally be
more involved than this.
Again, we have some sample data, which we’ll show in JSON
format as that’s a common representation for data in NoSQL.
// in customers
{
  "id": 1,
  "name": "Martin",
  "billingAddress": [{"city": "Chicago"}]
}
// in orders
{
  "id": 99,
  "customerId": 1,
  "orderItems": [
    {
      "productId": 27,
      "price": 32.45,
      "productName": "NoSQL Distilled"
    }
  ],
  "shippingAddress": [{"city": "Chicago"}],
  "orderPayment": [
    {
      "ccinfo": "1000-1000-1000-1000",
      "txnId": "abelif879rft",
      "billingAddress": {"city": "Chicago"}
    }
  ]
}
Aggregate Data Model in NoSQL (CO2)
• In this model, we have two main aggregates: customer and order.
We’ve used the black-diamond composition marker in UML to show
how data fits into the aggregation structure.
• The customer contains a list of billing addresses; the order contains
a list of order items, a shipping address, and payments.
• The payment itself contains a billing address for that payment.
Aggregate Oriented Databases (CO2)
Aggregate-oriented databases work best when most data interaction is done with the same aggregate.
Key-value databases
– Store data that is opaque to the database
– The database cannot see the structure of records
– The application needs to deal with this
– Allow flexibility regarding what is stored (e.g., text or binary data)
Document databases
– Store data whose structure is visible to the database
– Impose limitations on what can be stored
– Allow more flexible access to data (e.g., partial records) via querying
Both key-value and document databases consist of aggregate records
accessed by ID values
Column-family databases
– Two levels of access to aggregates (and hence, two parts to the “key” to access an aggregate’s data)
– ID is used to look up aggregate record
– Column name – either a label for a value (name) or a key to a list
entry (order id)
– Columns are grouped into column families
Aggregate Oriented Databases (CO2)
Relational Vs Aggregate Data Models (CO2)
Schemaless Databases (CO2)
• A common theme across all the forms of NoSQL databases is that
they are schemaless.
• When you want to store data in a relational database, you first have
to define a schema—a defined structure for the database which says
what tables exist, which columns exist, and what data types each
column can hold.
• Before you store some data, you have to have the schema defined
for it.
Why schemaless?
– A schemaless store also makes it easier to deal with nonuniform
data
– When starting a new development project you don't need to
spend the same amount of time on up-front design of the
schema.
– No need to learn SQL or database specific stuff and tools.
– With the rigid schema of a relational database (RDBMS), it can be harder to push data into the database because the data has to fit the schema exactly.
Pros:
– More freedom and flexibility
– you can easily change your data organization
– you can deal with non uniform data
Cons:
– A program that accesses data almost always relies on some form of implicit schema: it assumes that certain fields are present and carry data with a certain meaning
– The implicit schema is shifted into the application code that
accesses data
Example:
You store user data in a schemaless database like MongoDB:
json
// In the database
{ "username": "sana123", "email": "sana@example.com", "age": 25 }
In your application code:
python
# email_user is assumed to be defined elsewhere in the application.
def send_email(user):
    if "@example.com" in user["email"]:
        email_user(user["email"])
This code assumes:
• user["email"] exists
• It’s a string
• It contains an email
If email is missing or of the wrong type, the code may crash.
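One way to cope with the implicit schema is to check it explicitly in application code. The sketch below is a defensive variant of the function above; `email_user` is a hypothetical stand-in for a real mail-sending function, and returning `None` for bad records is just one possible policy.

```python
def email_user(address):
    # Hypothetical stand-in for a real mail-sending function.
    return f"sent to {address}"


def send_email(user):
    """Validate the implicit schema before using the field."""
    email = user.get("email")
    if not isinstance(email, str) or "@" not in email:
        return None  # missing or malformed: skip instead of crashing
    if "@example.com" in email:
        return email_user(email)
    return None


print(send_email({"username": "sana123", "email": "sana@example.com", "age": 25}))
print(send_email({"username": "no_email_user"}))
```

The checks that a relational schema would enforce once, in the database, now live in every piece of code that reads the data.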
Multiple servers:
– In NoSQL systems, data is distributed over large clusters.
Single server:
– simplest model, everything on one machine. Run the database on
a single machine that handles all the reads and writes to the data
store.
Distribution Models (CO2)
Sharding:
• Database sharding is horizontal partitioning of data. Often, different people access different parts of the dataset.
• In these circumstances we can support horizontal scalability by putting different parts of the data onto different servers, a technique that’s called sharding.
Orthogonal aspects of data distribution models (CO2)
Sharding (CO2)
• Different parts of the data are placed onto different servers
– Horizontal scalability
– Ideal case: different users all talking to different server nodes
– Data accessed together is kept on the same node: the aggregate is the unit of distribution
• Pros: it can improve both reads and writes
• Cons: clusters use less reliable machines, so resilience decreases
• Many NoSQL databases offer auto-sharding
– the database takes on the responsibility of sharding
Sharding (CO2)
Improving performance:
Main rules of sharding:
• Place the data close to where it’s accessed
– Orders for Boston: data in your eastern US data center
– If users in Boston access their data from a server in California, it will be slower.
– Storing Boston user data in a data center in the eastern US improves performance.
• Try to keep the load even
– All nodes should get equal amounts of the load
• Put together aggregates that may be read in sequence
– Same order, same node
– If one user’s order is spread across multiple shards, reading or updating it becomes expensive.
– Placing all items for a single order on the same shard makes reads/writes faster and simpler.
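The "same order, same node" rule can be sketched with hash-based routing on the aggregate key. This is a simplified illustration, not how any specific database shards: the `SHARDS` names and both helper functions are assumptions, and plain hashing tends to spread load evenly but ignores the geographic placement rule.

```python
import hashlib

SHARDS = ["node-a", "node-b", "node-c"]  # hypothetical server names


def shard_for(aggregate_key):
    """Map an aggregate key to one shard using a stable hash."""
    digest = hashlib.sha256(aggregate_key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]


def shard_for_order_item(order_id, item_no):
    # Route by order ID only (item_no is deliberately ignored), so
    # every item of one order lands on the same shard.
    return shard_for(f"order:{order_id}")


placements = {shard_for_order_item(99, n) for n in range(5)}
print(placements)  # a set containing exactly one shard name
```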
Master Slave Replication (CO2)
Master
– is the authoritative source for the data
– is responsible for processing any updates to that data
– can be appointed manually or automatically
Slaves
– A replication process synchronizes the slaves with the master
– After a failure of the master, a slave can be appointed as new
master very quickly
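The roles described above can be sketched in a few lines. This toy model assumes synchronous replication and round-robin reads; the `Node` and `MasterSlaveCluster` classes are illustrative and do not correspond to a real database API.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}


class MasterSlaveCluster:
    def __init__(self, master, slaves):
        self.master = master
        self.slaves = list(slaves)
        self._next = 0

    def write(self, key, value):
        # All updates go through the master...
        self.master.data[key] = value
        # ...and are then copied to the slaves (synchronously here,
        # which real systems often relax for performance).
        for slave in self.slaves:
            slave.data[key] = value

    def read(self, key):
        # Reads are spread across the slaves round-robin.
        node = self.slaves[self._next % len(self.slaves)]
        self._next += 1
        return node.data.get(key)

    def fail_over(self):
        # After a master failure, promote the first slave.
        self.master = self.slaves.pop(0)


cluster = MasterSlaveCluster(Node("master"), [Node("slave-1"), Node("slave-2")])
cluster.write("phone", "555-0100")
print(cluster.read("phone"))
cluster.fail_over()
print(cluster.master.name)
```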
Pros and cons of Master-Slave Replication
Pros
– Handles more read requests:
– Add more slave nodes
– Ensure that all read requests are routed to the slaves
Cons
– The master is a bottleneck
– Limited by its ability to process updates and to pass those updates on
– Its failure eliminates the ability to handle writes until the master is restored or a new master is appointed
Peer to Peer Replication (CO2)
• All the replicas have equal weight, they can all accept writes
• The loss of any of them doesn’t prevent access to the data store.
Pros and cons of peer-to-peer replication
Pros:
– you can ride over node failures without losing access to data
– you can easily add nodes to improve your performance
Cons:
– Inconsistency
– Slow propagation of changes to copies on different nodes
• Replication and sharding are strategies that can be combined.
• If we use both master-slave replication and sharding, we have multiple masters, but each data item has only a single master.
Two schemes:
– A node can be a master for some data and a slave for other data
– Nodes are dedicated to master or slave duties
Sharding and Replication on Master-Slave (CO2)
• Using peer-to-peer replication and sharding is a common strategy for column-family databases.
• Usually each shard is present on three nodes.
Sharding and Replication on P2P (CO2)
Key Points:
– Sharding distributes different data across multiple servers, so
each server acts as the single source for a subset of data.
– Replication copies data across multiple servers, so each bit of
data can be found in multiple places.
A system may use either or both techniques. Replication comes in
two forms:
– Master-slave replication makes one node the authoritative copy
that handles writes while slaves synchronize with the master and
may handle reads.
– Peer-to-peer replication allows writes to any node; the nodes
coordinate to synchronize their copies of the data.
Replication (CO2)
Recap (CO2)
• Replication and sharding are strategies that can be combined.
• We have multiple masters, but each data item has only a single master.
Two schemes:
– A node can be a master for some data and a slave for other data
– Nodes are dedicated to master or slave duties
Topic Name (CO2)
• Consistency and Version Stamp
• Partitioning and Combining
Topic Objective (CO2)
After completion of this topic, students will be able to understand:
• Consistency
• Version Stamp
• Partitioning and Combining
• The consistency property ensures that any transaction will bring the database from one valid state to another.
• Relational databases have strong consistency, whereas NoSQL systems mostly have eventual consistency.
• ACID: A DBMS is expected to support “ACID transactions,” which have these properties:
– Atomicity: either the whole process is done or none of it is
– Consistency: only valid data are written
– Isolation: concurrent transactions behave as if executed one at a time
– Durability: once committed, it stays that way
Consistency (CO2)
Update Consistency (or write-write conflict):
• Martin and Pramod are looking at the company website and notice
that the phone number is out of date. Incredibly, they both have
update access, so they both go in at the same time to update the
number.
Various forms of Consistency (CO2)
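Solutions:
– Pessimistic approach: prevent conflicts from occurring (for example, by acquiring a write lock before updating)
– Optimistic approach: let conflicts occur, then detect them and take action
Approaches:
– Conditional updates: test the value just before updating it, to see whether it has changed since the last read
– Conditional updates do not work if there’s more than one server (peer-to-peer replication)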
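A conditional update can be sketched as a compare-and-set on a version number. The example assumes a single server, which is the case where the technique works; `VersionedStore` and its methods are hypothetical names, and the phone numbers are made up.

```python
class VersionedStore:
    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        return self._data.get(key, (0, None))

    def conditional_update(self, key, expected_version, new_value):
        current_version, _ = self.read(key)
        if current_version != expected_version:
            return False  # someone else updated first: reject
        self._data[key] = (current_version + 1, new_value)
        return True


store = VersionedStore()
store.conditional_update("phone", 0, "555-0100")

# Martin and Pramod both read the same version...
v_martin, _ = store.read("phone")
v_pramod, _ = store.read("phone")

# ...and both try to update. Only the first one succeeds.
martin_ok = store.conditional_update("phone", v_martin, "555-0199")
pramod_ok = store.conditional_update("phone", v_pramod, "555-0177")
print(martin_ok, pramod_ok, store.read("phone"))
```

The rejected writer has to re-read the value and decide whether its update still makes sense.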
Read Consistency (or read-write conflict)
• Alice and Bob are using Ticketmaster website to book tickets for a
specific show.
• Only one ticket is left for the specific show. Alice signs on to
Ticketmaster first and finds one left, and finds it expensive. Alice
takes time to decide.
Replication consistency
• Let’s imagine there’s one last hotel room for a desirable event. The
hotel reservation system runs on many nodes.
• This is another inconsistent read—but it’s a breach of a different
form of consistency we call replication consistency: ensuring that
the same data item has the same value when read from different
replicas.
Eventual consistency:
• At any time, nodes may have replication inconsistencies but, if there
are no further updates, eventually all nodes will be updated to the
same value.
• In other words, eventual consistency is a consistency model used in NoSQL databases to achieve high availability. It informally guarantees that, if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value.
In Big Data systems (especially distributed databases), a
version stamp is a technique used to track changes and
resolve conflicts in replicated or distributed data.
Version stamp
What Is a Version Stamp?
A version stamp is a marker (often a number, timestamp, or
vector) assigned to a piece of data that changes every time the
data is updated.
Purpose: To know which version of the data is the most recent
and to detect or resolve conflicts when the same data is
updated in different places.
Why Is It Needed in Big Data?
• In distributed systems, data is stored across multiple
machines (nodes). Updates might happen:
• At different times
• On different nodes
• Simultaneously (leading to conflicts)
A version stamp helps to:
• Identify the latest version
• Ensure eventual consistency
• Support conflict resolution
Common Types of Version Stamps
Type                    Description
Timestamps              Attach a system time to each update (e.g., 2025-07-04T14:00:00Z)
Incrementing Counters   Each update increases a version number (e.g., v1 → v2 → v3)
Vector Clocks           Advanced version tracking across multiple nodes to detect concurrent changes
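To show how vector clocks detect concurrent changes, the sketch below compares two clocks represented as dictionaries from node name to counter. The `compare` function and its return labels are illustrative assumptions rather than a standard API.

```python
def compare(a, b):
    """Compare two vector clocks given as dicts: node name -> counter."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a happened before b
    if b_le_a:
        return "after"       # a happened after b
    return "concurrent"      # neither dominates: a conflict to resolve


print(compare({"A": 1}, {"A": 2}))
print(compare({"A": 2, "B": 1}, {"A": 1, "B": 2}))
```

A plain counter or timestamp would always pick a winner; the vector clock can instead report that two updates were made independently.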
Example Scenario: Version Stamp with Timestamps
• Let’s say you have a product record in an e-commerce system stored in three replicas:
• Replica A: Price = $100, Timestamp = 12:00 PM
• Replica B: Price = $90, Timestamp = 12:05 PM
• Replica C: Price = $100, Timestamp = 12:00 PM
• The system compares the version stamps (timestamps) and determines that Replica B has the most recent version.
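The scenario above amounts to a "last write wins" rule, sketched here with the same three replicas. The dictionary layout is an assumption, and the approach only works if the replicas' clocks are reasonably synchronized.

```python
from datetime import time

replicas = {
    "A": {"price": 100, "stamp": time(12, 0)},
    "B": {"price": 90, "stamp": time(12, 5)},
    "C": {"price": 100, "stamp": time(12, 0)},
}

# Pick the replica whose version stamp is the most recent.
newest = max(replicas, key=lambda name: replicas[name]["stamp"])
winner = dict(replicas[newest])

# Bring every replica up to date with the winning version.
for name in replicas:
    replicas[name] = dict(winner)

print(newest, replicas["A"]["price"])
```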
Used In:
System             How Versioning Is Used
Apache Cassandra   Uses timestamps to resolve conflicting writes
Amazon Dynamo      Uses vector clocks internally (as described in the Dynamo design)
HBase              Each cell can store multiple versions (by timestamp)
MongoDB            Supports write concern and timestamps for consistency
• The CAP Theorem: The basic statement of the CAP theorem is that, given the three properties of Consistency, Availability, and Partition tolerance, you can only get two.
• Consistency (C): Every read receives the most recent write or an error. All nodes see the same data at the same time.
• Availability (A): Every request (read or write) receives a response, without guarantee that it contains the most recent data.
• Partition Tolerance (P): The system continues to operate despite arbitrary message loss or failure of part of the system (network partitions).
Relaxing Consistency (CO2)
• The CAP theorem states that if you get a network partition, you have
to trade off availability of data versus consistency.
• Very large systems will “partition” at some point.
Network Partition (CO2)
A single-server system is the obvious example of a CA system:
– CA cluster: if a partition occurs, all the nodes would go down
– A failed, unresponsive node doesn’t imply a lack of CAP availability
• If the system is CA (Consistency + Availability) but not partition tolerant:
• CA means:
– The system always responds (Availability)
– All nodes have consistent, up-to-date data (Consistency)
• But it cannot tolerate partitions (network failures between nodes).
An example
– Ann is trying to book a room at the Ace Hotel in New York on a node of a booking system located in London
– Pathin is trying to do the same on a node located in Mumbai
CA System (CO2)
Possible solutions
– CP: Neither user can book any hotel room, sacrificing
availability
– CAP: Designate Mumbai node as the master for Ace hotel
• MapReduce is a way to take a big task and divide it into discrete tasks that can be done in parallel.
• A common use case for Map/Reduce is in document databases.
• A MapReduce program is composed of a Map() procedure that performs filtering and sorting and a Reduce() procedure that performs a summary operation.
• "Map" step
• "Reduce" step
Map Reduce (CO2)
Logical view
• The Map function is applied in parallel to every pair in the input
dataset.
• Map(k1,v1) → list(k2,v2)
• The Reduce function is then applied in parallel to each group, which in
turn produces a collection of values in the same domain:
• Reduce(k2, list (v2)) → list(v3)
• Each Reduce call typically produces either one value v3 or an empty
return
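The logical view above can be made concrete with a small single-process word count. This is only a sketch of the data flow: `map_fn`, `reduce_fn`, and `map_reduce` are hypothetical names, and a real framework would run the map and reduce calls in parallel across nodes.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    # Map(k1, v1) -> list(k2, v2)
    return [(word.lower(), 1) for word in text.split()]


def reduce_fn(word, counts):
    # Reduce(k2, list(v2)) -> list(v3)
    return [sum(counts)]


def map_reduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)  # shuffle: group values by k2
    return {k2: reduce_fn(k2, v2s) for k2, v2s in groups.items()}


docs = [(1, "nosql scales out"), (2, "NoSQL is schemaless")]
result = map_reduce(docs, map_fn, reduce_fn)
print(result["nosql"])
```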
Multistage map-reduce calculations: Let us say that we have a set of documents with attributes in the following form:
{
  "type": "post",
  "name": "Raven's Map/Reduce functionality",
  "blog_id": 1342,
  "post_id": 29293921,
  "tags": ["raven", "nosql"],
  "post_content": "<p>...</p>",
  "comments": [
    {
      "source_ip": "124.2.21.2",
      "author": "martin",
      "text": "excellent blog..."
    }
  ]
}
• Let us see how this works. We start by applying the map query to the set of documents that we have, producing this output:
• The next step is to start reducing the results. In real Map/Reduce algorithms, we partition the original input and work toward the final result.
• And the final step is:
RDBMS compared to MapReduce (CO2)
• MapReduce is a good fit for problems that need to analyze the
whole dataset, in a batch fashion, particularly for ad hoc analysis.
• MapReduce suits applications where the data is written once, and
read many times, whereas a relational database is good for datasets
that are continually updated.
Partitioning and Combining (CO2)
• In the simplest form, we think of a map-reduce job as having a
single reduce function.
• The outputs from all the map tasks running on the various nodes are
concatenated together and sent into the reduce.
• Each reduce function operates on the results of a single key, so reducers for different keys can run in parallel. To take advantage of this, the results of the mapper are divided up based on the key on each processing node.
• Typically, multiple keys are grouped together into partitions. The framework then takes the data from all the nodes for one partition, combines it into a single group for that partition, and sends it off to a reducer.
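Partitioning can be sketched as hashing each key to a partition number and then grouping the mapper output from every node by partition and key. The function names and sample data are assumptions, and `zlib.crc32` is used only because it gives the same result on every run.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    # A stable hash decides which partition (and reducer) owns a key.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle(map_outputs_per_node, num_partitions):
    """Group (key, value) pairs from all nodes by partition, then by key."""
    partitions = defaultdict(lambda: defaultdict(list))
    for node_output in map_outputs_per_node:
        for key, value in node_output:
            partitions[partition_for(key, num_partitions)][key].append(value)
    return partitions


node1 = [("tea", 2), ("coffee", 1)]
node2 = [("tea", 4), ("puerh", 3)]
partitions = shuffle([node1, node2], num_partitions=2)

# Each partition can now be reduced independently of the others.
totals = {
    key: sum(values)
    for part in partitions.values()
    for key, values in part.items()
}
print(totals)
```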
• Reduce Partitioning Example:
07/14/2025 104
Hirdesh Sharma RCA E45 Big Data Unit: 2
Portioning and Combining (CO2)
• Combinable Reducer Example:
07/14/2025 105
Hirdesh Sharma RCA E45 Big Data Unit: 2
Portioning and Combining (CO2)
• Combinable Reducer:
A combiner function is, in essence, a reducer function; indeed, in
many cases the same function can be used for combining as for the final
reduction. The reduce function needs a special shape for this to
work: its output must match its input. We call such a function a
combinable reducer.
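A small sketch of a combinable reducer, assuming a sum over (key, count) pairs; the function name and sample data are illustrative only. Because `combine_counts` both takes and returns lists of (key, count) pairs, it can run on each node as a combiner and again as the final reducer.

```python
from collections import Counter

def combine_counts(pairs):
    """Sum counts per key. Input and output are both lists of (key, count)
    pairs, which is what makes this reducer combinable."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return sorted(totals.items())

# Hypothetical mapper output on two nodes.
node_a = [("tea", 2), ("tea", 1), ("coffee", 1)]
node_b = [("tea", 5), ("coffee", 4)]

# Combine locally on each node to shrink what is sent over the network.
combined_a = combine_counts(node_a)
combined_b = combine_counts(node_b)

# The same function performs the final reduction over the combined output.
final = combine_counts(combined_a + combined_b)

# Combining first gives the same answer as reducing everything at once.
assert final == combine_counts(node_a + node_b)
print(final)
```

A reducer that averages values is not combinable as written, since an average of averages is generally wrong; it would need to carry (sum, count) pairs instead.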
Daily Quiz
• Mapper and Reducer implementations can use the ________ to
report progress or just indicate that they are alive.
a) Partitioner
b) OutputCollector
c) Reporter
d) All of the mentioned
• _________ is the primary interface for a user to describe a
MapReduce job to the Hadoop framework for execution.
a) Map Parameters
b) JobConf
c) MemoryConf
d) None of the mentioned
Recap
• The consistency property ensures that any transaction will bring the
database from one valid state to another.
• Relational databases have strong consistency, whereas NoSQL
systems mostly have eventual consistency.
Faculty Video Links, YouTube & NPTEL Video Links and Online
Courses Details
• https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
• https://www.tutorialspoint.com/hadoop/hadoop_mapreduce.htm
• https://www.sanfoundry.com/mapreduce-questions-answers/
Weekly Assignment 1
Q:1 Explain the concept of NoSQL in Big Data.
Q:2 Give the difference between a relational database and a NoSQL
database.
Q:3 Explain the various aggregate data models:
Key-value storage
Document storage
Graph storage
Column storage
Q:4 Explain the schemaless database. Also explain the properties
of a schemaless database.
Weekly Assignment 2
Q:1 Write a short note on the following terms:
–Materialized Views
–Distribution Models
–Sharding
Q:2 Explain the following terms:
–Version Stamps
–Map Reduce Calculations
–Partitioning and Combining
–Consistency
Q:3 Explain Master-Slave Replication with the help of a suitable
example.
Q:4 Explain Peer-to-Peer Replication with the help of a suitable
example.
MCQs
• Point out the correct statement.
a) MapReduce tries to place the data and the compute as close as
possible
b) Map Task in MapReduce is performed using the Mapper() function
c) Reduce Task in MapReduce is performed using the Map() function
d) All of the mentioned
• Point out the correct statement.
a) Hadoop is an ideal environment for extracting and transforming
small volumes of data
b) Hadoop stores data in HDFS and supports data
compression/decompression
c) The Giraph framework is less useful than a MapReduce job to solve
graph and machine learning
d) None of the mentioned
MCQs
• What license is Hadoop distributed under?
a) Apache License 2.0
b) Mozilla Public License
c) Shareware
d) Commercial
• Which of the following genres does Hadoop produce?
a) Distributed file system
b) JAX-RS
c) Java Message Service
d) Relational Database Management System
Old Question Papers
Expected Questions for University Exam
Q:1 Write a short note on the following terms:
–Materialized Views
–Distribution Models
–Sharding
Q:2 Explain the following terms:
–Version Stamps
–Map Reduce Calculations
–Partitioning and Combining
–Consistency
Q:3 Explain Master-Slave Replication with the help of a suitable
example.
Q:4 Explain Peer-to-Peer Replication with the help of a suitable
example.
Expected Questions for University Exam
Q:5 Explain the concept of NoSQL in Big Data.
Q:6 Give the difference between a relational database and a NoSQL
database.
Q:7 Explain the various aggregate data models:
Key-value storage
Document storage
Graph storage
Column storage
Q:8 Explain the schemaless database. Also explain the properties
of a schemaless database.
Summary
• A NoSQL database, also called Not Only SQL, is an approach to data
management and database design that's useful for very large sets of
distributed data.
• SQL databases have a predefined schema, whereas NoSQL databases
have a dynamic schema for unstructured data.
• There are four general types of NoSQL databases, each with its
own specific attributes:
–Key-value storage
–Document storage
–Graph storage
–Column storage

NOSQL IN BIGDATA FOR PG STUDENTS FOR COL

  • 1.
    UNIT II -NOSQL Data Management
  • 2.
    2 Topic Name (CO2) •Introduction to NoSQL • NoSQL Databases
  • 3.
    07/14/2025 3 Topic Objective(CO2) After completion of this topic, students will be able to understand: • What is NoSQL? • NoSQL Databases
  • 4.
    07/14/2025 4 What isNOSQL? (CO2) • NoSQL database, also called Not Only SQL, is an approach to data management and database design that's useful for very large sets of distributed data. • NoSQL is not a relational database. The reality is that a relational database model may not be the best solution for all situations. • The easiest way to think of NoSQL, is that of a database which does not adhering to the traditional relational database management system (RDMS) structure.
  • 5.
    07/14/2025 Hirdesh SharmaRCA E45 Big Data Unit: 2 5 • Traditional Relational Databases (RDBMS) • Use tables with rows and columns. • Data is organized in a fixed schema (structure). • Relationships between data are defined using keys (primary keys, foreign keys). • Examples: MySQL, PostgreSQL, Oracle, SQL Server. What is NOSQL?
  • 6.
    07/14/2025 Hirdesh SharmaRCA E45 Big Data Unit: 2 6 • NoSQL Databases • Do not follow this strict table-based, relational structure. • They are more flexible in how data is stored and organized. • Can store data in various formats like: – Documents (e.g., JSON or BSON) — MongoDB – Key-Value pairs — Redis, DynamoDB – Wide-column stores — Cassandra, HBase – Graphs — Neo4j • No fixed schema required — data can evolve easily without altering a strict structure. • Designed for scalability, high performance, and handling large volumes of diverse data types. What is NOSQL?
  • 7.
    07/14/2025 7 Why AreNoSQL Databases Interesting? / Why we should use Nosql? / when to use Nosql? There are several reasons why people consider using a NoSQL database: • Application development productivity • Large data • Analytics Why Are NoSQL Databases Interesting? (CO2)
  • 8.
    07/14/2025 8 • Scalability:NoSQL databases are designed to scale; it’s one of the primary reasons that people choose a NoSQL database. • Massive write performance: This is probably the canonical usage based on Google's influence, which implies key-value access, MapReduce, replication, fault tolerance, consistency issues, and all the rest. For faster writes in-memory systems can be used. • Fast key-value access: This is probably the second most cited virtue of NoSQL in the general mind set. Why Are NoSQL Databases Interesting? (CO2)
  • 9.
    07/14/2025 9 Why AreNoSQL Databases Interesting? (CO2) • Flexible data model and flexible datatypes: NoSQL products support a whole range of new data types. We have: column-oriented, graph, advanced data structures, document-oriented, and key-value. Complex objects can be easily stored without a lot of mapping. • Schema migration • Write availability • Easier maintainability, administration and operations
  • 10.
    07/14/2025 10 • Nosingle point of failure • Generally available parallel computing • Programmer ease of use: Accessing your data should be easy. Programmers grok keys, values, JSON, Javascript stored procedures, HTTP, and so on. NoSQL is for programmers. Why Are NoSQL Databases Interesting? (CO2)
  • 11.
    07/14/2025 11 • Usethe right data model for the right problem: Different data models are used to solve different problems. • Distributed systems and cloud computing support: Not everyone is worried about scale or performance over and above that which can be achieved by non-NoSQL systems. Why Are NoSQL Databases Interesting? (CO2)
  • 12.
    07/14/2025 12 Difference betweenSQL and NoSQL (CO2) • SQL databases are primarily called as Relational Databases (RDBMS); whereas NoSQL database are primarily called as non- relational or distributed database. • SQL databases are table based databases whereas NoSQL databases are document based, key-value pairs, graph databases or wide- column stores.
  • 13.
    07/14/2025 13 Difference betweenSQL and NoSQL (CO2) • SQL databases are scaled by increasing the horse-power of the hardware. NoSQL databases are scaled by increasing the databases servers in the pool of resources to reduce the load. • SQL database examples: MySql, Oracle, Sqlite, Postgres and MS- SQL. NoSQL database examples: MongoDB, BigTable, Redis, RavenDb, Cassandra, Hbase, Neo4j and CouchDb.
  • 14.
    07/14/2025 14 • Forthe type of data to be stored: SQL databases are not best fit for hierarchical data storage. But, NoSQL database fits better for the hierarchical data storage as it follows the key-value pair way of storing data similar to JSON data. NoSQL database are highly preferred for large data set (i.e for big data). Difference between SQL and NoSQL (CO2)
  • 15.
    07/14/2025 15 • Forproperties: SQL databases emphasizes on ACID properties ( Atomicity, Consistency, Isolation and Durability) whereas the NoSQL database follows the Brewers CAP theorem ( Consistency, Availability and Partition tolerance ). • For DB types: On a high-level, we can classify SQL databases as either open-source or close-sourced from commercial vendors. NoSQL databases can be classified on the basis of way of storing data as graph databases, key-value store databases, document store databases, column store database and XML databases. Difference between SQL and NoSQL (CO2)
  • 16.
    07/14/2025 16 Hirdesh SharmaRCA E45 Big Data Unit: 2 Type of NoSQL Database (CO2) There are four general types of NoSQL databases, each with their own specific attributes: 1. Key-Value storage 2. Document Databases 3. Column Storage 4. Graph Storage
  • 17.
    07/14/2025 17 • Key-Valuestorage: This is the first category of NoSQL database. Key-value stores have a simple data model, which allow clients to put a map/dictionary and request value par key. In the key-value storage, each key has to be unique to provide non-ambiguous identification of values. For example: Type of NoSQL Database (CO2)
  • 18.
    07/14/2025 18 • Documentdatabases: In the document database NoSQL store document in JSON format. JSON-based document are store in completely different sets of attributes can be stored together, which stores highly unstructured data as named value pairs and applications that look at user behavior, actions, and logs in real time. Type of NoSQL Database (CO2)
  • 19.
    07/14/2025 19 • Columnsstorage: Columnar databases are almost like tabular databases. Thus keys in wide column store scan have many dimensions, resulting in a structure similar to a multi-dimensional, associative array. Shown in below example storing data in a wide column system using a two-dimensional key. Type of NoSQL Database (CO2)
  • 20.
    07/14/2025 20 Type ofNoSQL Database (CO2) • Graph storage: Graph databases are best suited for representing data with a high, yet flexible number of interconnections, especially when information about those interconnections is at least as important as there presented data. In NoSQL database, data is stored in a graph like structures in graph databases, so that the data can be made easily accessible. Graph databases are commonly used on social networking sites. As show in below figure.
  • 21.
  • 22.
    07/14/2025 22 Advantages – Datapersistence – Concurrency – ACID, transactions, etc. – Integration across multiple applications – Standard Model – tables and SQL Disadvantages – Impedance mismatch – Integration databases vs. application databases – Not designed for clustering Pros and Cons of Relational Databases (CO2)
  • 23.
    07/14/2025 23 CONS Impedance Mismatch Intraditional relational databases, data is stored in tables with rows and columns, while in most programming languages (especially object-oriented ones), data is represented as objects. Why it's a disadvantage: – There’s a "mismatch" between how data is structured in the database (tables) and how it's structured in the application (objects). – This often requires extra work to map database tables to programming objects, called Object-Relational Mapping (ORM). – NoSQL tries to solve this by storing data in formats like JSON documents that are closer to application objects, but sometimes this mismatch still exists, especially when mixing NoSQL with relational data or legacy systems.
  • 24.
    07/14/2025 24 Integration Databasesvs. Application Databases – Some databases are designed as integration points to bring data together from multiple sources (integration databases). – Others are designed primarily to support a specific application's data needs (application databases). Why it's a disadvantage: – NoSQL databases are usually application-specific, optimized for a particular app’s data access patterns. – This can make it hard to integrate data from multiple systems or do complex queries spanning different data sources, unlike relational databases which excel at integration. – So if your use case requires combining or analyzing data from many systems, NoSQL might be less ideal.
  • 25.
    07/14/2025 25 Not Designedfor Clustering of NoSQL (in some cases) – Clustering means running multiple database servers together to work as a single system, providing high availability and scalability. – Some early or simpler NoSQL databases were not built to support clustering or distributed operation well. Why it's a disadvantage: – Without good clustering, databases can become a single point of failure or struggle with scaling as data grows. – This means they might not handle big loads or uptime requirements as well as relational databases with mature clustering features. – However, many modern NoSQL systems do support clustering and distributed data, but it depends on the specific database.
  • 26.
    07/14/2025 26 Some commoncharacteristics of nosql include: • Does not use the relational model (mostly) • Generally open source projects (currently) • Driven by the need to run on clusters • Built for the need to run 21st century web properties • Schema-less • Polygot persistence • Auto Sharding Characteristics of NoSQL (CO2)
  • 27.
    07/14/2025 27 • Thepoint of view of using different data stores in different circumstances is known as Polyglot Persistence. • Polyglot persistence is commonly used to define this hybrid approach. • The definition of polyglot is “someone who speaks or writes several languages.” The term polyglot is redefined for big data as a set of applications that use several core database technologies. Polygot Persistence (CO2)
  • 28.
  • 29.
    07/14/2025 29 • Whatlicense is Hadoop distributed under? a) Apache License 2.0 b) Mozilla Public License c) Shareware d) Commercial • Which of the following genres does Hadoop produce? a) Distributed file system b) JAX-RS c) Java Message Service d) Relational Database Management System Daily Quiz
  • 30.
    07/14/2025 30 Recap • NoSQLdatabase, also called Not Only SQL, is an approach to data management and database design that's useful for very large sets of distributed data.
  • 31.
    07/14/2025 31 Hirdesh SharmaRCA E45 Big Data Unit: 2 Topic Name (CO2) • NoSQL Data Models
  • 32.
    07/14/2025 32 Hirdesh SharmaRCA E45 Big Data Unit: 2 Topic Objective (CO2) After completion of this topic, students will be able to understand: • NoSQL Data Models • Aggregate Data Models
  • 33.
    07/14/2025 33 • NoSQLdatabases have a very different model. For example, a document-oriented NoSQL database takes the data you want to store and aggregates it into documents using the JSON format. • Each JSON document can be thought of as an object to be used by your application. • A JSON document might, for example, take all the data stored in a row that spans 20 tables of a relational database and aggregate it into a single document/object. NoSQL Data Model (CO2)
  • 34.
  • 35.
    07/14/2025 35 • Anothermajor difference is that relational technologies have rigid schemas while NoSQL models are schemaless. • The exact opposite of the behavior desired in the Big Data era, where application developers need to constantly – and rapidly – incorporate new types of data to enrich their apps. NoSQL Data Model (CO2)
  • 36.
    07/14/2025 36 Aggregate DataModel in NoSQL (CO2) • Data Model: A data model is the model through which we perceive and manipulate our data. • Relational Data Model: The relational model takes the information that we want to store and divides it into tuples. • Aggregate Model: Aggregate is a term that comes from Domain- Driven Design, an aggregate is a collection of related objects.
  • 37.
    07/14/2025 37 Aggregate DataModel in NoSQL (CO2) • Atomic property holds within an aggregate. • Communication with data storage happens in unit of aggregate. Example of Relations and Aggregates • Let’s assume we have to build an e-commerce website; we are going to be selling items directly to customers over the web. • We can use this scenario to model the data using a relation data store as well as NoSQL data stores and talk about their pros and cons.
  • 38.
    07/14/2025 38 Aggregate DataModel in NoSQL (CO2)
  • 39.
    07/14/2025 39 • Thefollowing figure presents some sample data for this model. Aggregate Data Model in NoSQL (CO2)
  • 40.
    07/14/2025 40 • Inrelational, everything is properly normalized. We also have referential integrity. A realistic order system would naturally be more involved than this. Aggregate Data Model in NoSQL (CO2)
  • 41.
    07/14/2025 41 Again, wehave some sample data, which we’ll show in JSON format as that’s a common representation for data in NoSQL. // in customers { " id":1, "name":"Martin", "billingAddress":[{"city":"Chicago"}] } // in orders { " id":99, "customerId":1, "orderItems":[ Hirdesh Sharma RCA E45 Big Data Unit: 2 Aggregate Data Model in NoSQL (CO2)
  • 42.
    07/14/2025 42 { "productId":27, "price": 32.45, "productName":"NoSQL Distilled" } ], "shippingAddress":[{"city":"Chicago"}] "orderPayment":[ { "ccinfo":"1000-1000-1000-1000", "txnId":"abelif879rft", "billingAddress": {"city": "Chicago"} } ], } Aggregate Data Model in NoSQL (CO2)
  • 43.
    07/14/2025 43 Aggregate DataModel in NoSQL (CO2) • In this model, we have two main aggregates: customer and order. We’ve used the black-diamond composition marker in UML to show how data fits into the aggregation structure. • The customer contains a list of billing addresses; the order contains a list of order items, a shipping address, and payments. • The payment itself contains a billing address for that payment.
  • 44.
    07/14/2025 44 Aggregate OrientedDatabases (CO2) Aggregate-oriented databases work best when most data interaction is done with the same aggregate; Key-value databases – Stores data that is opaque to the database – The database does cannot see the structure of records – Application needs to deal with this – Allows flexibility regarding what is stored (i.e. text or binary data) Document databases – Stores data whose structure is visible to the database – Imposes limitations on what can be stored – Allows more flexible access to data (i.e. partial records) via querying
  • 45.
    07/14/2025 45 Both key-valueand document databases consist of aggregate records accessed by ID values Column-family databases – Two levels of access to aggregates (and hence, two pars to the “key” to access an aggregate’s data) – ID is used to look up aggregate record – Column name – either a label for a value (name) or a key to a list entry (order id) – Columns are grouped into column families Aggregate Oriented Databases (CO2)
  • 46.
    07/14/2025 46 Relational VsAggregate Data Models (CO2)
  • 47.
    07/14/2025 47 Relational VsAggregate Data Models (CO2)
  • 48.
    07/14/2025 48 Schemaless Databases(CO2) • A common theme across all the forms of NoSQL databases is that they are schemaless. • When you want to store data in a relational database, you first have to define a schema—a defined structure for the database which says what tables exist, which columns exist, and what data types each column can hold. • Before you store some data, you have to have the schema defined for it.
  • 49.
    07/14/2025 49 Why schemaless? –A schemaless store also makes it easier to deal with nonuniform data – When starting a new development project you don't need to spend the same amount of time on up-front design of the schema. – No need to learn SQL or database specific stuff and tools. – The rigid schema of a relational database (RDBMS). It can be harder to push data into the DB as it has to perfectly fit the schema. Schemaless Databases (CO2)
  • 50.
    07/14/2025 50 Pros: – Morefreedom and flexibility – you can easily change your data organization – you can deal with non uniform data Cons: – A program that accesses data: almost always relies on some form of implicit schema, it assumes that certain fields are present , carry data with a certain meaning – The implicit schema is shifted into the application code that accesses data Schemaless Databases (CO2)
  • 51.
    07/14/2025 51 Example: You storeuser data in a schemaless database like MongoDB: json // In the database { "username": "sana123", "email": "sana@example.com", "age": 25 } In your application code: python def send_email(user): if "@example.com" in user["email"]: email_user(user["email"]) This code assumes: •user["email"] exists •It’s a string •It contains an email If email is missing or of the wrong type, the code may crash.
  • 52.
    07/14/2025 52 Multiple servers: –In NoSQL systems, data distributed over large clusters. Single server: – simplest model, everything on one machine. Run the database on a single machine that handles all the reads and writes to the data store. Distribution Models (CO2)
  • 53.
    07/14/2025 53 Sharding: • DBSharding is nothing but horizontal partitioning of data. Different people are accessing different parts of the dataset. • In these circumstances we can support horizontal scalability by putting different parts of the data onto different servers—a technique that’s called sharding. Orthogonal aspects of data distribution models (C02)
  • 54.
  • 55.
    07/14/2025 55 • Differentparts of the data onto different servers – Horizontal scalability – Ideal case: different users all talking to different server nodes – Data accessed together on the same node aggregate unit! ̶ • Pros: it can improve both reads and writes • Cons: Clusters use less reliable machines resilience decreases ̶ • Many NoSQL databases offer auto-sharding – the database takes on the responsibility of sharding Sharding (C02)
  • 56.
    07/14/2025 56 Sharding (CO2) Improvingperformance: Main rules of sharding: • Place the data close to where it’s accessed – Orders for Boston: data in your eastern US data center – If users in Boston access their data from a server in California, it will be slower. – Storing Boston user data in a data center in the eastern US improves performance. • Try to keep the load even – All nodes should get equal amounts of the load •Storing Boston user data in a data center in the eastern US improves performance. lacing all items for a single order on the same shard makes reads/writes faster and simple
  • 57.
    07/14/2025 Hirdesh SharmaRCA E45 Big Data Unit: 2 57 • Put together aggregates that may be read in sequence  Same order, same node  If one user’s order is spread across multiple shards, reading or updating it becomes expensive.  Placing all items for a single order on the same shard makes reads/writes faster and simpler.
  • 58.
    07/14/2025 58 Master SlaveReplication (CO2) Master – is the authoritative source for the data – is responsible for processing any updates to that data – can be appointed manually or automatically Slaves – A replication process synchronizes the slaves with the master – After a failure of the master, a slave can be appointed as new master very quickly
  • 59.
    07/14/2025 59 Master SlaveReplication (CO2)
  • 60.
    07/14/2025 60 Pros andcons of Master-Slave Replication Pros – More read requests: – Add more slave nodes – Ensure that all read requests are routed to the slaves Cons – The master is a bottleneck – Limited by its ability to process updates and to pass those updates on – Its failure does eliminate the ability to handle writes until: Master Slave Replication (CO2)
  • 61.
    07/14/2025 61 Peer toPeer Replication (CO2)
  • 62.
    07/14/2025 62 • Allthe replicas have equal weight, they can all accept writes • The loss of any of them doesn’t prevent access to the data store. Pros and cons of peer-to-peer replication Pros: – you can ride over node failures without losing access to data – you can easily add nodes to improve your performance Cons: – Inconsistency – Slow propagation of changes to copies on different nodes Peer to Peer Replication (CO2)
  • 63.
    07/14/2025 63 • Replicationand sharding are strategies that can be combined. • If we use both master slave replication and sharding, this means that we have multiple masters, but each data item only has a single master. • We have multiple masters, but each data only has a single master. Two schemes: – A node can be a master for some data and slaves for others – Nodes are dedicated for master or slave duties Sharding and Replication on Master-Slave (CO2)
  • 64.
    07/14/2025 64 Sharding andReplication on Master-Slave (CO2)
  • 65.
    07/14/2025 65 • Usingpeer-to-peer replication and sharding is a common strategy for column family databases. • Usually each shard is present on three nodes. • A common strategy for column-family databases. Sharding and Replication on P2P (CO2)
  • 66.
    07/14/2025 66 Sharding andReplication on P2P (CO2)
  • 67.
    07/14/2025 67 Key Points: –Sharding distributes different data across multiple servers, so each server acts as the single source for a subset of data. – Replication copies data across multiple servers, so each bit of data can be found in multiple places. A system may use either or both techniques. Replication comes in two forms: – Master-slave replication makes one node the authoritative copy that handles writes while slaves synchronize with the master and may handle reads. – Peer-to-peer replication allows writes to any node; the nodes coordinate to synchronize their copies of the data. Replication (CO2)
  • 68.
    07/14/2025 68 Recap (CO2) •Replication and sharding are strategies that can be combined. • We have multiple masters, but each data only has a single master. Two schemes: – A node can be a master for some data and slaves for others – Nodes are dedicated for master or slave duties
  • 69.
    07/14/2025 69 Topic Name(CO2) • Consistency and Version Stamp • Positioning and Combining
  • 70.
    07/14/2025 70 Topic Objective(CO2) After completion of this topic, students will be able to understand: • Consistency • Version Stamp • Positioning and Combining
  • 71.
    07/14/2025 71 • Theconsistency property ensures that any transaction will bring the database from one valid state to another. • Relational databases has strong consistency whereas NoSQL systems hass mostly eventual consistency. • ACID: A DBMS is expected to support “ACID transactions,” processes that are: – Atomicity: either the whole process is done or none is – Consistency: only valid data are written – Isolation: one operation at a time – Durability: once committed, it stays that way Consistency (CO2)
  • 72.
    07/14/2025 72 Update Consistency(or write-write conflict): • Martin and Pramod are looking at the company website and notice that the phone number is out of date. Incredibly, they both have update access, so they both go in at the same time to update the number. Hirdesh Sharma RCA E45 Big Data Unit: 2 Various forms of Consistency (CO2)
  • 73.
    07/14/2025 73 Update Consistency(or write-write conflict): Solutions: – Pessimistic approach – Prevent conflicts from occurring Approaches: – conditional updates: test the value just before updating – Do not work if there’s more than one server (peer-to-peer replication) Hirdesh Sharma RCA E45 Big Data Unit: 2 Various forms of Consistency (CO2)
  • 74.
    07/14/2025 74 Read Consistency(or read-write conflict) • Alice and Bob are using Ticketmaster website to book tickets for a specific show. • Only one ticket is left for the specific show. Alice signs on to Ticketmaster first and finds one left, and finds it expensive. Alice takes time to decide. Hirdesh Sharma RCA E45 Big Data Unit: 2 Various forms of Consistency (CO2)
  • 75.
    07/14/2025 75 Read Consistency(or read-write conflict) Hirdesh Sharma RCA E45 Big Data Unit: 2 Various forms of Consistency (CO2)
  • 76.
    07/14/2025 76 Various formsof Consistency (CO2) Replication consistency • Let’s imagine there’s one last hotel room for a desirable event. The hotel reservation system runs on many nodes. • This is another inconsistent read—but it’s a breach of a different form of consistency we call replication consistency: ensuring that the same data item has the same value when read from different replicas.
  • 77.
  • 78.
    07/14/2025 78 Various formsof Consistency (CO2) Eventual consistency: • At any time, nodes may have replication inconsistencies but, if there are no further updates, eventually all nodes will be updated to the same value. • In other words, Eventual consistency is a consistency model used in nosql database to achieve high availability that informally guarantees that, if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value.
  • 79.
    07/14/2025 79 In BigData systems (especially distributed databases), a version stamp is a technique used to track changes and resolve conflicts in replicated or distributed data. Version stamp What Is a Version Stamp? A version stamp is a marker (often a number, timestamp, or vector) assigned to a piece of data that changes every time the data is updated. Purpose: To know which version of the data is the most recent and to detect or resolve conflicts when the same data is updated in different places.
  • 80.
    07/14/2025 80 Why IsIt Needed in Big Data? • In distributed systems, data is stored across multiple machines (nodes). Updates might happen: • At different times • On different nodes • Simultaneously (leading to conflicts) A version stamp helps to: • Identify the latest version • Ensure eventual consistency • Support conflict resolution
  • 81.
    Common Types of Version Stamps
    Type | Description
    Timestamps | Attach a system time to each update (e.g., 2025-07-04T14:00:00Z)
    Incrementing Counters | Each update increases a version number (e.g., v1 → v2 → v3)
    Vector Clocks | Advanced version tracking across multiple nodes to detect concurrent changes
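The vector clocks entry above can be illustrated with a minimal Python sketch. It assumes a clock is a plain dictionary from node name to counter; the function names (`increment`, `compare`, `merge`) and the node names "A" and "B" are hypothetical and are not taken from any particular database.

```python
from typing import Dict

VectorClock = Dict[str, int]  # node name -> number of updates seen from that node


def increment(clock: VectorClock, node: str) -> VectorClock:
    """Return a copy of the clock with this node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def compare(a: VectorClock, b: VectorClock) -> str:
    """Classify clock a relative to b: 'equal', 'before', 'after', or 'concurrent'."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"  # neither clock dominates: the updates conflict


def merge(a: VectorClock, b: VectorClock) -> VectorClock:
    """Take the per-node maximum, used after a conflict has been resolved."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}


base = increment({}, "A")     # first write, handled by node A
on_a = increment(base, "A")   # a later update handled by node A
on_b = increment(base, "B")   # a different update handled by node B

print(compare(base, on_a))    # base happened before on_a
print(compare(on_a, on_b))    # neither saw the other's update
```

Unlike a single counter or timestamp, the comparison can report that two versions are concurrent, which is the signal that conflict resolution is needed.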
  • 82.
    Example Scenario: Version Stamp with Timestamps
    • Suppose a product record in an e-commerce system is stored in three replicas:
    – Replica A: Price = $100, Timestamp = 12:00 PM
    – Replica B: Price = $90, Timestamp = 12:05 PM
    – Replica C: Price = $100, Timestamp = 12:00 PM
    • The system compares the version stamps (timestamps) and determines that Replica B has the most recent version.
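The three-replica scenario can be sketched in Python as a "latest timestamp wins" comparison. The `ReplicaValue` class and `latest` function are hypothetical names, and the calendar date is an assumption added only because the slide gives times without a date.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class ReplicaValue:
    replica: str
    price: int
    updated_at: datetime  # the version stamp


def latest(values: List[ReplicaValue]) -> ReplicaValue:
    """Pick the value carrying the most recent timestamp."""
    return max(values, key=lambda v: v.updated_at)


# The date (2025-07-04) is assumed; only the times come from the example.
replicas = [
    ReplicaValue("A", 100, datetime(2025, 7, 4, 12, 0)),
    ReplicaValue("B", 90, datetime(2025, 7, 4, 12, 5)),
    ReplicaValue("C", 100, datetime(2025, 7, 4, 12, 0)),
]

winner = latest(replicas)
print(winner.replica, winner.price)
```

This approach only works well when the clocks of the nodes are kept closely synchronized, and it cannot tell a genuinely newer update from a concurrent one.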
  • 83.
    Used In:
    System | How Versioning Is Used
    Apache Cassandra | Uses timestamps to resolve conflicting writes
    Amazon Dynamo | The original Dynamo design, on which DynamoDB builds, uses vector clocks internally
    HBase | Each cell can store multiple versions (by timestamp)
    MongoDB | Supports write concern and timestamps for consistency
  • 84.
    Relaxing Consistency (CO2)
    • The CAP Theorem: the basic statement of the CAP theorem is that, given the three properties of Consistency, Availability, and Partition tolerance, you can only get two.
    • Consistency (C): Every read receives the most recent write or an error. All nodes see the same data at the same time.
    • Availability (A): Every request (read or write) receives a response, without a guarantee that it contains the most recent data.
    • Partition Tolerance (P): The system continues to operate despite arbitrary message loss or failure of part of the system (network partitions).
  • 86.
    Network Partition (CO2)
    • The CAP theorem states that if you get a network partition, you have to trade off availability of data against consistency.
    • Very large systems will "partition" at some point.
  • 87.
    CA System (CO2)
    • A single-server system is the obvious example of a CA system.
    • A CA system provides Consistency and Availability but is NOT partition tolerant:
    – The system always responds (Availability)
    – All nodes have consistent, up-to-date data (Consistency)
    – It cannot tolerate partitions (network failures between nodes)
    • CA cluster: if a partition occurs, all the nodes would go down.
    • A failed, unresponsive node doesn't imply a lack of CAP availability.
    • An example:
    – Ann is trying to book a room at the Ace Hotel in New York on a node of a booking system located in London
    – Pathin is trying to do the same on a node located in Mumbai
  • 88.
    CA System (CO2)
    Possible solutions:
    – CP: neither user can book any hotel room, sacrificing availability
    – Improved availability: designate the Mumbai node as the master for the Ace Hotel, so that all bookings for that hotel are processed there
  • 89.
    Map Reduce (CO2)
    • MapReduce is a way to take a big task and divide it into discrete tasks that can be done in parallel.
    • A common use case for Map/Reduce is in document databases.
    • A MapReduce program is composed of a Map() procedure that performs filtering and sorting and a Reduce() procedure that performs a summary operation.
    – "Map" step
    – "Reduce" step
  • 90.
    Map Reduce (CO2)
    Logical view
    • The Map function is applied in parallel to every (k1, v1) pair in the input dataset:
    – Map(k1, v1) → list(k2, v2)
    • The framework then collects all pairs with the same key k2 into one group. The Reduce function is applied in parallel to each group and produces a collection of values in the same domain:
    – Reduce(k2, list(v2)) → list(v3)
    • Each Reduce call typically produces either one value v3 or an empty return.
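The Map and Reduce signatures above can be made concrete with a small single-process Python sketch of a word count. It is an illustration only: the names `map_fn`, `shuffle`, `reduce_fn`, and `run` are hypothetical, and a real framework would run the map and reduce calls on many nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_fn(doc_id: str, text: str) -> List[Tuple[str, int]]:
    """Map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in a document."""
    return [(word.lower(), 1) for word in text.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework does between the two steps."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(word: str, counts: List[int]) -> List[int]:
    """Reduce(k2, list(v2)) -> list(v3): one total per word."""
    return [sum(counts)]


def run(docs: Dict[str, str]) -> Dict[str, int]:
    mapped = [pair for doc_id, text in docs.items() for pair in map_fn(doc_id, text)]
    grouped = shuffle(mapped)
    return {word: reduce_fn(word, counts)[0] for word, counts in grouped.items()}


print(run({"d1": "big data big", "d2": "data"}))
```

Because each `map_fn` call looks at one document and each `reduce_fn` call looks at one key, both steps can be spread across machines without coordination.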
  • 94.
    Map Reduce (CO2)
    Multistage map-reduce calculations
    • Let us say that we have a set of documents, each with attributes of the following form:
    {
      "type": "post",
      "name": "Raven's Map/Reduce functionality",
      "blog_id": 1342,
      "post_id": 29293921,
      "tags": ["raven", "nosql"],
      "post_content": "<p>...</p>",
      "comments": [
        {
          "source_ip": "124.2.21.2",
          "author": "martin",
          "text": "excellent blog..."
        }
      ]
    }
  • 95.
    Map Reduce (CO2)
    • Let us see how this works. We start by applying the map query to the set of documents that we have, producing this output:
  • 96.
    Map Reduce (CO2)
    • The next step is to start reducing the results. In real Map/Reduce algorithms, we partition the original input and work toward the final result.
  • 98.
    Map Reduce (CO2)
    • And the final step is:
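The multistage walkthrough (map, reduce each partition, then reduce the partial results) can be sketched in Python as follows. The figures from the slides are not available here, so the query is an assumption: it counts comments per blog. The sample posts, the function names, and the split into two batches are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_post(doc: Dict) -> Tuple[int, int]:
    """Map step: emit (blog_id, number of comments) for one post document."""
    return (doc["blog_id"], len(doc["comments"]))


def reduce_counts(pairs: Iterable[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Reduce step: sum comment counts per blog_id. Output has the same shape as input."""
    totals: Dict[int, int] = defaultdict(int)
    for blog_id, count in pairs:
        totals[blog_id] += count
    return sorted(totals.items())


# Hypothetical posts; only the fields the query needs are shown.
posts = [
    {"blog_id": 1342, "post_id": 1, "comments": [{"author": "martin"}]},
    {"blog_id": 1342, "post_id": 2, "comments": [{"author": "a"}, {"author": "b"}]},
    {"blog_id": 7, "post_id": 3, "comments": []},
    {"blog_id": 1342, "post_id": 4, "comments": [{"author": "c"}] * 3},
]

# Stage 1: partition the input and reduce each batch independently.
batches = [posts[:2], posts[2:]]
partials = [reduce_counts(map_post(p) for p in batch) for batch in batches]

# Stage 2 (final step): reduce the partial results into one answer.
final = reduce_counts(pair for partial in partials for pair in partial)
print(final)
```

The final reduce can reuse `reduce_counts` only because its output has the same (blog_id, count) shape as its input.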
  • 99.
    RDBMS compared to MapReduce (CO2)
    • MapReduce is a good fit for problems that need to analyze the whole dataset in a batch fashion, particularly for ad hoc analysis.
    • MapReduce suits applications where the data is written once and read many times, whereas a relational database is good for datasets that are continually updated.
  • 101.
    Partitioning and Combining (CO2)
    • In the simplest form, we think of a map-reduce job as having a single reduce function.
    • The outputs from all the map tasks running on the various nodes are concatenated together and sent into the reduce.
  • 102.
    Partitioning and Combining (CO2)
    • Each reduce call operates on the values for a single key, so reductions for different keys can run in parallel.
    • To take advantage of this, the results of the mapper are divided up based on the key on each processing node.
    • Typically, multiple keys are grouped together into partitions. The framework then takes the data from all the nodes for one partition, combines it into a single group for that partition, and sends it off to a reducer.
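A minimal Python sketch of key-based partitioning is shown below. It assumes a simple hash-modulo scheme; the function names and the sample keys are hypothetical. `zlib.crc32` is used instead of the built-in `hash()` because Python randomizes string hashes per process, which would send the same key to different partitions on different nodes.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def partition_for(key: str, num_partitions: int) -> int:
    """Choose a partition from a stable hash of the key, so every node agrees."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition(
    pairs: Iterable[Tuple[str, int]], num_partitions: int
) -> Dict[int, Dict[str, List[int]]]:
    """Group mapper output first by partition, then by key within the partition."""
    parts: Dict[int, Dict[str, List[int]]] = defaultdict(lambda: defaultdict(list))
    for key, value in pairs:
        parts[partition_for(key, num_partitions)][key].append(value)
    return parts


# Mapper output from two hypothetical nodes.
node1 = [("tea", 2), ("coffee", 1)]
node2 = [("tea", 5), ("cocoa", 3)]

parts = partition(node1 + node2, num_partitions=2)
for number, groups in sorted(parts.items()):
    print(number, dict(groups))  # each partition would go to its own reducer
```

All values for one key land in the same partition regardless of which node produced them, so each reducer sees the complete group for its keys.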
  • 103.
    Partitioning and Combining (CO2)
    • Reduce Partitioning Example:
  • 104.
    Partitioning and Combining (CO2)
    • Combinable Reducer Example:
  • 105.
    Partitioning and Combining (CO2)
    • Combinable Reducer: A combiner function is, in essence, a reducer function. Indeed, in many cases the same function can be used for combining as for the final reduction. The reduce function needs a special shape for this to work: its output must match its input. We call such a function a combinable reducer.
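A short Python sketch of a combinable reducer follows, assuming a sum over (key, value) pairs. The function name `sum_reducer` and the sample data are hypothetical. Because the function returns pairs of the same shape it consumes, it can run once per node as a combiner and again as the final reducer.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def sum_reducer(pairs: Iterable[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Sum values per key. Input and output are both lists of (key, total) pairs."""
    totals: Dict[str, int] = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


# Mapper output on two hypothetical nodes.
node1 = [("tea", 2), ("tea", 3), ("coffee", 1)]
node2 = [("tea", 5), ("coffee", 4)]

# With a combiner: reduce locally on each node, then reduce the combined output.
combined = sum_reducer(node1) + sum_reducer(node2)
with_combiner = sum_reducer(combined)

# Without a combiner: ship every pair to the reducer.
without_combiner = sum_reducer(node1 + node2)

print(with_combiner == without_combiner, len(combined), len(node1 + node2))
```

The result is the same either way, but fewer pairs cross the network when the combiner runs first. Not every reduction has this shape: an average of per-node averages, for example, is generally not the overall average.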
  • 106.
    Daily Quiz
    • Mapper and Reducer implementations can use the ________ to report progress or just indicate that they are alive.
    a) Partitioner
    b) OutputCollector
    c) Reporter
    d) All of the mentioned
    • _________ is the primary interface for a user to describe a MapReduce job to the Hadoop framework for execution.
    a) Map Parameters
    b) JobConf
    c) MemoryConf
    d) None of the mentioned
  • 107.
    Recap
    • The consistency property ensures that any transaction will bring the database from one valid state to another.
    • Relational databases have strong consistency, whereas NoSQL systems mostly have eventual consistency.
  • 108.
    Faculty Video Links, Youtube & NPTEL Video Links and Online Courses Details
    • https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
    • https://www.tutorialspoint.com/hadoop/hadoop_mapreduce.htm
    • https://www.sanfoundry.com/mapreduce-questions-answers/
  • 109.
    Weekly Assignment 1
    Q:1 Explain the concept of NoSQL in Big Data.
    Q:2 Give the difference between a relational database and a NoSQL database.
    Q:3 Explain the various aggregate data models:
    – Key value storage
    – Document storage
    – Graph storage
    – Column value storage
    Q:4 Explain the schemaless database. Also explain the properties of a schemaless database.
  • 110.
    Weekly Assignment 2
    Q:1 Write a short note on the following terms:
    – Materialized Views
    – Distribution Models
    – Sharding
    Q:2 Explain the following terms:
    – Version Stamps
    – Map Reduce Calculations
    – Partitioning and Combining
    – Consistency
    Q:3 Explain Master-Slave Replication with the help of a suitable example.
    Q:4 Explain Peer-to-Peer Replication with the help of a suitable example.
  • 111.
    MCQs
    • Point out the correct statement.
    a) MapReduce tries to place the data and the compute as close as possible
    b) Map Task in MapReduce is performed using the Mapper() function
    c) Reduce Task in MapReduce is performed using the Map() function
    d) All of the mentioned
    • Point out the correct statement.
    a) Hadoop is an ideal environment for extracting and transforming small volumes of data
    b) Hadoop stores data in HDFS and supports data compression/decompression
    c) The Giraph framework is less useful than a MapReduce job to solve graph and machine learning
    d) None of the mentioned
  • 112.
    MCQs
    • What license is Hadoop distributed under?
    a) Apache License 2.0
    b) Mozilla Public License
    c) Shareware
    d) Commercial
    • Which of the following genres does Hadoop produce?
    a) Distributed file system
    b) JAX-RS
    c) Java Message Service
    d) Relational Database Management System
  • 113.
    Old Question Papers
  • 117.
    Expected Questions for University Exam
    Q:1 Write a short note on the following terms:
    – Materialized Views
    – Distribution Models
    – Sharding
    Q:2 Explain the following terms:
    – Version Stamps
    – Map Reduce Calculations
    – Partitioning and Combining
    – Consistency
    Q:3 Explain Master-Slave Replication with the help of a suitable example.
    Q:4 Explain Peer-to-Peer Replication with the help of a suitable example.
  • 118.
    Expected Questions for University Exam
    Q:5 Explain the concept of NoSQL in Big Data.
    Q:6 Give the difference between a relational database and a NoSQL database.
    Q:7 Explain the various aggregate data models:
    – Key value storage
    – Document storage
    – Graph storage
    – Column value storage
    Q:8 Explain the schemaless database. Also explain the properties of a schemaless database.
  • 119.
    Summary
    • NoSQL database, also called Not Only SQL, is an approach to data management and database design that's useful for very large sets of distributed data.
    • SQL databases have a predefined schema, whereas NoSQL databases have a dynamic schema for unstructured data.
    • There are four general types of NoSQL databases, each with their own specific attributes:
    – Key value storage
    – Document storage
    – Graph storage
    – Column value storage
  • 120.
    References
    1. Michael Minelli, Michelle Chambers, and Ambiga Dhiraj, "Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today's Businesses", Wiley, 2013.
    2. P. J. Sadalage and M. Fowler, "NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence", Addison-Wesley Professional, 2012.
    3. Tom White, "Hadoop: The Definitive Guide", Third Edition, O'Reilly, 2012.
    4. Eric Sammer, "Hadoop Operations", O'Reilly, 2012.
    5. E. Capriolo, D. Wampler, and J. Rutherglen, "Programming Hive", O'Reilly, 2012.
    6. Lars George, "HBase: The Definitive Guide", O'Reilly, 2011.
    7. Eben Hewitt, "Cassandra: The Definitive Guide", O'Reilly, 2010.
    8. Alan Gates, "Programming Pig", O'Reilly, 2011.
    Thank You