NOSQL IN BIGDATA FOR PG STUDENTS FOR COL

UNIT II - NoSQL Data Management

Topic Name (CO2)
• Introduction to NoSQL
• NoSQL Databases

07/14/2025 3
Topic Objective (CO2)
After completion of this topic, students will be able to understand:
• What is NoSQL?
• NoSQL Databases

What is NOSQL? (CO2)
• NoSQL database, also called Not Only SQL, is an approach to data
management and database design that's useful for very large sets of
distributed data.
• NoSQL is not a relational database. The reality is that a relational
database model may not be the best solution for all situations.
• The easiest way to think of NoSQL is as a database that does not
adhere to the traditional relational database management system
(RDBMS) structure.

07/14/2025 Hirdesh Sharma RCA E45 Big Data
Unit: 2
5
What is NOSQL?
• Traditional Relational Databases (RDBMS)
– Use tables with rows and columns.
– Data is organized in a fixed schema (structure).
– Relationships between data are defined using keys (primary
keys, foreign keys).
– Examples: MySQL, PostgreSQL, Oracle, SQL Server.

What is NOSQL?
• NoSQL Databases
– Do not follow this strict table-based, relational structure.
– They are more flexible in how data is stored and organized.
– Can store data in various formats like:
  – Documents (e.g., JSON or BSON): MongoDB
  – Key-value pairs: Redis, DynamoDB
  – Wide-column stores: Cassandra, HBase
  – Graphs: Neo4j
– No fixed schema required: data can evolve easily without
altering a strict structure.
– Designed for scalability, high performance, and handling large
volumes of diverse data types.

Why Are NoSQL Databases Interesting? (CO2)
Why should we use NoSQL, and when should we use it?
There are several reasons why people consider using a NoSQL
database:
• Application development productivity
• Large data
• Analytics

• Scalability: NoSQL databases are designed to scale; it’s one of the
primary reasons that people choose a NoSQL database.
• Massive write performance: This is probably the canonical usage
based on Google's influence, which implies key-value access,
MapReduce, replication, fault tolerance, consistency issues, and all
the rest. For faster writes, in-memory systems can be used.
• Fast key-value access: This is probably the second most cited
virtue of NoSQL in the general mindset.

• Flexible data model and flexible datatypes: NoSQL products
support a whole range of new data types. We have: column-oriented,
graph, advanced data structures, document-oriented, and key-value.
Complex objects can be easily stored without a lot of mapping.
• Schema migration
• Write availability
• Easier maintainability, administration and operations

• No single point of failure
• Generally available parallel computing
• Programmer ease of use: Accessing your data should be easy.
Programmers grok keys, values, JSON, JavaScript stored procedures,
HTTP, and so on. NoSQL is for programmers.

• Use the right data model for the right problem: Different data
models are used to solve different problems.
• Distributed systems and cloud computing support: Not everyone
is worried about scale or performance over and above that which
can be achieved by non-NoSQL systems.

Difference between SQL and NoSQL (CO2)
• SQL databases are primarily known as relational databases
(RDBMS), whereas NoSQL databases are primarily known as
non-relational or distributed databases.
• SQL databases are table-based, whereas NoSQL databases are
document-based, key-value, graph, or wide-column stores.

• SQL databases are scaled by increasing the horsepower of the
hardware (vertical scaling). NoSQL databases are scaled by adding
database servers to the pool of resources to spread the load
(horizontal scaling).
• SQL database examples: MySQL, Oracle, SQLite, PostgreSQL, and
MS SQL Server. NoSQL database examples: MongoDB, Bigtable, Redis,
RavenDB, Cassandra, HBase, Neo4j, and CouchDB.

• For the type of data to be stored: SQL databases are not the best
fit for hierarchical data storage. NoSQL databases fit hierarchical
data better because they store data as key-value pairs, similar to
JSON. NoSQL databases are highly preferred for large data sets
(i.e., for big data).

• For properties: SQL databases emphasize the ACID properties
(Atomicity, Consistency, Isolation, and Durability), whereas
NoSQL databases follow Brewer's CAP theorem (Consistency,
Availability, and Partition tolerance).
• For DB types: At a high level, we can classify SQL databases as
either open source or closed source from commercial vendors.
NoSQL databases can be classified by the way they store data:
graph databases, key-value stores, document stores, column
stores, and XML databases.

Type of NoSQL Database (CO2)
There are four general types of NoSQL databases, each with their
own specific attributes:
1. Key-Value storage
2. Document Databases
3. Column Storage
4. Graph Storage

• Key-Value storage: This is the first category of NoSQL database.
Key-value stores have a simple data model, which allows clients to
put a map/dictionary and request the value per key. In key-value
storage, each key has to be unique to provide unambiguous
identification of values. For example:
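A minimal Python sketch of the key-value model, assuming an in-memory dictionary stands in for the store. The class name `KeyValueStore` and the sample keys are hypothetical and are not taken from any particular product.

```python
class KeyValueStore:
    """Toy in-memory key-value store (illustrative only)."""

    def __init__(self):
        self._data = {}  # each unique key maps to exactly one value

    def put(self, key, value):
        # Writing to an existing key replaces its value, so keys stay unique.
        self._data[key] = value

    def get(self, key, default=None):
        # The store looks values up by key only; it never inspects the value.
        return self._data.get(key, default)


store = KeyValueStore()
store.put("user:1001", b'{"name": "Martin"}')   # value is an opaque blob
store.put("session:abc", "expires=3600")
store.put("user:1001", b'{"name": "Pramod"}')   # same key: value is replaced

print(store.get("user:1001"))
print(store.get("missing-key"))
```

Because the store treats values as opaque, any interpretation of the bytes (here, JSON) is left to the application.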

• Document databases: A document database stores each record as a
document, typically in JSON format. Documents with completely
different sets of attributes can be stored together, which suits
highly unstructured data held as named value pairs and
applications that look at user behavior, actions, and logs in real time.

• Column storage: Columnar databases are almost like tabular
databases. Keys in wide-column stores can have many
dimensions, resulting in a structure similar to a multi-dimensional
associative array. For example, data can be stored in a wide-column
system using a two-dimensional key.
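The idea of a two-dimensional key can be sketched in Python, assuming a plain dictionary keyed by a (row key, column name) tuple. The helper names `put` and `get_row` and the sample rows are hypothetical; real wide-column stores add column families, timestamps, and on-disk layout that this sketch ignores.

```python
# Two-dimensional key: (row key, column name) -> value.
wide_column = {}

def put(row_key, column, value):
    wide_column[(row_key, column)] = value

def get_row(row_key):
    # Collect every column stored under one row key.
    return {col: val for (row, col), val in wide_column.items() if row == row_key}

put("customer:1", "name", "Martin")
put("customer:1", "city", "Chicago")
put("customer:2", "name", "Pramod")   # rows need not share the same columns

print(wide_column[("customer:1", "city")])
print(get_row("customer:2"))
```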

• Graph storage: Graph databases are best suited for representing
data with a high, yet flexible number of interconnections, especially
when information about those interconnections is at least as
important as the represented data. In graph databases, data is stored
in graph-like structures so that it can be made easily accessible.
Graph databases are commonly used on social networking sites, as
shown in the figure below.

NoSQL Database Examples (CO1)

Pros and Cons of Relational Databases (CO2)
Advantages
– Data persistence
– Concurrency – ACID, transactions, etc.
– Integration across multiple applications
– Standard Model – tables and SQL
Disadvantages
– Impedance mismatch
– Integration databases vs. application databases
– Not designed for clustering

CONS
Impedance Mismatch
In traditional relational databases, data is stored in tables with rows and
columns, while in most programming languages (especially object-oriented
ones), data is represented as objects.
Why it's a disadvantage:
– There’s a "mismatch" between how data is structured in the database (tables) and
how it's structured in the application (objects).
– This often requires extra work to map database tables to programming objects,
called Object-Relational Mapping (ORM).
– NoSQL tries to solve this by storing data in formats like JSON documents that
are closer to application objects, but sometimes this mismatch still exists,
especially when mixing NoSQL with relational data or legacy systems.

Integration Databases vs. Application Databases
– Some databases are designed as integration points to bring data together
from multiple sources (integration databases).
– Others are designed primarily to support a specific application's data
needs (application databases).
– NoSQL databases are usually application-specific, optimized for a
particular app’s data access patterns.
– This can make it hard to integrate data from multiple systems or do
complex queries spanning different data sources, unlike relational
databases which excel at integration.
– So if your use case requires combining or analyzing data from many
systems, NoSQL might be less ideal.

Not Designed for Clustering of NoSQL (in some cases)
– Clustering means running multiple database servers together to
work as a single system, providing high availability and scalability.
– Some early or simpler NoSQL databases were not built to support
clustering or distributed operation well.
– Without good clustering, databases can become a single point of
failure or struggle with scaling as data grows.
– This means they might not handle big loads or uptime requirements
as well as relational databases with mature clustering features.
– However, many modern NoSQL systems do support clustering and
distributed data, but it depends on the specific database.

Characteristics of NoSQL (CO2)
Some common characteristics of NoSQL include:
• Does not use the relational model (mostly)
• Generally open source projects (currently)
• Driven by the need to run on clusters
• Built for the need to run 21st century web properties
• Schema-less
• Polyglot persistence
• Auto-sharding

Polyglot Persistence (CO2)
• Using different data stores in different circumstances is known as
polyglot persistence.
• The term is commonly used to describe this hybrid approach.
• The definition of polyglot is “someone who speaks or writes several
languages.” The term polyglot is redefined for big data as a set of
applications that use several core database technologies.


Daily Quiz
• What license is Hadoop distributed under?
a) Apache License 2.0
b) Mozilla Public License
c) Shareware
d) Commercial
• Which of the following genres does Hadoop produce?
a) Distributed file system
b) JAX-RS
c) Java Message Service
d) Relational Database Management System

Recap
• NoSQL database, also called Not Only SQL, is an approach to data
management and database design that's useful for very large sets of
distributed data.

Topic Name (CO2)
• NoSQL Data Models

• NoSQL Data Models
• Aggregate Data Models

NoSQL Data Model (CO2)
• NoSQL databases have a very different model. For example, a
document-oriented NoSQL database takes the data you want to store
and aggregates it into documents using the JSON format.
• Each JSON document can be thought of as an object to be used by
your application.
• A JSON document might, for example, take all the data stored in a
row that spans 20 tables of a relational database and aggregate it into
a single document/object.

• Another major difference is that relational technologies have rigid
schemas while NoSQL models are schemaless.
• Rigid schemas are the exact opposite of the behavior desired in the
Big Data era, where application developers need to constantly – and
rapidly – incorporate new types of data to enrich their apps.

Aggregate Data Model in NoSQL (CO2)
• Data Model: A data model is the model through which we perceive
and manipulate our data.
• Relational Data Model: The relational model takes the information
that we want to store and divides it into tuples.
• Aggregate Model: Aggregate is a term that comes from Domain-Driven
Design. An aggregate is a collection of related objects that we wish
to treat as a unit.

• Atomicity holds within an aggregate.
• Communication with data storage happens in units of aggregates.
Example of Relations and Aggregates
• Let’s assume we have to build an e-commerce website; we are going
to be selling items directly to customers over the web.
• We can use this scenario to model the data using a relational data
store as well as NoSQL data stores and talk about their pros and cons.

• The following figure presents some sample data for this model.

• In relational, everything is properly normalized. We also have
referential integrity. A realistic order system would naturally be
more involved than this.

Again, we have some sample data, which we’ll show in JSON
format as that’s a common representation for data in NoSQL.
// in customers
{
  "id": 1,
  "name": "Martin",
  "billingAddress": [{"city": "Chicago"}]
}
// in orders
{
  "id": 99,
  "customerId": 1,
  "orderItems": [
    {
      "productId": 27,
      "price": 32.45,
      "productName": "NoSQL Distilled"
    }
  ],
  "shippingAddress": [{"city": "Chicago"}],
  "orderPayment": [
    {
      "ccinfo": "1000-1000-1000-1000",
      "txnId": "abelif879rft",
      "billingAddress": {"city": "Chicago"}
    }
  ]
}

• In this model, we have two main aggregates: customer and order.
We’ve used the black-diamond composition marker in UML to show
how data fits into the aggregation structure.
• The customer contains a list of billing addresses; the order contains
a list of order items, a shipping address, and payments.
• The payment itself contains a billing address for that payment.

Aggregate Oriented Databases (CO2)
Aggregate-oriented databases work best when most data interaction
is done with the same aggregate;
Key-value databases
– Stores data that is opaque to the database
– The database cannot see the structure of records
– Application needs to deal with this
– Allows flexibility regarding what is stored (e.g., text or binary
data)
Document databases
– Stores data whose structure is visible to the database
– Imposes limitations on what can be stored
– Allows more flexible access to data (e.g., partial records) via
querying

Aggregate Oriented Databases (CO2)
Both key-value and document databases consist of aggregate records
accessed by ID values.
Column-family databases
– Two levels of access to aggregates (and hence, two parts to the
“key” to access an aggregate’s data)
– ID is used to look up the aggregate record
– Column name: either a label for a value (name) or a key to a list
entry (order id)
– Columns are grouped into column families

Relational Vs Aggregate Data Models (CO2)


Schemaless Databases (CO2)
• A common theme across all the forms of NoSQL databases is that
they are schemaless.
• When you want to store data in a relational database, you first have
to define a schema—a defined structure for the database which says
what tables exist, which columns exist, and what data types each
column can hold.
• Before you store some data, you have to have the schema defined
for it.

Why schemaless?
– A schemaless store also makes it easier to deal with nonuniform
data
– When starting a new development project you don't need to
spend the same amount of time on up-front design of the
schema.
– No need to learn SQL or database-specific tools.
– The rigid schema of a relational database (RDBMS) can make it
harder to push data into the database, because the data has to fit
the schema perfectly.

Pros:
– More freedom and flexibility
– You can easily change your data organization
– You can deal with nonuniform data
Cons:
– A program that accesses data almost always relies on some form
of implicit schema: it assumes that certain fields are present and
carry data with a certain meaning
– The implicit schema is shifted into the application code that
accesses the data

Example:
You store user data in a schemaless database like MongoDB:
// In the database (JSON)
{ "username": "sana123", "email": "sana@example.com", "age": 25 }
In your application code (Python):
def send_email(user):
    # Implicit schema: assumes user["email"] exists and is a string.
    if "@example.com" in user["email"]:
        # email_user is assumed to be defined elsewhere in the application
        email_user(user["email"])
This code assumes:
• user["email"] exists
• It’s a string
• It contains an email
If email is missing or of the wrong type, the code may crash.
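One way to make the implicit schema explicit is to check it before using the field. The following is a minimal sketch, not a full validator: the helper name `get_email` and the extra sample records are hypothetical, and the "@" test is only a crude shape check.

```python
def get_email(user):
    """Return the user's email, or None if the record does not match
    the schema this code implicitly expects."""
    email = user.get("email")            # the field may be absent
    if not isinstance(email, str):       # the field may have the wrong type
        return None
    if "@" not in email:                 # crude shape check, not full validation
        return None
    return email


records = [
    {"username": "sana123", "email": "sana@example.com", "age": 25},
    {"username": "no_email"},
    {"username": "bad_type", "email": 42},
]

for record in records:
    print(record["username"], "->", get_email(record))
```

The database accepts all three records; only the application code decides which of them are usable.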

Distribution Models (CO2)
Multiple servers:
– In NoSQL systems, data is distributed over large clusters.
Single server:
– The simplest model: everything on one machine. Run the database on
a single machine that handles all the reads and writes to the data
store.

Orthogonal aspects of data distribution models (CO2)
Sharding:
• Database sharding is horizontal partitioning of data. Often,
different people are accessing different parts of the dataset.
• In these circumstances we can support horizontal scalability by
putting different parts of the data onto different servers, a
technique that’s called sharding.

Sharding (CO2)
• Sharding puts different parts of the data onto different servers
– Horizontal scalability
– Ideal case: different users all talking to different server nodes
– Data accessed together is kept on the same node, with the
aggregate as the unit of distribution
• Pros: it can improve both reads and writes
• Cons: clusters use less reliable machines, so resilience decreases
• Many NoSQL databases offer auto-sharding
– The database takes on the responsibility of sharding
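A minimal sketch of how a key can be mapped to a shard, assuming simple hash-based placement. The node names and the `shard_for` helper are hypothetical, and real auto-sharding systems use more elaborate schemes (range-based placement, consistent hashing, rebalancing) that are not shown here.

```python
import zlib

SHARDS = ["node-0", "node-1", "node-2"]  # hypothetical server names

def shard_for(aggregate_key):
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    return SHARDS[zlib.crc32(aggregate_key.encode("utf-8")) % len(SHARDS)]

placement = {}
for key in ["order:99", "order:100", "customer:1", "customer:2"]:
    placement.setdefault(shard_for(key), []).append(key)

print(placement)
```

Because the whole aggregate is stored under one key, it always lands on a single node, which matches the rule that data accessed together should live together.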

Sharding (CO2)
Improving performance:
Main rules of sharding:
• Place the data close to where it’s accessed
– Orders for Boston: data in your eastern US data center
– If users in Boston access their data from a server in California, it will
be slower.
– Storing Boston user data in a data center in the eastern US improves
performance.
• Try to keep the load even
– All nodes should get equal amounts of the load

• Put together aggregates that may be read in sequence
– Same order, same node
– If one user’s order is spread across multiple shards, reading or updating it
becomes expensive.
– Placing all items for a single order on the same shard makes reads/writes
faster and simpler.

Master Slave Replication (CO2)
Master
– is the authoritative source for the data
– is responsible for processing any updates to that data
– can be appointed manually or automatically
Slaves
– A replication process synchronizes the slaves with the master
– After a failure of the master, a slave can be appointed as new
master very quickly

Pros and cons of Master-Slave Replication
Pros
– Scales to more read requests:
  – Add more slave nodes
  – Ensure that all read requests are routed to the slaves
Cons
– The master is a bottleneck
  – Limited by its ability to process updates and to pass those
  updates on
  – Its failure eliminates the ability to handle writes until the
  master is restored or a new master is appointed
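The read/write routing described above can be sketched in Python. This is a toy model under stated assumptions: the class names `Node` and `MasterSlaveCluster` are hypothetical, replication is done synchronously inside `write` for simplicity (real systems usually replicate asynchronously), and failover simply promotes the first slave.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}


class MasterSlaveCluster:
    def __init__(self, master, slaves):
        self.master = master
        self.slaves = list(slaves)
        self._next = 0  # round-robin pointer for reads

    def write(self, key, value):
        # All updates go through the master ...
        self.master.data[key] = value
        # ... and are then copied to every slave.
        for slave in self.slaves:
            slave.data[key] = value

    def read(self, key):
        # Reads are spread across the slaves to scale read traffic.
        slave = self.slaves[self._next % len(self.slaves)]
        self._next += 1
        return slave.data.get(key)

    def promote_slave(self):
        # After a master failure, appoint the first slave as the new master.
        self.master = self.slaves.pop(0)


cluster = MasterSlaveCluster(Node("master"), [Node("slave-1"), Node("slave-2")])
cluster.write("phone", "555-0100")
print(cluster.read("phone"))
cluster.promote_slave()
print(cluster.master.name)
```

Adding slaves raises read capacity, but every write still passes through the single master, which is the bottleneck listed under the cons.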

Peer to Peer Replication (CO2)

• All the replicas have equal weight, they can all accept writes
• The loss of any of them doesn’t prevent access to the data store.
Pros and cons of peer-to-peer replication
Pros:
– you can ride over node failures without losing access to data
– you can easily add nodes to improve your performance
Cons:
– Inconsistency
– Slow propagation of changes to copies on different nodes

Sharding and Replication on Master-Slave (CO2)
• Replication and sharding are strategies that can be combined.
• If we use both master-slave replication and sharding, this means that
we have multiple masters, but each data item has only a single
master.
Two schemes:
– A node can be a master for some data and a slave for other data
– Nodes are dedicated to master or slave duties


Sharding and Replication on P2P (CO2)
• Using peer-to-peer replication and sharding is a common strategy
for column-family databases.
• Usually each shard is present on three nodes.


Replication (CO2)
Key Points:
– Sharding distributes different data across multiple servers, so
each server acts as the single source for a subset of data.
– Replication copies data across multiple servers, so each bit of
data can be found in multiple places.
A system may use either or both techniques. Replication comes in
two forms:
– Master-slave replication makes one node the authoritative copy
that handles writes while slaves synchronize with the master and
may handle reads.
– Peer-to-peer replication allows writes to any node; the nodes
coordinate to synchronize their copies of the data.

Recap (CO2)
• Replication and sharding are strategies that can be combined.
• With master-slave replication and sharding, we have multiple masters,
but each data item has only a single master.
Two schemes:
– A node can be a master for some data and a slave for other data
– Nodes are dedicated to master or slave duties

Topic Name (CO2)
• Consistency and Version Stamp
• Positioning and Combining

• Consistency
• Version Stamp
• Positioning and Combining

Consistency (CO2)
• The consistency property ensures that any transaction will bring the
database from one valid state to another.
• Relational databases have strong consistency, whereas NoSQL
systems mostly have eventual consistency.
• ACID: A DBMS is expected to support “ACID transactions,”
processes that provide:
– Atomicity: either the whole process is done or none of it is
– Consistency: only valid data are written
– Isolation: transactions behave as if they were executed one at a time
– Durability: once committed, it stays that way

Various forms of Consistency (CO2)
Update Consistency (or write-write conflict):
• Martin and Pramod are looking at the company website and notice
that the phone number is out of date. Incredibly, they both have
update access, so they both go in at the same time to update the
number.

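Update Consistency (or write-write conflict):
Solutions:
– Pessimistic approach: prevent conflicts from occurring
– Optimistic approach: let conflicts occur, but detect them and
take action to sort them out
Approaches:
– Conditional updates: test the value just before updating to see
whether it has changed since the last read
– These approaches rely on a consistent ordering of updates, so they
do not work as-is if there’s more than one server (peer-to-peer
replication)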
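A conditional update can be sketched as a compare-and-set on a version number. This is a single-server toy model: the class `VersionedRecord` and the phone numbers are hypothetical, and the sketch ignores locking and multi-node coordination.

```python
class VersionedRecord:
    def __init__(self, value):
        self.value = value
        self.version = 1

    def conditional_update(self, expected_version, new_value):
        # Apply the write only if nobody changed the record since it was read.
        if self.version != expected_version:
            return False              # conflict detected; caller must re-read
        self.value = new_value
        self.version += 1
        return True


phone = VersionedRecord("555-0100")

# Martin and Pramod both read version 1, then both try to write.
martin_seen = phone.version
pramod_seen = phone.version

print(phone.conditional_update(martin_seen, "555-0199"))  # first write is applied
print(phone.conditional_update(pramod_seen, "555-0188"))  # stale version is rejected
```

The second writer is told that the update failed, so the lost update is detected instead of silently overwriting the first one.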

Read Consistency (or read-write conflict)
• Alice and Bob are using Ticketmaster website to book tickets for a
specific show.
• Only one ticket is left for the specific show. Alice signs on to
Ticketmaster first and finds one left, and finds it expensive. Alice
takes time to decide.


Replication consistency
• Let’s imagine there’s one last hotel room for a desirable event. The
hotel reservation system runs on many nodes.
• This is another inconsistent read—but it’s a breach of a different
form of consistency we call replication consistency: ensuring that
the same data item has the same value when read from different
replicas.


Eventual consistency:
• At any time, nodes may have replication inconsistencies but, if there
are no further updates, eventually all nodes will be updated to the
same value.
• In other words, eventual consistency is a consistency model used in
NoSQL databases to achieve high availability. It informally
guarantees that, if no new updates are made to a given data item,
eventually all accesses to that item will return the last updated value.

Version stamp
In Big Data systems (especially distributed databases), a
version stamp is a technique used to track changes and
resolve conflicts in replicated or distributed data.
What Is a Version Stamp?
A version stamp is a marker (often a number, timestamp, or
vector) assigned to a piece of data that changes every time the
data is updated.
Purpose: To know which version of the data is the most recent
and to detect or resolve conflicts when the same data is
updated in different places.

Why Is It Needed in Big Data?
• In distributed systems, data is stored across multiple
machines (nodes). Updates might happen:
• At different times
• On different nodes
• Simultaneously (leading to conflicts)
A version stamp helps to:
• Identify the latest version
• Ensure eventual consistency
• Support conflict resolution

Common Types of Version Stamps
Type | Description
Timestamps | Attach a system time to each update (e.g., 2025-07-04T14:00:00Z)
Incrementing Counters | Each update increases a version number (e.g., v1 → v2 → v3)
Vector Clocks | Advanced version tracking across multiple nodes to detect concurrent changes
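To show how vector clocks detect concurrent changes, here is a minimal Python sketch. It assumes each clock is a dictionary from node name to counter; the function name `compare` and its return labels are hypothetical choices for this illustration.

```python
def compare(clock_a, clock_b):
    """Compare two vector clocks given as {node_name: counter} dicts.

    Returns "equal", "a_before_b", "b_before_a", or "concurrent".
    """
    nodes = set(clock_a) | set(clock_b)
    a_le_b = all(clock_a.get(n, 0) <= clock_b.get(n, 0) for n in nodes)
    b_le_a = all(clock_b.get(n, 0) <= clock_a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "a_before_b"
    if b_le_a:
        return "b_before_a"
    return "concurrent"   # neither dominates: a genuine conflict


print(compare({"A": 1}, {"A": 2}))
print(compare({"A": 2, "B": 1}, {"A": 1, "B": 2}))
```

When neither clock dominates the other, the two versions were written independently and the system (or the application) has to resolve the conflict.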

Example Scenario: Version Stamp with Timestamps
• Let’s say you have a product record in an e-commerce system
stored in three replicas:
– Replica A: Price = $100, Timestamp = 12:00 PM
– Replica B: Price = $90, Timestamp = 12:05 PM
– Replica C: Price = $100, Timestamp = 12:00 PM
• The system compares the version stamps (timestamps) and
determines Replica B has the most recent version.
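The comparison in this scenario can be sketched in Python as a "last write wins" rule. The data mirrors the three replicas above; the function name `latest_version` is hypothetical, and the sketch assumes the replicas' clocks are synchronized well enough for timestamps to be comparable.

```python
from datetime import time

replicas = {
    "A": {"price": 100, "timestamp": time(12, 0)},
    "B": {"price": 90, "timestamp": time(12, 5)},
    "C": {"price": 100, "timestamp": time(12, 0)},
}

def latest_version(replica_values):
    # Pick the replica whose version stamp (timestamp) is the largest.
    return max(replica_values.items(), key=lambda item: item[1]["timestamp"])

name, record = latest_version(replicas)
print(name, record["price"])
```

Timestamps are simple to compare, but they cannot tell a genuine conflict from an ordinary newer write, which is the gap vector clocks are meant to fill.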

Used In:
System | How Versioning Is Used
Apache Cassandra | Uses timestamps to resolve conflicting writes
Amazon Dynamo (the design behind DynamoDB) | Described as using vector clocks to track versions
HBase | Each cell can store multiple versions (by timestamp)
MongoDB | Supports write concern and timestamps for consistency

Relaxing Consistency (CO2)
• The CAP Theorem: The basic statement of the CAP theorem is that,
given the three properties of Consistency, Availability, and Partition
tolerance, you can only get two.
• Consistency (C): Every read receives the most recent write or an
error. All nodes see the same data at the same time.
• Availability (A): Every request (read or write) receives a response,
without guarantee that it contains the most recent data.
• Partition Tolerance (P): The system continues to operate despite
arbitrary message loss or failure of part of the system (network
partitions).


Network Partition (CO2)
• The CAP theorem states that if you get a network partition, you have
to trade off availability of data versus consistency.
• Very large systems will “partition” at some point.

CA System (CO2)
A single-server system is the obvious example of a CA system:
– CA cluster: if a partition occurs, all the nodes would go down
– A failed, unresponsive node doesn’t imply a lack of CAP
availability
• A CA system provides Consistency and Availability but is NOT
partition tolerant. CA means:
– The system always responds (Availability)
– All nodes have consistent, up-to-date data (Consistency)
• But it cannot tolerate partitions (network failures between nodes).
An example
– Ann is trying to book a room at the Ace Hotel in New York on a
node of a booking system located in London
– Pathin is trying to do the same on a node located in Mumbai

CA System (CO2)
Possible solutions
– CP: Neither user can book any hotel room, sacrificing
availability
– CAP: Designate the Mumbai node as the master for the Ace Hotel

Map Reduce (CO2)
• MapReduce is a way to take a big task and divide it into discrete
tasks that can be done in parallel.
• A common use case for MapReduce is in document databases.
• A MapReduce program is composed of a Map() procedure that
performs filtering and sorting and a Reduce() procedure that
performs a summary operation.
• "Map" step
• "Reduce" step

Map Reduce (CO2)
Logical view
• The Map function is applied in parallel to every key-value pair in
the input dataset:
• Map(k1,v1) → list(k2,v2)
• The framework then collects all pairs with the same key k2 and
groups them together, creating one group per key.
• The Reduce function is then applied in parallel to each group, which in
turn produces a collection of values in the same domain:
• Reduce(k2, list (v2)) → list(v3)
• Each Reduce call typically produces either one value v3 or an empty
return
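The logical view can be illustrated with a small word-count sketch in Python. It runs sequentially in one process, so it only shows the data flow, not the parallelism; the function names `map_fn`, `reduce_fn`, and `map_reduce` and the sample documents are hypothetical.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # Map(k1, v1) -> list(k2, v2): emit (word, 1) for every word.
    return [(word.lower(), 1) for word in text.split()]

def reduce_fn(word, counts):
    # Reduce(k2, list(v2)) -> list(v3): here, a single total per word.
    return [sum(counts)]

def map_reduce(inputs):
    groups = defaultdict(list)
    for doc_id, text in inputs.items():          # map phase
        for key, value in map_fn(doc_id, text):
            groups[key].append(value)            # group values by key
    return {key: reduce_fn(key, values) for key, values in groups.items()}

documents = {1: "NoSQL scales out", 2: "SQL scales up", 3: "NoSQL and SQL"}
result = map_reduce(documents)
print(result)
```

In a real framework the map calls run on many nodes at once, and the grouping step moves data across the network before the reducers start.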

Map Reduce (CO2)
Multistage map-reduce calculations: Let us say that we have a set of
documents and its attributes with the following form:
{
  "type": "post",
  "name": "Raven's Map/Reduce functionality",
  "blog_id": 1342,
  "post_id": 29293921,
  "tags": ["raven", "nosql"],
  "post_content": "<p>...</p>",
  "comments": [
    {
      "source_ip": "124.2.21.2",
      "author": "martin",
      "text": "excellent blog..."
    }
  ]
}

Map Reduce (CO2)
• Let us see how this works. We start by applying the map query to the
set of documents that we have, producing this output:

Map Reduce (CO2)
• The next step is to start reducing the results. In real Map/Reduce
algorithms, we partition the original input and work toward the
final result.


Map Reduce (CO2)
• And the final step is:

RDBMS compared to MapReduce (CO2)
• MapReduce is a good fit for problems that need to analyze the
whole dataset, in a batch fashion, particularly for ad hoc analysis.
• MapReduce suits applications where the data is written once, and
read many times, whereas a relational database is good for datasets
that are continually updated.


Partitioning and Combining (CO2)
• In the simplest form, we think of a map-reduce job as having a
single reduce function.
• The outputs from all the map tasks running on the various nodes are
concatenated together and sent into the reduce.

• A reduce function operates on the values for a single key, so
reducers for different keys can run in parallel. To take advantage
of this, the results of the mapper are divided up based on the key
on each processing node.
• Typically, multiple keys are grouped together into partitions. The
framework then takes the data from all the nodes for one partition,
combines it into a single group for that partition, and sends it off to a
reducer.
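A minimal sketch of partitioning mapper output by key, assuming a simple hash-modulo partitioner. The number of partitions, the sample keys, and the `partition_for` helper are hypothetical; frameworks let you plug in other partitioning functions.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 2  # hypothetical number of reducers

def partition_for(key):
    # Stable hash so the same key always lands in the same partition.
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS

# Mapper output as produced on two different nodes.
node_outputs = [
    [("puerh", 3), ("dragonwell", 1)],
    [("puerh", 2), ("genmaicha", 4)],
]

partitions = defaultdict(lambda: defaultdict(list))
for output in node_outputs:
    for key, value in output:
        partitions[partition_for(key)][key].append(value)

# Each partition can now be handed to its own reducer.
totals = {
    key: sum(values)
    for groups in partitions.values()
    for key, values in groups.items()
}
print(totals)
```

All values for one key end up in the same partition regardless of which node produced them, so each reducer sees the complete group for its keys.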

• Reduce Partitioning Example:

• Combinable Reducer Example:

• Combinable Reducer:
A combiner function is, in essence, a reducer function—indeed, in
many cases the same function can be used for combining as the final
reduction. The reduce function needs a special shape for this to
work: Its output must match its input. We call such a function a
combinable reducer.
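A sum is a simple case of a combinable reducer, sketched here in Python. The function name `sum_reducer` and the sample values are hypothetical; the point is only that combining per node and then reducing gives the same answer as reducing everything at once.

```python
def sum_reducer(key, values):
    # Output (key, number) has the same shape as each input (key, number),
    # so this reducer can also serve as a combiner.
    return key, sum(values)

# Mapper output for one key on two nodes.
node1_values = [3, 4]
node2_values = [5]

# Combine locally on each node, then reduce the combined results.
_, partial1 = sum_reducer("puerh", node1_values)
_, partial2 = sum_reducer("puerh", node2_values)
combined = sum_reducer("puerh", [partial1, partial2])

# Reduce everything at once, without a combiner.
direct = sum_reducer("puerh", node1_values + node2_values)

print(combined, direct)
```

A reducer that computes an average from raw values would not be combinable in this form, because the average of partial averages is not, in general, the overall average.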

Daily Quiz
• Mapper and Reducer implementations can use the ________ to
report progress or just indicate that they are alive.
a) Partitioner
b) OutputCollector
c) Reporter
d) All of the mentioned
• _________ is the primary interface for a user to describe a
MapReduce job to the Hadoop framework for execution.
a) Map Parameters
b) JobConf
c) MemoryConf
d) None of the mentioned

Recap
• The consistency property ensures that any transaction will bring the
database from one valid state to another.
• Relational databases have strong consistency, whereas NoSQL
systems mostly have eventual consistency.

Faculty Video Links, Youtube & NPTEL Video Links and Online
Courses Details
• https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
• https://www.tutorialspoint.com/hadoop/hadoop_mapreduce.htm
• https://www.sanfoundry.com/mapreduce-questions-answers/

Weekly Assignment 1
Q:1 Explain the concept of NoSQL in Big Data.
Q:2 Give the difference between a relational database and a NoSQL
database.
Q:3 Explain the various aggregate data models:
– Key-Value Storage
– Document Storage
– Graph Storage
– Column Storage
Q:4 Explain the schemaless database. Also explain the properties
of a schemaless database.

Weekly Assignment 2
Q:1 Write a short note on the following terms:
– Materialized Views
– Distribution Models
– Sharding
Q:2 Explain the following terms:
– Version Stamps
– Map Reduce Calculations
– Partitioning and Combining
– Consistency
Q:3 Explain Master-Slave Replication with the help of a suitable
example.
Q:4 Explain Peer-to-Peer Replication with the help of a suitable
example.

MCQs
• Point out the correct statement.
a) MapReduce tries to place the data and the compute as close as
possible
b) Map Task in MapReduce is performed using the Mapper() function
c) Reduce Task in MapReduce is performed using the Map() function
d) All of the mentioned
• Point out the correct statement.
a) Hadoop is an ideal environment for extracting and transforming
small volumes of data
b) Hadoop stores data in HDFS and supports data
compression/decompression
c) The Giraph framework is less useful than a MapReduce job to solve
graph and machine learning
d) None of the mentioned


Old Question Papers


Expected Questions for University Exam
Q:1 Write a short note on the following terms:
– Materialized Views
– Distribution Models
– Sharding
Q:2 Explain the following terms:
– Version Stamps
– Map Reduce Calculations
– Partitioning and Combining
– Consistency
Q:3 Explain Master-Slave Replication with the help of a suitable
example.
Q:4 Explain Peer-to-Peer Replication with the help of a suitable
example.

Expected Questions for University Exam
Q:5 Explain the concept of NoSQL in Big Data.
Q:6 Give the difference between a relational database and a NoSQL
database.
Q:7 Explain the various aggregate data models:
– Key-Value Storage
– Document Storage
– Graph Storage
– Column Storage
Q:8 Explain the schemaless database. Also explain the properties
of a schemaless database.

Summary
• NoSQL database, also called Not Only SQL, is an approach to data
management and database design that's useful for very large sets of
distributed data.
• SQL databases have a predefined schema, whereas NoSQL databases
have a dynamic schema for unstructured data.
• There are four general types of NoSQL databases, each with their
own specific attributes:
– Key-Value Storage
– Document Storage
– Graph Storage
– Column Storage

Thank You
