Big Data: RDBMS vs. Hadoop vs. Spark

RDBMS Vs Hadoop Vs Spark
PRESENTED BY
AMIT SK
ANOUSHKA VIEGAS
GRAISY BISWAL
KRISHNA KUMAR
Group 3

Agenda
Common Definitions of Big Data
The Relationship Between Big Data & RDBMS, Spark and Hadoop
Introduction & Core Competencies
Functional Differences & Applications
Benefits over Each Other
RDBMS Vs Hadoop Vs Spark

Common Definitions of Big Data
SAS
Big data is a term that describes the large volume of data, both structured and unstructured, that inundates a business on a day-to-day basis.
ORACLE
The definition of big data is data that contains greater variety, arriving in increasing volumes and with more velocity. This is also known as the three Vs.
IBM
Big data analytics uses advanced analytic techniques against large, diverse data sets that include structured and unstructured data, from different sources, and in different sizes.
SUMMARIZING
Big Data is characterized by the 3 Vs (variety, volume, velocity). It can be structured, semi-structured, or unstructured, and it needs advanced analytical systems and techniques to get any value out of it.

The Relationship Between Big Data & RDBMS, Spark and Hadoop
HADOOP AND SPARK
Apache Hadoop and Spark are the two major frameworks on which Big Data processing runs. They are the two titans of the Big Data world.
Each framework has its pros and cons, and a company should weigh them carefully to choose the right one.
RELATIONAL DATABASE MANAGEMENT SYSTEM
An RDBMS belongs to the traditional information management stack and stores and manages structured data. It has been in use since the '80s and, in recent times, is often compared with Big Data frameworks.

Introduction
RDBMS
A relational database is a type of database that stores and provides access to data points that are related to one another.
Each row in the table is a record with a unique ID called the key.
A relational database can be considered for any information need in which data points relate to each other and must be managed in a secure, rules-based, consistent way.
HADOOP
An open-source platform developed by Doug Cutting and Michael J. Cafarella.
It is a Java-based framework, mostly used for storing and processing big data.
Data is stored on low-cost servers clustered together.
Its distributed file system supports concurrent processing and fault tolerance.
SPARK
It is an open-source cluster computing framework designed for fast computation.
It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
The main feature of Spark is in-memory cluster computing, which increases the processing speed of an application.

Core Competencies
RDBMS
Built to process structured data, an RDBMS provides indexing that facilitates fast data retrieval.
A foreign key links a column in one table to the key of another, which keeps the relationships between tables consistent.
Multi-user access and virtual tables (views) are other core competencies.
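The indexing, foreign-key, and virtual-table ideas above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table, index, and view names (customers, orders, idx_orders_customer, customer_totals) and the sample rows are hypothetical, and SQLite stands in for any RDBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE customers (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE orders (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),  -- foreign key
    amount      REAL NOT NULL
);
-- Index to speed up lookups of orders by customer.
CREATE INDEX idx_orders_customer ON orders(customer_id);
-- View: a virtual table computed from the two base tables.
CREATE VIEW customer_totals AS
    SELECT c.name AS name, SUM(o.amount) AS total
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.id;
""")

conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Asha"), (2, "Ben")])
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(1, 20.0), (1, 5.0), (2, 7.5)],
)

# Query the view exactly like a table.
totals = dict(conn.execute("SELECT name, total FROM customer_totals").fetchall())
print(totals)

# An order that points at a non-existent customer is rejected by the foreign key.
try:
    conn.execute("INSERT INTO orders (customer_id, amount) VALUES (99, 1.0)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
print(orphan_rejected)
```

The view stores no data of its own; it is recomputed from customers and orders each time it is queried.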
HADOOP
High scalability, unlike an RDBMS, makes Hadoop a popular big data processing tool.
Fault tolerance is provided via replication across multiple DataNodes.
It can deal with all kinds of data.
Data locality, which moves processing closer to the data, makes processing fast and reduces network traffic.
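To make the replication point concrete, here is a toy sketch in plain Python, not Hadoop code. It assumes a replication factor of 3 (the HDFS default) and a simple round-robin placement; real HDFS placement is rack-aware and more involved. The node names and the helper functions place_blocks and readable_blocks are hypothetical.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin (a simplification)."""
    placement = {}
    for block in range(num_blocks):
        first = block % len(nodes)
        placement[block] = [nodes[(first + i) % len(nodes)] for i in range(replication)]
    return placement


def readable_blocks(placement, failed):
    """Return the blocks that still have at least one replica on a live node."""
    return [b for b, holders in placement.items() if any(n not in failed for n in holders)]


nodes = ["dn1", "dn2", "dn3", "dn4"]
placement = place_blocks(6, nodes)

# With 3 replicas spread over 4 nodes, losing any 2 nodes leaves every block readable.
survivors = readable_blocks(placement, failed={"dn2", "dn3"})
print(len(survivors), "of", len(placement), "blocks still readable")
```

Data only becomes unavailable once every node holding a given block has failed, which is why replication across DataNodes gives fault tolerance on low-cost hardware.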
SPARK
Spark provides faster data processing than Hadoop MapReduce through the use of Resilient Distributed Datasets (RDDs).
It gives programmers the flexibility to write applications in their preferred language, with APIs for Scala, Java, Python, and R.
Beyond MapReduce, Spark also supports ML algorithms and SQL queries.
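The RDD idea can be illustrated with a toy class in plain Python. This is not the Spark API: MiniRDD and its methods are hypothetical names, and the sketch simplifies heavily (for example, cache() here computes eagerly, whereas Spark caches lazily). It only shows the two properties that matter: transformations are recorded as lineage and results are kept in memory, so a lost partition can be rebuilt from the source data.

```python
class MiniRDD:
    """Toy stand-in for an RDD: source partitions plus a lineage of lazy transformations."""

    def __init__(self, partitions, lineage=()):
        self.partitions = partitions
        self.lineage = lineage
        self._cache = None

    def map(self, fn):
        # Nothing is computed here; the step is only recorded.
        return MiniRDD(self.partitions, self.lineage + (("map", fn),))

    def filter(self, pred):
        return MiniRDD(self.partitions, self.lineage + (("filter", pred),))

    def _compute(self, partition):
        items = list(partition)
        for kind, fn in self.lineage:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items

    def cache(self):
        # Keep the computed partitions in memory for reuse.
        self._cache = [self._compute(p) for p in self.partitions]
        return self

    def collect(self):
        parts = self._cache
        if parts is None:
            parts = [self._compute(p) for p in self.partitions]
        return [x for part in parts for x in part]

    def recover(self, index):
        """Rebuild one lost cached partition by replaying the lineage on its source data."""
        self._cache[index] = self._compute(self.partitions[index])


rdd = MiniRDD([[1, 2, 3], [4, 5, 6]]).map(lambda x: x * x).filter(lambda x: x % 2 == 0).cache()
before = rdd.collect()

rdd._cache[1] = None  # simulate losing the second in-memory partition
rdd.recover(1)        # recompute only that partition from its lineage
after = rdd.collect()
print(before, after)
```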

Functional Differences
RDBMS
Contains any number of tables, and each table has its own primary key.
A row is known as a record and holds the information about an individual entry.
A column is known as a field and holds one specific attribute of every record.
When a user runs a query, the database returns only the results that match it.
Important concepts in a relational DBMS: Not Null, Unique, Check, Primary Key, Foreign Key, and Data Integrity.
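As a rough illustration of the constraints listed above, the following sketch uses Python's built-in sqlite3 module to show a database rejecting rows that break Primary Key, Unique, Not Null, and Check rules. The accounts table, its columns, and the try_insert helper are hypothetical, chosen only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE accounts (
    id      INTEGER PRIMARY KEY,
    email   TEXT NOT NULL UNIQUE,
    balance REAL NOT NULL CHECK (balance >= 0)
)
""")


def try_insert(row):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO accounts (id, email, balance) VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


results = {
    "valid row": try_insert((1, "a@example.com", 10.0)),
    "duplicate primary key": try_insert((1, "b@example.com", 5.0)),
    "duplicate email (UNIQUE)": try_insert((2, "a@example.com", 5.0)),
    "missing email (NOT NULL)": try_insert((3, None, 5.0)),
    "negative balance (CHECK)": try_insert((4, "d@example.com", -1.0)),
}
for label, accepted in results.items():
    print(f"{label}: {'accepted' if accepted else 'rejected'}")
```

Only the first row is stored; the database itself, not the application, enforces the rules.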
HADOOP
Hadoop runs applications using the MapReduce algorithm, processing data in parallel on different nodes of a cluster.
The core of Hadoop consists of a storage part, the Hadoop Distributed File System (HDFS), and a processing part, the MapReduce programming model.
Hadoop splits files into large blocks and distributes them across the nodes in a cluster, then transfers packaged code to those nodes to process the data in parallel.
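The MapReduce model described above can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle, and reduce phases using the classic word count, not Hadoop code; the function names and the sample input splits are assumptions made for the example. In a real cluster each split would be mapped on the node that stores it.

```python
from collections import defaultdict


def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


# Each string stands in for one block of a large file.
splits = ["big data big clusters", "data locality", "big data"]

mapped = [pair for split in splits for pair in map_phase(split)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```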
SPARK
Spark extends Hadoop's MapReduce model to efficiently support more types of computation, including interactive queries and stream processing.
Spark can use Hadoop in two ways: for storage and for processing. Since Spark has its own cluster management, it typically uses Hadoop for storage only.
Spark also supports SQL queries, streaming data, machine learning, and graph algorithms.

Applications
RDBMS
Can be considered for any information need in which data points relate to each other and must be managed in a secure, rules-based, consistent way.
Relational databases are used to track inventories, process e-commerce transactions, and manage huge amounts of mission-critical customer information.
When a customer deposits money at an ATM and then looks at the account balance on a mobile phone, the customer expects to see that deposit reflected immediately in an updated account balance. An RDBMS excels at this.
HADOOP
Finance sector - fraud detection and prevention.
Security and law enforcement - preventing terrorist attacks and detecting and preventing cyber-attacks.
Retail industry - keeping track of customer behaviour.
Real-time analysis of customer data - it can track and process high volumes of clickstream data.
Other applications: advertisement targeting platforms, sentiment analysis, financial trading and forecasting, healthcare, and optimizing machine performance.
SPARK
Processing streaming data - it unifies disparate data processing capabilities, allowing developers to use a single framework for all their processing requirements.
Machine learning - it comes with an integrated framework for advanced analytics that lets you run repeated queries on datasets.
Network security - security providers can inspect data packets for traces of malicious activity.
Fog computing - as IoT continues to expand, there is a growing need for a scalable, distributed, parallel processing system that can handle vast amounts of data.

Benefits over Each Other
RDBMS
Commitment and atomicity - relational databases handle business rules and policies at a very granular level, with strict policies about commitment (that is, making a change to the database permanent).
Database locking and concurrency - conflicts can arise in a database when multiple users or applications attempt to change the same data at the same time. Locking and concurrency techniques reduce the potential for conflicts while maintaining the integrity of the data.
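The commitment and atomicity points can be sketched with Python's built-in sqlite3 module. The example below is a minimal illustration, assuming a hypothetical accounts table and transfer helper: a transfer either commits both of its updates or rolls both back. It demonstrates atomicity only, not locking between concurrent users.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance REAL NOT NULL CHECK (balance >= 0))"
)
with conn:
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100.0), ("bob", 50.0)])


def transfer(src, dst, amount):
    """Move money atomically: both updates are committed, or neither is."""
    try:
        with conn:  # one transaction: commit on success, rollback on error
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst))
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src))
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint blocked an overdraft, so the credit was undone too.
        return False


def balances():
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))


ok = transfer("alice", "bob", 30.0)       # succeeds and is committed
failed = transfer("alice", "bob", 500.0)  # would overdraw alice, so it is rolled back
print(ok, failed, balances())
```

After the failed transfer, bob's balance does not include the 500.0 credit, even though that update ran before the error occurred.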
HADOOP
Data storage - an RDBMS handles moderately sized data sets, whereas Hadoop handles large ones.
Speed - in an RDBMS reads are fast, whereas in Hadoop both reads and writes are fast.
Cost - a commercial RDBMS requires a costly licence, and Spark relies on in-memory computation for real-time data processing (which needs a lot of RAM), making Hadoop the cheapest of the three.
Integrity - an RDBMS has higher integrity than Hadoop (a strength, since reliable data processing depends on it).
SPARK
Performance: Spark is faster because it uses random access memory (RAM) instead of reading and writing intermediate data to disk. Hadoop stores data across multiple sources and processes it in batches via MapReduce.
Security: Spark provides authentication via a shared secret and event logging, whereas Hadoop uses multiple authentication and access control methods. Although Hadoop is more secure overall, Spark can integrate with Hadoop to reach a higher security level.

RDBMS Vs Hadoop Vs Spark
RDBMS - Pros
Tabular data structure
Secure network
SQL support
Automated maintenance
Multi-user access
Authorization control
RDBMS - Cons
Maintenance cost
Field length limits
Slower than many other databases
High physical memory use
Data complexity and data loss risk
HADOOP - Pros
Resilience - there is always a backup of the data available in the cluster.
Scalability - the setup can easily be expanded with more servers to store additional data.
Low cost
Speed
Data diversity
HADOOP - Cons
Steep learning curve
Different datasets require different approaches
Limitations of MapReduce
Data security
SPARK - Pros
Dynamic in nature - around 80 high-level operators
Powerful - low-latency in-memory data processing
Advanced analytics
Reusable code for batch processing and ad-hoc queries
Real-time stream processing
SPARK - Cons
No file management system
Few algorithms
Small files issue
No automatic optimization process
Not friendly to multi-user environments
