RDBMS Vs
Hadoop Vs
Spark
PRESENTED BY
AMIT SK
ANOUSHKA VIEGAS
GRAISY BISWAL
KRISHNA KUMAR
Group 3
Agenda
Common Definitions of Big Data
The Relationship Between Big Data &
RDBMS, Spark and Hadoop
Introduction & Core Competencies
Functional Differences & Applications
Benefits over Each Other
RDBMS Vs Hadoop Vs Spark
Common Definitions of Big Data
SAS
Big data is a term
that describes the
large volume of data,
both structured and
unstructured, that
inundates a business
on a day-to-day
basis.
ORACLE
The definition of big
data is data that
contains greater
variety, arriving in
increasing volumes
and with more
velocity. Also known
as the three Vs.
IBM
Big data analytics
uses advanced
analytic techniques
against large,
diverse data sets
that include
structured and
unstructured data,
from different
sources, and in
different sizes.
SUMMARIZING
Big Data is
characterized by the
3Vs, can be structured,
semi-structured, or
unstructured, and
needs advanced
analytical systems
and techniques to
extract value
from it.
The Relationship Between Big Data
& RDBMS, Spark and Hadoop
HADOOP AND SPARK
Apache Hadoop and Apache Spark are the two major frameworks used
for Big Data processing. They are the two titans of the Big
Data world.
Each framework has its pros and cons, and a company should choose
the one that fits its workload.
RELATIONAL DATABASE MANAGEMENT SYSTEM
An RDBMS fits into the traditional information
management system that stores and manages structured
data. It has been in use since the '80s and in recent times is
often compared with Big Data tools.
Introduction
RDBMS
A relational database is a type of
database that stores and
provides access to data points
that are related to one another.
Each row in the table is a record
with a unique ID called the key.
A relational database can be
considered for any information
need in which data points relate
to each other and must be
managed in a secure, rules-
based, consistent way.
HADOOP
An open-source platform
developed by Doug Cutting and
Michael J. Cafarella.
It is a Java-based framework and is
mostly used for storing and
processing big data.
Data is stored on low-cost
servers clustered together.
Its distributed file system helps with
concurrent processing and fault
tolerance.
SPARK
It is an open-source cluster-
computing framework designed for
fast computation.
It provides an interface for
programming entire clusters with
implicit data parallelism and fault
tolerance.
The main feature of Spark is in-
memory cluster computing that
increases the speed of an
application.
Core Competencies
RDBMS
Built to process structured
data, an RDBMS provides
indexing that facilitates fast data
retrieval.
A foreign key lets one table
reference a column in another,
which maintains the relationships
between tables.
Multi-user accessibility and
virtual table creation are other
core competencies.
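The three competencies above (indexing, foreign keys, and virtual tables, i.e. views) can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table names, column names, and sample rows are hypothetical and not part of the original slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical schema: orders reference customers through a foreign key.
conn.executescript("""
CREATE TABLE customers (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE orders (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    amount      REAL NOT NULL
);
-- Index for fast lookup of a customer's orders.
CREATE INDEX idx_orders_customer ON orders(customer_id);
-- View ("virtual table") that is computed from the base tables on demand.
CREATE VIEW customer_totals AS
    SELECT c.name AS name, SUM(o.amount) AS total
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.id;
""")

conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Asha"), (2, "Ben")])
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 1, 20.0), (2, 1, 5.5), (3, 2, 12.0)],
)

# Query the view like an ordinary table.
totals = conn.execute("SELECT name, total FROM customer_totals ORDER BY name").fetchall()

# Ask the query planner whether the index is used for a lookup by customer.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 1"
).fetchall()
uses_index = any("idx_orders_customer" in row[-1] for row in plan)

# An order that points at a non-existent customer is rejected.
try:
    conn.execute("INSERT INTO orders VALUES (4, 99, 1.0)")
    orphan_accepted = True
except sqlite3.IntegrityError:
    orphan_accepted = False

print(totals, uses_index, orphan_accepted)
```

SQLite is used here only because it ships with Python; the same ideas apply to any relational database.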
HADOOP
High scalability, unlike in an
RDBMS, makes Hadoop a popular big
data processing tool.
Fault tolerance is provided via
replication on multiple
DataNodes.
Can deal with all kinds of data.
Data locality that moves
processing closer to data makes
the processing fast and reduces
most
SPARK
Spark provides faster data
processing than Hadoop
through the use of Resilient
Distributed Datasets (RDDs).
It provides flexibility to
programmers to write
applications in their preferred
language.
Beyond MapReduce, Spark also
supports ML algorithms and SQL
queries.
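The RDD idea mentioned above can be illustrated with a toy model in plain Python. This is not Spark's API: `MiniRDD` and all of its methods are hypothetical names, and the sketch runs in a single process. It only shows two properties usually associated with RDDs: transformations are recorded lazily, and a cached result is kept in memory and reused instead of being recomputed.

```python
class MiniRDD:
    """Toy stand-in for an RDD (hypothetical, not the Spark API)."""

    def __init__(self, compute):
        # Zero-argument function that rebuilds the data from its parents.
        self._compute = compute
        self._cached = None
        self._use_cache = False

    @classmethod
    def from_list(cls, items):
        return cls(lambda: list(items))

    def map(self, fn):
        # Nothing runs here; we only record how to produce the result.
        return MiniRDD(lambda: [fn(x) for x in self.collect()])

    def filter(self, pred):
        return MiniRDD(lambda: [x for x in self.collect() if pred(x)])

    def cache(self):
        # Mark this dataset to be kept in memory after first computation.
        self._use_cache = True
        return self

    def collect(self):
        # "Action": triggers the actual computation.
        if self._use_cache and self._cached is not None:
            return self._cached
        data = self._compute()
        if self._use_cache:
            self._cached = data
        return data


calls = {"n": 0}

def expensive_square(x):
    calls["n"] += 1  # count how often the expensive step really runs
    return x * x

squares = MiniRDD.from_list(range(5)).map(expensive_square).cache()
calls_before_action = calls["n"]          # still 0: map() was lazy

evens = squares.filter(lambda x: x % 2 == 0).collect()
total = sum(squares.collect())            # served from the in-memory cache

print(calls_before_action, evens, total, calls["n"])
```

The expensive function runs once per element even though the cached dataset is used twice, which is the intuition behind Spark's speed-up on iterative and interactive workloads.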
Functional Differences
RDBMS
Contains N number of tables, and each
table has its own unique primary key.
The row is known as a Record that
holds information about the individual
entry.
The column is known as a Field that
holds information about a specific
attribute.
When a user runs a query, it returns
only the rows matching that query.
Important concepts in a relational
DBMS: Not Null, Unique, Check,
Primary Key, Foreign Key, and Data
Integrity.
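The constraints listed above can be seen in action with a short sqlite3 sketch. The `accounts` table, its columns, and the sample rows are assumptions made up for illustration; the point is only that the database itself rejects rows that break a rule.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table combining PRIMARY KEY, NOT NULL, UNIQUE and CHECK.
conn.execute("""
CREATE TABLE accounts (
    id      INTEGER PRIMARY KEY,
    email   TEXT NOT NULL UNIQUE,
    balance REAL NOT NULL CHECK (balance >= 0)
)
""")
conn.execute("INSERT INTO accounts VALUES (1, 'a@example.com', 100.0)")


def try_insert(row):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO accounts VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


results = {
    "valid row": try_insert((2, "b@example.com", 50.0)),
    "duplicate primary key": try_insert((1, "c@example.com", 10.0)),
    "duplicate email (UNIQUE)": try_insert((3, "a@example.com", 10.0)),
    "missing email (NOT NULL)": try_insert((4, None, 10.0)),
    "negative balance (CHECK)": try_insert((5, "d@example.com", -1.0)),
}

for case, accepted in results.items():
    print(f"{case}: {'accepted' if accepted else 'rejected'}")
```

Only the first row is accepted; the other four are rejected by the database before they can damage data integrity.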
HADOOP
Hadoop runs applications using the
MapReduce model, processing data in
parallel across different nodes.
The core of Hadoop is built on a
storage part called Hadoop
Distributed File System and a
processing part called the MapReduce
programming model.
Hadoop splits files into large blocks
and distributes them across the
cluster, then transfers packaged code to
the nodes to process data in parallel.
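The flow described above (split the input into blocks, map each block, group by key, reduce) can be sketched as a word count in plain Python. This is a single-process illustration, not Hadoop code: the function names and the tiny input are hypothetical, and on a real cluster each block's map step would run on the node that stores that block.

```python
from collections import defaultdict


def split_into_blocks(lines, block_size):
    """Stand-in for splitting a file into blocks (here: groups of lines)."""
    return [lines[i:i + block_size] for i in range(0, len(lines), block_size)]


def map_phase(block):
    """Map: emit a (word, 1) pair for every word in the block."""
    return [(word.lower(), 1) for line in block for word in line.split()]


def shuffle(mapped_blocks):
    """Shuffle: group all emitted values by key across blocks."""
    groups = defaultdict(list)
    for pairs in mapped_blocks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}


lines = ["big data big clusters", "data moves to code", "code moves to data"]
blocks = split_into_blocks(lines, block_size=2)

# Each block is mapped independently, which is what makes the step parallelizable.
mapped = [map_phase(block) for block in blocks]
counts = reduce_phase(shuffle(mapped))

print(len(blocks), counts)
```

Because every block is mapped without looking at any other block, the map step can be spread over as many nodes as there are blocks.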
SPARK
Spark extends the MapReduce model
of Hadoop to efficiently support more
types of computation, including
interactive queries and stream
processing.
Spark utilizes Hadoop in two ways –
one is storage and the second is
processing. Since Spark has its own
Cluster Management, it uses Hadoop
for storage purposes only.
Spark also supports SQL queries,
Streaming data, Machine learning,
and Graph Algorithms.
Applications
RDBMS
Can be considered for any information
need in which data points relate to
each other and must be managed in a
secure, rules-based, consistent way.
Relational databases are used to track
inventories, process e-commerce
transactions, and manage huge
amounts of mission-critical customer
information.
When a customer deposits money at
an ATM and then looks at the account
balance on a mobile phone, the
customer expects to see that deposit
reflected immediately in an updated
account balance. An RDBMS excels at this.
HADOOP
Finance sector - for fraud detection
and prevention.
Security and law enforcement - to
prevent terrorist attacks and to detect
and prevent cyber-attacks.
Retail industry - keeping track of
customer behaviour.
Real-time analysis of customer data
- It can track and process high
volumes of clickstream data.
Other applications: Advertisements
Targeting Platforms, Sentiment
analysis, Financial Trading and
Forecasting, Healthcare sectors,
Optimizing machine performance
SPARK
Processing Streaming Data - It unifies
disparate data processing capabilities,
allowing developers to use a single
framework for all their processing
requirements.
Machine Learning - It is equipped with an
integrated framework for performing
advanced analytics that allows you to run
repeated queries on datasets.
Network security - Security
providers/companies can inspect data
packets for detecting any traces of
malicious activity.
Fog Computing - As IoT continues to
expand, there arises a need for a scalable
distributed parallel processing system for
processing vast amounts of data.
Benefits over Each Other
RDBMS
Commitment and atomicity - Relational
databases handle business rules and
policies at a very granular level, with
strict policies about commitment (that
is, making a change to the database
permanent).
Database locking and concurrency -
Conflicts can arise in a database when
multiple users or applications attempt
to change the same data at the same
time. Locking and concurrency
techniques reduce the potential for
conflicts while maintaining the integrity
of the data.
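Commitment and atomicity can be illustrated with a small sqlite3 sketch of a money transfer. The `accounts` table, the `transfer` helper, and the amounts are hypothetical. The transfer credits one account and debits another inside one transaction, so if the debit breaks a rule, the credit is rolled back as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "id INTEGER PRIMARY KEY, "
    "balance REAL NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 20.0)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money atomically; return True if committed, False if rolled back."""
    try:
        # The connection context manager commits on success and
        # rolls back the whole transaction if an exception escapes.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    rows = conn.execute("SELECT id, balance FROM accounts ORDER BY id").fetchall()
    return dict(rows)


ok = transfer(conn, 1, 2, 30.0)        # valid: both updates are committed
after_ok = balances(conn)

failed = transfer(conn, 1, 2, 500.0)   # debit would go negative: nothing is kept
after_failed = balances(conn)

print(ok, after_ok, failed, after_failed)
```

In the failed transfer the credit to account 2 had already been applied when the debit was rejected, yet the final balances are unchanged, which is the all-or-nothing behaviour the slide refers to.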
HADOOP
Data Storage - An RDBMS handles
moderately sized data, whereas Hadoop
handles very large data sets.
Speed - In an RDBMS, reads are fast,
whereas in Hadoop both reads and
writes are fast.
Cost - An RDBMS often requires a costly
licence, and Spark relies on in-memory
computation for real-time data
processing (needs high RAM), making
Hadoop the cheapest of the three.
Integrity - An RDBMS has higher integrity
than Hadoop (this is good as it allows
for data processing).
SPARK
Performance: Spark is faster because
it uses random access memory (RAM)
instead of reading and writing
intermediate data to disks. Hadoop
stores data on multiple sources and
processes it in batches via
MapReduce.
Security: Spark provides security
through authentication via shared secret
and event logging, whereas Hadoop
uses multiple authentication and
access control methods. Although
Hadoop is more secure overall, Spark
can integrate with Hadoop to reach a
higher security level.
RDBMS Vs Hadoop Vs Spark
RDBMS - Pros
Tabular Data Structure
Secure Network
SQL support
Automated maintenance
Multi-user access
Authorization control
RDBMS - Cons
Maintenance cost
Field length limits
Slower than many other databases
High physical memory use
Data complexity and data loss risk
HADOOP - Pros
Resilience - There is always a backup of
the data available in the cluster.
Scalability - Setup can be easily
expanded to include more servers that
can store additional data.
Low cost
Speed
Data diversity
HADOOP - Cons
Steep learning curve
Different datasets require different
approaches
Limitations of MapReduce
Data security
SPARK - Pros
Dynamic in Nature - around 80
high-level operators
Powerful - Low latency in-memory data
processing
Advanced Analytics
Reusable code for batch processing,
ad-hoc queries
Real-time stream processing
SPARK - Cons
No File Management System
Few Algorithms
Small Files Issues
No Automatic Optimization Process
Not multi-user-environment
friendly
Thank You
