This document provides an overview and comparison of RDBMS, Hadoop, and Spark. It introduces RDBMS and describes its use cases such as online transaction processing and data warehouses. It then introduces Hadoop and describes its ecosystem including HDFS, YARN, MapReduce, and related sub-modules. Common use cases for Hadoop are also outlined. Spark is then introduced along with its modules like Spark Core, SQL, and MLlib. Use cases for Spark include data enrichment, trigger event detection, and machine learning. The document concludes by comparing RDBMS and Hadoop, as well as Hadoop and Spark, and addressing common misconceptions about Hadoop and Spark.
2. Flow of the Presentation
Intro to RDBMS (Relational Database Management System)
Use Cases for RDBMS
Intro to HADOOP
The HADOOP Ecosystem
Use Cases for HADOOP
Intro to SPARK
The SPARK Ecosystem
Use Cases for SPARK
RDBMS vs HADOOP
HADOOP vs SPARK
Misconceptions about HADOOP and SPARK
References
4. RDBMS (Relational Database Management System)
RDBMS stands for relational database management system, and it is built on a relational data model. Tables are used to store information in an RDBMS: each row of a table represents a record, and each column represents a data attribute. RDBMS data organization and manipulation techniques differ from those of other databases. An RDBMS ensures the ACID (atomicity, consistency, isolation, and durability) properties essential for database design. The goal of an RDBMS is to store, manage, and retrieve data as rapidly and reliably as feasible.
As enterprise architects and data scientists embrace newer data architectures, they will want to retain investments in the relational DBMS for reasons such as:
1. Performance: Relational database performance has been optimised to accommodate the world's most demanding data-centric services, incorporating modern features such as caching and in-memory approaches.
2. Reliability: The relational database has a long history of effectively serving the world's largest governments and businesses, proving it to be trustworthy and reliable.
3. Integration: Since the 1980s, relational databases have supported numerous transactions, which means the technology is already integrated into almost every system within a government or corporation.
4. Security: Relational databases have evolved over the last 30 years of securing the world's most sensitive data, and they now perform like security veterans.
5. Skills: RDBMS and SQL skills have been created and refined for decades by professionals.
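The ACID guarantees described above can be observed in a small sketch using Python's built-in sqlite3 module (the `accounts` table and the transfer scenario are invented for illustration):

```python
import sqlite3

# In-memory relational database; changes below happen inside transactions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 50)")
conn.commit()

# Atomicity: a transfer either fully applies or fully rolls back.
try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 200 WHERE id = 1")
        # Consistency rule enforced here: no account may go negative.
        (bal,) = conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()
        if bal < 0:
            raise ValueError("insufficient funds")
        conn.execute("UPDATE accounts SET balance = balance + 200 WHERE id = 2")
except ValueError:
    pass  # the rollback has already restored both balances

# The failed transfer left the data exactly as it was.
print(conn.execute("SELECT balance FROM accounts ORDER BY id").fetchall())
# [(100,), (50,)]
```

The `with conn:` block is what makes the sequence atomic: the first UPDATE briefly drives account 1 negative, but the raised exception rolls the whole transaction back, so no partial transfer is ever visible.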
5. Use Cases for RDBMS
Online transaction processing: OLTP programmes are designed to handle high-volume transactions. Relational databases are ideal for OLTP systems because they allow users to insert, update, and delete small amounts of data, can accommodate a large number of users, and can handle frequent queries and updates with quick response times.
IoT solutions: IoT solutions necessitate speed as well as the capacity to collect and interpret data from edge devices, requiring a lightweight database solution. Relational databases can provide the tiny footprint that an IoT workload requires, as well as the ability to be embedded in gateway devices and to manage time-series data from IoT sensors.
Data warehouses: Relational databases can be optimised for OLAP (online analytical processing) in a data warehousing context, where historical data is processed for business intelligence. To facilitate queries over huge numbers of records, with the flexibility to summarise the data in numerous ways, a dimensional method is used. Data in the data warehouse is typically derived from a variety of sources.
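A dimensional, warehouse-style query can be sketched with sqlite3: a small fact table is summarised along a dimension, which is the kind of aggregation OLAP workloads run constantly (the `sales` table, regions, and amounts are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A tiny fact table: one row per sale, with region and year as dimensions.
conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("EU", 2022, 120.0), ("EU", 2023, 150.0),
     ("US", 2022, 200.0), ("US", 2023, 180.0)],
)

# Summarise the facts along the 'region' dimension.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(totals)  # [('EU', 270.0), ('US', 380.0)]
```

Swapping `region` for `year` (or grouping by both) gives the same facts summarised a different way, which is the flexibility the dimensional method is chosen for.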
6. HADOOP
Apache Hadoop is an open-source, Java-based software platform that manages data processing and storage for big data applications. Hadoop works by distributing enormous data sets and analytics jobs across nodes in computing clusters, breaking them into smaller workloads that can be run in parallel. Hadoop can deal with structured and unstructured data and scale reliably from a single server to a large number of machines.
The term Hadoop generally refers to the overall Hadoop ecosystem, which encompasses both the core modules and related sub-modules.
The Hadoop Ecosystem
Core Modules of Hadoop
1. Hadoop Distributed File System (HDFS): Primary data storage system; manages large data sets running on commodity hardware and provides high-throughput data access and high fault tolerance.
2. Yet Another Resource Negotiator (YARN): Cluster resource manager that schedules tasks and allocates resources (e.g., CPU and memory) to applications.
3. Hadoop MapReduce: Splits big data processing tasks into smaller ones, distributes the small tasks across different nodes, then runs each task.
4. Hadoop Common (Hadoop Core): Set of common libraries and utilities that the other three modules depend on.
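The map → shuffle → reduce flow that Hadoop MapReduce distributes across nodes can be illustrated single-process in plain Python. This is a conceptual sketch of the model, not the Hadoop API; the word-count task and input lines are the classic teaching example:

```python
from collections import defaultdict

docs = ["big data on hadoop", "hadoop stores big data"]

# Map phase: each input record is turned into (key, value) pairs.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle phase: pairs are grouped by key (Hadoop does this across the network).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: each key's values are combined independently,
# which is why reducers can run in parallel on different nodes.
counts = {word: sum(vals) for word, vals in groups.items()}
print(counts["hadoop"], counts["big"])  # 2 2
```

The key property is that both the map step (per record) and the reduce step (per key) have no dependencies between items, so the framework is free to spread them across a cluster.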
Hadoop-related sub-modules
1. Apache Hive: Data warehouse software that runs on Hadoop and enables users to work with data in HDFS using a SQL-like query language called HiveQL.
2. Apache Impala: The open-source, native analytic database for Apache Hadoop.
3. Apache Pig: A tool used with Hadoop as an abstraction over MapReduce to analyze large sets of data represented as data flows; Pig enables operations like join, filter, sort, and load.
4. Apache ZooKeeper: A centralized service for enabling highly reliable distributed coordination.
7. Use Cases for HADOOP
Hadoop is most effective for scenarios that involve the following:
Processing big data sets in environments where data size exceeds available memory
Batch processing with tasks that exploit disk read and write operations
Building data analysis infrastructure with a limited budget
Completing jobs that are not time-sensitive
Historical and archive data analysis
Real-world use cases:
1. Hadoop in Finance: Finance and IT are the top users of Apache Hadoop, since it helps banks evaluate customers and markets. Banks create risk models for customer portfolios using a Hadoop cluster.
2. Hadoop in Healthcare: Healthcare is another major user of the Hadoop framework. It helps in curing diseases and in predicting and managing epidemics by tracking large-scale health indexes. The main use of Hadoop in healthcare, though, is keeping track of patient records.
3. Hadoop in Telecom: Mobile companies have billions of customers, and the Hadoop framework enables these companies to keep track of all of them. Call data record management, telecom equipment servicing, infrastructure planning, network traffic analysis, and creating new products and services are the primary ways it is used in the telecom industry.
4. Hadoop in Retail: Any large-scale retail company that has transactional data needs data management software. MapReduce can analyze previous data from various sources to predict sales and increase profit, studying historical transactions stored in the cluster.
8. SPARK
Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Like Hadoop, it is focused on processing data in parallel across a cluster, but the most significant difference is that it works in memory.
Spark processes data in RAM using a concept known as an RDD (Resilient Distributed Dataset). Spark can run in stand-alone mode, with a Hadoop cluster serving as the data source, or in conjunction with a cluster manager such as Mesos.
Modules of the Spark ecosystem
Spark Core: Underlying execution engine that schedules and dispatches tasks and coordinates input and output (I/O) operations.
Spark SQL: Gathers information about structured data to enable users to optimize structured data processing.
Spark Streaming and Structured Streaming: Both add stream processing capabilities. Spark Streaming takes data from different streaming sources and divides it into micro-batches to form a continuous stream. Structured Streaming, built on Spark SQL, reduces latency and simplifies programming.
Machine Learning Library (MLlib): A set of scalable machine learning algorithms plus tools for feature selection and building ML pipelines. The primary API for MLlib is DataFrames, which provides uniformity across programming languages such as Java, Scala, and Python.
GraphX: User-friendly computation engine that enables interactive building, modification, and analysis of scalable, graph-structured data.
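The RDD idea — a dataset transformed lazily, with nothing computed until an action forces it — can be sketched with plain Python iterators. This is conceptual only; real Spark code would use `pyspark`'s `SparkContext.parallelize`, and the numbers here are arbitrary:

```python
# Transformations build a lazy pipeline; no element is processed yet.
data = range(1, 6)                             # like sc.parallelize([1, 2, 3, 4, 5])
squared = map(lambda x: x * x, data)           # transformation: map
evens = filter(lambda x: x % 2 == 0, squared)  # transformation: filter

# Action: the equivalent of collect() — only now does the pipeline run.
result = list(evens)
print(result)  # [4, 16]
```

Laziness matters at scale: Spark can inspect the whole transformation chain before running it, fuse steps together, and keep intermediate results in memory rather than writing each stage to disk as MapReduce does.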
9. USE CASES FOR SPARK
Data Enrichment: Spark Streaming enriches live data by combining it with static data, allowing organizations to conduct more complete real-time data analysis. Online advertisers use data enrichment to combine historical customer data with live customer behavior data and deliver more personalized and targeted ads in real time.
Trigger Event Detection: Spark Streaming allows organizations to detect and respond quickly to rare or unusual behaviors ("trigger events") that could indicate a potentially serious problem within a system. Financial institutions use triggers to detect fraudulent transactions, and hospitals use triggers to detect potentially dangerous health changes.
Machine Learning: Spark can perform advanced analytics that helps users run repeated queries on sets of data, which essentially amounts to training machine learning models. Its machine learning library (MLlib) can work in areas such as clustering, classification, and dimensionality reduction. All this enables Spark to be used for predictive intelligence, customer segmentation, and sentiment analysis.
Interactive Analysis: Apache Spark can perform exploratory queries without sampling. Spark also interfaces with a number of development languages including SQL, R, and Python. By combining Spark with visualization tools, complex data sets can be processed and visualized interactively. Structured Streaming extends this by giving users the ability to perform interactive queries against live data.
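The trigger-event pattern above can be sketched without Spark: the stream is cut into micro-batches, as Spark Streaming does by time window, and each batch is scanned for values crossing a threshold. The transaction amounts, batch size, and threshold are invented for illustration:

```python
def micro_batches(stream, size):
    """Split an event stream into fixed-size micro-batches
    (Spark Streaming does the same by time interval)."""
    for i in range(0, len(stream), size):
        yield stream[i:i + size]

# Simulated transaction amounts; anything over the threshold is a trigger event.
THRESHOLD = 1000
stream = [120, 80, 4500, 60, 90, 2500, 30]

alerts = []
for batch in micro_batches(stream, size=3):
    alerts.extend(amount for amount in batch if amount > THRESHOLD)

print(alerts)  # [4500, 2500]
```

In a real deployment the per-batch check would be a Spark job over a distributed batch, and the alert would feed a downstream system (fraud review queue, paging, etc.) rather than a list.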
11. RDBMS vs HADOOP

CATEGORY | RDBMS | HADOOP
Data and Querying | Structured data; SQL | Structured, semi-structured, and unstructured data; HQL
Speed and Latency | Reads are fast; low latency | Both reads and writes are fast; latency higher than RDBMS
Scalability and Hardware | High-end servers and vertical scalability | Commodity hardware and horizontal scalability
Data Integrity | High (ACID) | Low
Cost | Licensed | Free
Primary Use Cases | OLTP (online transaction processing) | Analytics, data discovery
12. HADOOP vs SPARK

CATEGORY | HADOOP | SPARK
Performance | Slower performance; uses disks for storage and depends on disk read and write speed. | Fast in-memory performance with reduced disk read and write operations.
Cost | An open-source platform, less expensive to run. Uses affordable consumer hardware. Easier to find trained Hadoop professionals. | An open-source platform, but relies on memory for computation, which considerably increases running costs.
Scalability | Easily scalable by adding nodes and disks for storage. Supports tens of thousands of nodes without a known limit. | A bit more challenging to scale because it relies on RAM for computations. Supports thousands of nodes in a cluster.
Security | Extremely secure. Supports LDAP, ACLs, Kerberos, SLAs, etc. | Not secure by default; security is turned off. Relies on integration with Hadoop to achieve the necessary security level.
Machine Learning | Slower than Spark. Data fragments can be too large and create bottlenecks. Mahout is the main library. | Much faster with in-memory processing. Uses MLlib for computations.
Scheduling and Resource Management | Uses external solutions. YARN is the most common option for resource management. Oozie is available for workflow scheduling. | Has built-in tools for resource allocation, scheduling, and monitoring.
13. MISCONCEPTIONS ABOUT HADOOP AND SPARK
COMMON MISCONCEPTIONS ABOUT HADOOP
Hadoop is cheap: while it is open source and easy to configure, keeping the servers running can get expensive. Managing big data with features such as in-memory
computing and network storage can cost up to $5,000.
Hadoop is a database: although Hadoop is used to store,
manage, and analyze distributed data, it does not serve transactional
queries, so Hadoop is better described as a data
warehouse than a database.
Hadoop doesn't help SMEs: “Big Data” is not exclusive to “large
companies”. Hadoop has simple features like Excel reports that
enable smaller businesses to take advantage of its power. A
Hadoop cluster or two can greatly improve the performance of a
small business.
Hadoop is difficult to configure: Although Hadoop is difficult to
administer at higher levels, there are many graphical user
interfaces (GUIs) that make MapReduce programming easier.
COMMON MISCONCEPTIONS ABOUT SPARK
Spark is an in-memory technology: Though Spark effectively
utilizes the least recently used (LRU) algorithm, it is not, itself, a
memory-based technology.
Spark always performs 100x faster than Hadoop: while Spark
can run up to 100 times faster than Hadoop for small workloads,
according to Apache it typically runs only up to three times faster
for large workloads.
Spark introduces new technologies in data processing: Though
Spark effectively utilizes the LRU algorithm and pipelines data
processing, these capabilities previously existed in massively
parallel processing (MPP) databases. However, what sets Spark
apart from MPP is its open-source orientation.