Comparison among RDBMS, Hadoop and Spark
Flow of the Presentation
Intro to RDBMS (Relational Database Management System)
Use Cases for RDBMS
Intro to HADOOP
The HADOOP Ecosystem
Use Cases for HADOOP
Intro to SPARK
The SPARK Ecosystem
Use Cases for SPARK
RDBMS vs HADOOP
HADOOP vs SPARK
Misconceptions about HADOOP and SPARK
References
Introduction
RDBMS (Relational Database Management System)
RDBMS stands for relational database management system, and it is built on the relational data model. Information is stored in tables: each row of a table represents a record, and each column represents a data property. RDBMS data organization and manipulation techniques differ from those of other databases. An RDBMS ensures the ACID (atomicity, consistency, isolation, and durability) properties essential for database design. The goal of an RDBMS is to store, manage, and retrieve data as rapidly and reliably as feasible.
As enterprise architects and data scientists embrace newer data architectures, they will want to retain investments in the relational DBMS for reasons such as:
1.Performance: Relational database performance has been optimised to accommodate the world's most demanding data-centric services, incorporating current features such as caching and in-memory approaches.
2.Reliability: The relational database has a long history of effectively serving the world's largest governments and businesses, proving it to be trustworthy and reliable.
3.Integration: Relational databases have supported transactional systems since the 1980s, which means that the technology is already integrated into almost every system within a government or corporation.
4.Security: Relational databases have evolved over the last 30 years of securing the world's most sensitive data, and they now perform like security veterans.
5.Skills: Professionals have built and refined RDBMS and SQL skills for decades.
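To make the ACID guarantee mentioned above more concrete, the following is a minimal sketch of atomicity using Python's built-in sqlite3 module. The accounts table, the account names and the transfer helper are hypothetical and exist only for illustration; the sketch assumes a CHECK constraint is used to reject overdrafts.

```python
import sqlite3

# Hypothetical two-account ledger, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
)
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one transaction; return True on success."""
    try:
        # The connection context manager commits if the block succeeds
        # and rolls back every statement in it if an exception is raised.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The debit violated the CHECK constraint, so the credit is undone too.
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))
```

In this sketch a transfer that would overdraw the source account fails as a whole: the credit to the destination account, which ran first, is rolled back together with the rejected debit.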
Use Cases for RDBMS
Online transaction processing: OLTP systems are designed to handle high-volume transactions. Relational databases are ideal for OLTP systems because they allow users to insert, update, and delete small amounts of data, can accommodate a large number of users, and can handle frequent queries and updates with quick response times.
IoT solutions: IoT solutions require speed as well as the capacity to collect and interpret data from edge devices, which calls for a lightweight database solution. Relational databases can provide the small footprint that an IoT workload requires, as well as the ability to be embedded into gateway devices and to manage time series data produced by IoT sensors.
Data warehouses: Relational databases can be optimised for OLAP (online analytical processing) in a data warehousing context, where historical data is processed for business intelligence. A dimensional approach is used to support queries over huge numbers of records and to allow the data to be summarised in numerous ways. Data in the data warehouse is typically derived from a variety of sources.
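The dimensional approach described for data warehouses can be sketched as follows, again with Python's built-in sqlite3 module. The fact_sales, dim_product and dim_date tables, their columns and the sample rows are all hypothetical; the point is only that one fact table can be summarised along several dimensions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical star schema: one fact table and two dimension tables.
conn.executescript(
    """
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE fact_sales (product_id INTEGER, date_id INTEGER, amount INTEGER);
    INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
    INSERT INTO dim_date VALUES (10, 2022), (11, 2023);
    INSERT INTO fact_sales VALUES (1, 10, 20), (1, 11, 30), (2, 10, 5), (2, 11, 40);
    """
)

# One aggregate query per dimension; both read the same fact table.
QUERIES = {
    "category": (
        "SELECT p.category, SUM(f.amount) FROM fact_sales f "
        "JOIN dim_product p ON p.product_id = f.product_id "
        "GROUP BY p.category ORDER BY p.category"
    ),
    "year": (
        "SELECT d.year, SUM(f.amount) FROM fact_sales f "
        "JOIN dim_date d ON d.date_id = f.date_id "
        "GROUP BY d.year ORDER BY d.year"
    ),
}


def sales_by(conn, dimension):
    """Summarise total sales along the chosen dimension."""
    return conn.execute(QUERIES[dimension]).fetchall()
```

Calling sales_by(conn, "category") and sales_by(conn, "year") summarises the same records in two different ways without changing the stored data.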
HADOOP
Apache Hadoop is an open-source, Java-based software platform that manages data processing and storage for big data applications. Hadoop works by distributing large data sets and analytics jobs across nodes in a computing cluster, breaking them down into smaller workloads that can be run in parallel. Hadoop can handle structured and unstructured data and scale up reliably from a single server to a large number of machines.
The Hadoop Ecosystem:
The term Hadoop generally refers to the overall Hadoop ecosystem, which encompasses both the core modules and related sub-modules.
Core Modules of Hadoop
1.Hadoop Distributed File System (HDFS): Primary data storage
system, manages large data sets running on commodity hardware
and provides high-throughput data access and high fault tolerance.
2.Yet Another Resource Negotiator (YARN): Cluster resource
manager that schedules tasks and allocates resources (e.g., CPU
and memory) to applications.
3.Hadoop MapReduce: Splits big data processing tasks into smaller
ones, distributes the small tasks across different nodes, then runs
each task.
4.Hadoop Common (Hadoop Core): Set of common libraries and
utilities that the other three modules depend on.
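The MapReduce idea listed among the core modules can be illustrated with a word count in plain Python. This is a single-process sketch of the programming model only, not Hadoop's Java API; the function names are hypothetical, and in a real cluster the map and reduce calls would run on different nodes.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine the values collected for one key."""
    return key, sum(values)


def word_count(chunks):
    """Run map, shuffle and reduce over a list of text chunks."""
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
```

Each chunk stands in for a block of a large file: the map step can process chunks independently, and only the shuffle step needs to bring values for the same key together.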
Hadoop-related sub-modules
1.Apache Hive: Data warehouse software that runs on Hadoop and enables users to work with data in HDFS using a SQL-like query language called HiveQL.
2.Apache Impala: The open-source, native analytic database for Apache Hadoop.
3.Apache Pig: A tool used with Hadoop as an abstraction over MapReduce to analyze large sets of data represented as data flows. Pig enables operations like join, filter, sort, load, etc.
4.Apache Zookeeper: A centralized service for enabling highly reliable distributed processing.
Use Cases for HADOOP
Hadoop is most effective for scenarios that involve the following:
Processing big data sets in environments where data size exceeds available memory
Batch processing with tasks that exploit disk read and write operations
Building data analysis infrastructure with a limited budget
Completing jobs that are not time-sensitive
Historical and archive data analysis
Real-world use cases:
1.Hadoop in Finance: Finance and IT are the top users of Apache Hadoop, since it helps banks evaluate customers and marketers against legal requirements. Banks create risk models for customer portfolios using a cluster.
2.Hadoop in Healthcare: Healthcare is another major user of the Hadoop framework. It supports disease research and helps predict and manage epidemics by tracking large-scale health indexes. The main use of Hadoop in healthcare, though, is keeping track of patient records.
3.Hadoop in Telecom: Mobile companies have billions of customers, and the Hadoop framework enables these companies to keep track of all of them. Call data record management, telecom equipment servicing, infrastructure planning, network traffic analysis, and creating new products and services are the primary ways it is used in the telecom industry.
4.Hadoop in Retail: Any large-scale retail company that has transactional data needs data management software. MapReduce can analyze historical data from various sources to predict sales and increase profit. It studies historical transactions and adds them to the cluster.
SPARK
Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Like Hadoop, it is focused on processing data in parallel across a cluster, but the most significant difference is that it works in memory.
Spark processes data in RAM using a concept known as an RDD (Resilient Distributed Dataset). Spark can run either in stand-alone mode, with a Hadoop cluster serving as the data source, or in conjunction with Mesos.
Modules of Spark ecosystem
Spark Core: Underlying execution engine that schedules and dispatches tasks and coordinates input and output (I/O) operations.
Spark SQL: Gathers information about structured data to enable users to optimize structured data processing.
Spark Streaming and Structured Streaming: Both add stream processing capabilities. Spark Streaming takes data from different streaming sources and divides it into micro-batches for a continuous stream. Structured Streaming, built on Spark SQL, reduces latency and simplifies programming.
Machine Learning Library (MLlib): A set of machine learning algorithms for scalability plus tools for feature selection and building ML pipelines. The primary API for MLlib is DataFrames, which provides uniformity across different programming languages like Java, Scala and Python.
GraphX: User-friendly computation engine that enables interactive building, modification and analysis of scalable, graph-structured data.
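The micro-batching idea described for Spark Streaming can be sketched in plain Python. This is not Spark's API: the micro_batches and batch_totals names are hypothetical, and the sketch cuts batches by record count for simplicity, whereas Spark Streaming cuts them by time interval.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Split a (possibly unbounded) stream into small lists of records."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # the stream is exhausted
        yield batch


def batch_totals(stream, batch_size):
    """Run the same small batch computation on every micro-batch."""
    return [sum(batch) for batch in micro_batches(stream, batch_size)]
```

Because each micro-batch is an ordinary small data set, the same batch-style computation can be reused on live data, which is the main appeal of the approach.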
USE CASES FOR SPARK
Data Enrichment: Spark Streaming enriches live data by combining it with static data, allowing organizations to conduct more complete real-time data analysis. Online advertisers use data enrichment to combine historical customer data with live customer behavior data and deliver more personalized and targeted ads in real time.
Trigger Event Detection: Spark Streaming allows organizations to detect and respond quickly to rare or unusual behaviors (“trigger events”) that could indicate a potentially serious problem within the system. Financial institutions use triggers to detect fraudulent transactions. Hospitals also use triggers to detect potentially dangerous health changes.
Machine Learning: Spark can perform advanced analytics by running repeated queries on sets of data, which is essentially what processing machine learning algorithms amounts to. Its machine learning library (MLlib) can work in areas such as clustering, classification, and dimensionality reduction. All this enables Spark to be used for predictive intelligence, customer segmentation, and sentiment analysis.
Interactive Analysis: Apache Spark can perform exploratory queries without sampling. Spark also interfaces with a number of development languages including SQL, R, and Python. By combining Spark with visualization tools, complex data sets can be processed and visualized interactively. Structured Streaming additionally gives users the ability to perform interactive queries against live data.
Comparisons
RDBMS vs HADOOP
CATEGORY | RDBMS | HADOOP
Data and Querying | Structured data; SQL | Structured, semi-structured and unstructured data; HQL
Speed and Latency | Reads are fast; low latency | Both reads and writes are fast; latency higher than RDBMS
Scalability and Hardware | High-end servers and vertical scalability | Commodity hardware and horizontal scalability
Data Integrity | High (ACID) | Low
Cost | Licensed | Free
Primary use-cases | OLTP (online transaction processing) | Analytics, data discovery
HADOOP vs SPARK
CATEGORY | HADOOP | SPARK
Performance | Slower performance, uses disks for storage and depends on disk read and write speed. | Fast in-memory performance with reduced disk reading and writing operations.
Cost | An open-source platform, less expensive to run. Uses affordable consumer hardware. Easier to find trained Hadoop professionals. | An open-source platform, but relies on memory for computation, which considerably increases running costs.
Scalability | Easily scalable by adding nodes and disks for storage. Supports tens of thousands of nodes without a known limit. | A bit more challenging to scale because it relies on RAM for computations. Supports thousands of nodes in a cluster.
Security | Extremely secure. Supports LDAP, ACLs, Kerberos, SLAs, etc. | Not secure. By default, the security is turned off. Relies on integration with Hadoop to achieve the necessary security level.
Machine Learning | Slower than Spark. Data fragments can be too large and create bottlenecks. Mahout is the main library. | Much faster with in-memory processing. Uses MLlib for computations.
Scheduling and Resource Management | Uses external solutions. YARN is the most common option for resource management. Oozie is available for workflow scheduling. | Has built-in tools for resource allocation, scheduling, and monitoring.
MISCONCEPTIONS ABOUT HADOOP AND SPARK
COMMON MISCONCEPTIONS ABOUT HADOOP
Hadoop is cheap: While it is open source and easy to configure, it can get expensive to keep the servers running. Big data can cost up to $5,000 to manage when using features such as in-memory computing and network storage.
Hadoop is a database: Although Hadoop is used to store, manage, and analyze distributed data, extracting data does not require any queries, so Hadoop is more of a data warehouse than a database.
Hadoop doesn't help SMEs: “Big Data” is not exclusive to “large companies”. Hadoop has simple features like Excel reports that enable smaller businesses to take advantage of its power. A Hadoop cluster or two can greatly improve the performance of a small business.
Hadoop is difficult to configure: Although Hadoop is difficult to administer at higher levels, there are many graphical user interfaces (GUIs) that make MapReduce programming easier.
COMMON MISCONCEPTIONS ABOUT SPARK
Spark is an in-memory technology: Though Spark effectively utilizes the least recently used (LRU) algorithm, it is not, itself, a memory-based technology.
Spark always performs 100x faster than Hadoop: While Spark can run up to 100 times faster than Hadoop for small workloads, it typically only runs up to three times faster for large workloads, according to Apache.
Spark introduces new technologies in data processing: Though Spark effectively utilizes the LRU algorithm and pipelines data processing, these capabilities previously existed in massively parallel processing (MPP) databases. However, what sets Spark apart from MPP is its open-source orientation.
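The least recently used (LRU) algorithm mentioned in the Spark misconceptions can be illustrated with a short, generic sketch. This is an assumption-laden illustration of LRU eviction in general, not Spark's actual caching code; the LRUCache class and its method names are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest entry last

    def get(self, key):
        """Return the cached value, or None, and mark the key as recently used."""
        if key not in self._items:
            return None
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key, value):
        """Store a value and evict the least recently used entry if over capacity."""
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the oldest entry

    def keys(self):
        """Return the cached keys from least to most recently used."""
        return list(self._items)
```

When memory is limited, entries that have not been touched recently are dropped first, which is why a system using this policy still needs somewhere else, such as disk, to fall back on.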
References
https://www.ibm.com/analytics/relational-database
https://www.geeksforgeeks.org/difference-between-hadoop-and-spark/
https://www.geeksforgeeks.org/difference-between-rdbms-and-hadoop/
https://www.differencebetween.com/difference-between-rdbms-and-hadoop/
https://phoenixnap.com/kb/hadoop-vs-spark
https://www.ibm.com/cloud/blog/hadoop-vs-spark
THANK
YOU!

More Related Content

What's hot

Hadoop in three use cases
Hadoop in three use casesHadoop in three use cases
Hadoop in three use casesJoey Echeverria
 
Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014Data Con LA
 
Building a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystemBuilding a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystemGregg Barrett
 
Hadoop vs. RDBMS for Advanced Analytics
Hadoop vs. RDBMS for Advanced AnalyticsHadoop vs. RDBMS for Advanced Analytics
Hadoop vs. RDBMS for Advanced Analyticsjoshwills
 
Hadoop Reporting and Analysis - Jaspersoft
Hadoop Reporting and Analysis - JaspersoftHadoop Reporting and Analysis - Jaspersoft
Hadoop Reporting and Analysis - JaspersoftHortonworks
 
Apache hive essentials
Apache hive essentialsApache hive essentials
Apache hive essentialsSteve Tran
 
Planing and optimizing data lake architecture
Planing and optimizing data lake architecturePlaning and optimizing data lake architecture
Planing and optimizing data lake architectureMilos Milovanovic
 
Empowering you with Democratized Data Access, Data Science and Machine Learning
Empowering you with Democratized Data Access, Data Science and Machine LearningEmpowering you with Democratized Data Access, Data Science and Machine Learning
Empowering you with Democratized Data Access, Data Science and Machine LearningDataWorks Summit
 
Big data vahidamiri-datastack.ir
Big data vahidamiri-datastack.irBig data vahidamiri-datastack.ir
Big data vahidamiri-datastack.irdatastack
 
Big data architecture on cloud computing infrastructure
Big data architecture on cloud computing infrastructureBig data architecture on cloud computing infrastructure
Big data architecture on cloud computing infrastructuredatastack
 
Hadoop distributions - ecosystem
Hadoop distributions - ecosystemHadoop distributions - ecosystem
Hadoop distributions - ecosystemJakub Stransky
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Thanh Nguyen
 
Learning How to Learn Hadoop
Learning How to Learn HadoopLearning How to Learn Hadoop
Learning How to Learn HadoopSilicon Halton
 
Introduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-SystemIntroduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-SystemMd. Hasan Basri (Angel)
 
Top Hadoop Big Data Interview Questions and Answers for Fresher
Top Hadoop Big Data Interview Questions and Answers for FresherTop Hadoop Big Data Interview Questions and Answers for Fresher
Top Hadoop Big Data Interview Questions and Answers for FresherJanBask Training
 
Introduction to Microsoft Azure HD Insight by Dattatrey Sindhol
Introduction to Microsoft Azure HD Insight by Dattatrey Sindhol Introduction to Microsoft Azure HD Insight by Dattatrey Sindhol
Introduction to Microsoft Azure HD Insight by Dattatrey Sindhol HARMAN Services
 

What's hot (20)

What is hadoop
What is hadoopWhat is hadoop
What is hadoop
 
Hadoop in three use cases
Hadoop in three use casesHadoop in three use cases
Hadoop in three use cases
 
Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014
 
Building a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystemBuilding a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystem
 
Hadoop vs. RDBMS for Advanced Analytics
Hadoop vs. RDBMS for Advanced AnalyticsHadoop vs. RDBMS for Advanced Analytics
Hadoop vs. RDBMS for Advanced Analytics
 
Hadoop Reporting and Analysis - Jaspersoft
Hadoop Reporting and Analysis - JaspersoftHadoop Reporting and Analysis - Jaspersoft
Hadoop Reporting and Analysis - Jaspersoft
 
Apache hive essentials
Apache hive essentialsApache hive essentials
Apache hive essentials
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Planing and optimizing data lake architecture
Planing and optimizing data lake architecturePlaning and optimizing data lake architecture
Planing and optimizing data lake architecture
 
Empowering you with Democratized Data Access, Data Science and Machine Learning
Empowering you with Democratized Data Access, Data Science and Machine LearningEmpowering you with Democratized Data Access, Data Science and Machine Learning
Empowering you with Democratized Data Access, Data Science and Machine Learning
 
Big data vahidamiri-datastack.ir
Big data vahidamiri-datastack.irBig data vahidamiri-datastack.ir
Big data vahidamiri-datastack.ir
 
Big data architecture on cloud computing infrastructure
Big data architecture on cloud computing infrastructureBig data architecture on cloud computing infrastructure
Big data architecture on cloud computing infrastructure
 
Hadoop distributions - ecosystem
Hadoop distributions - ecosystemHadoop distributions - ecosystem
Hadoop distributions - ecosystem
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1
 
Benefits of Hadoop as Platform as a Service
Benefits of Hadoop as Platform as a ServiceBenefits of Hadoop as Platform as a Service
Benefits of Hadoop as Platform as a Service
 
Learning How to Learn Hadoop
Learning How to Learn HadoopLearning How to Learn Hadoop
Learning How to Learn Hadoop
 
Introduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-SystemIntroduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-System
 
Big data concepts
Big data conceptsBig data concepts
Big data concepts
 
Top Hadoop Big Data Interview Questions and Answers for Fresher
Top Hadoop Big Data Interview Questions and Answers for FresherTop Hadoop Big Data Interview Questions and Answers for Fresher
Top Hadoop Big Data Interview Questions and Answers for Fresher
 
Introduction to Microsoft Azure HD Insight by Dattatrey Sindhol
Introduction to Microsoft Azure HD Insight by Dattatrey Sindhol Introduction to Microsoft Azure HD Insight by Dattatrey Sindhol
Introduction to Microsoft Azure HD Insight by Dattatrey Sindhol
 

Similar to RDBMS, Hadoop and Spark: A Comparison

RDBMS vs Hadoop vs Spark
RDBMS vs Hadoop vs SparkRDBMS vs Hadoop vs Spark
RDBMS vs Hadoop vs SparkLaxmi8
 
Unstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus ModelUnstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus ModelEditor IJCATR
 
Exploiting Apache Spark's Potential Changing Enormous Information Investigati...
Exploiting Apache Spark's Potential Changing Enormous Information Investigati...Exploiting Apache Spark's Potential Changing Enormous Information Investigati...
Exploiting Apache Spark's Potential Changing Enormous Information Investigati...rajeshseo5
 
Modern data warehouse
Modern data warehouseModern data warehouse
Modern data warehouseStephen Alex
 
Modern data warehouse
Modern data warehouseModern data warehouse
Modern data warehouseStephen Alex
 
Hadoop data-lake-white-paper
Hadoop data-lake-white-paperHadoop data-lake-white-paper
Hadoop data-lake-white-paperSupratim Ray
 
Tools and techniques for data science
Tools and techniques for data scienceTools and techniques for data science
Tools and techniques for data scienceAjay Ohri
 
Hadoop and Big Data Analytics | Sysfore
Hadoop and Big Data Analytics | SysforeHadoop and Big Data Analytics | Sysfore
Hadoop and Big Data Analytics | SysforeSysfore Technologies
 
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?IJCSIS Research Publications
 
ManMachine&Mathematics_Arup_Ray_Ext
ManMachine&Mathematics_Arup_Ray_ExtManMachine&Mathematics_Arup_Ray_Ext
ManMachine&Mathematics_Arup_Ray_ExtArup Ray
 
Lighting up Big Data Analytics with Apache Spark in Azure
Lighting up Big Data Analytics with Apache Spark in AzureLighting up Big Data Analytics with Apache Spark in Azure
Lighting up Big Data Analytics with Apache Spark in AzureJen Stirrup
 
BigData & Hadoop Ecosystem.pptx
BigData & Hadoop Ecosystem.pptxBigData & Hadoop Ecosystem.pptx
BigData & Hadoop Ecosystem.pptxBibhasDeb1
 

Similar to RDBMS, Hadoop and Spark: A Comparison (20)

finap ppt conference.pptx
finap ppt conference.pptxfinap ppt conference.pptx
finap ppt conference.pptx
 
RDBMS vs Hadoop vs Spark
RDBMS vs Hadoop vs SparkRDBMS vs Hadoop vs Spark
RDBMS vs Hadoop vs Spark
 
Big data with java
Big data with javaBig data with java
Big data with java
 
Unstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus ModelUnstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus Model
 
IJET-V3I2P14
IJET-V3I2P14IJET-V3I2P14
IJET-V3I2P14
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
paper
paperpaper
paper
 
Exploiting Apache Spark's Potential Changing Enormous Information Investigati...
Exploiting Apache Spark's Potential Changing Enormous Information Investigati...Exploiting Apache Spark's Potential Changing Enormous Information Investigati...
Exploiting Apache Spark's Potential Changing Enormous Information Investigati...
 
Modern data warehouse
Modern data warehouseModern data warehouse
Modern data warehouse
 
Modern data warehouse
Modern data warehouseModern data warehouse
Modern data warehouse
 
Hadoop data-lake-white-paper
Hadoop data-lake-white-paperHadoop data-lake-white-paper
Hadoop data-lake-white-paper
 
Big Data , Big Problem?
Big Data , Big Problem?Big Data , Big Problem?
Big Data , Big Problem?
 
Tools and techniques for data science
Tools and techniques for data scienceTools and techniques for data science
Tools and techniques for data science
 
Hadoop and Big Data Analytics | Sysfore
Hadoop and Big Data Analytics | SysforeHadoop and Big Data Analytics | Sysfore
Hadoop and Big Data Analytics | Sysfore
 
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
 
ManMachine&Mathematics_Arup_Ray_Ext
ManMachine&Mathematics_Arup_Ray_ExtManMachine&Mathematics_Arup_Ray_Ext
ManMachine&Mathematics_Arup_Ray_Ext
 
Lighting up Big Data Analytics with Apache Spark in Azure
Lighting up Big Data Analytics with Apache Spark in AzureLighting up Big Data Analytics with Apache Spark in Azure
Lighting up Big Data Analytics with Apache Spark in Azure
 
BigData & Hadoop Ecosystem.pptx
BigData & Hadoop Ecosystem.pptxBigData & Hadoop Ecosystem.pptx
BigData & Hadoop Ecosystem.pptx
 
Big Data & Hadoop
Big Data & HadoopBig Data & Hadoop
Big Data & Hadoop
 
Big data
Big dataBig data
Big data
 

Recently uploaded

9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home ServiceSapana Sha
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxUnduhUnggah1
 

Recently uploaded (20)

9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
20240419 - Measurecamp Amsterdam - SAM.pdf

RDBMS, Hadoop and Spark: A Comparison

  • 2. Flow of the Presentation
      • Intro to RDBMS (Relational Database Management System)
      • Use Cases for RDBMS
      • Intro to HADOOP
      • The HADOOP Ecosystem
      • Use Cases for HADOOP
      • Intro to SPARK
      • The SPARK Ecosystem
      • Use Cases for SPARK
      • RDBMS vs HADOOP
      • HADOOP vs SPARK
      • Misconceptions about HADOOP and SPARK
      • References
  • 4. RDBMS (Relational Database Management System)
    RDBMS stands for relational database management system, and it is built on the relational data model. An RDBMS stores information in tables: each row of a table represents a record, and each column represents a data property. Its data organization and manipulation techniques differ from those of other databases. An RDBMS ensures the ACID (atomicity, consistency, isolation, and durability) properties essential for database design. The goal of an RDBMS is to store, manage, and retrieve data as rapidly and reliably as feasible.
    As enterprise architects and data scientists embrace newer data architectures, they will want to retain investments in the relational DBMS for reasons such as:
      1. Performance: Relational database performance has been optimised to accommodate the world's most demanding data-centric services, incorporating current features such as caching and in-memory approaches.
      2. Reliability: The relational database has a long history of effectively serving the world's largest governments and businesses, proving it to be trustworthy and reliable.
      3. Integration: Since the 1980s, relational databases have supported numerous transactions, which means that the technology is already integrated into almost every system within a government or corporation.
      4. Security: Relational databases have evolved during the last 30 years of securing the world's most sensitive data, and they now perform like security veterans.
      5. Skills: RDBMS and SQL skills have been built up and refined by professionals over decades.
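The atomicity part of ACID can be illustrated with a small sketch using Python's built-in `sqlite3` module. The `accounts` table, the `CHECK` constraint, and the helper names `make_demo_db`, `transfer`, and `balances` are hypothetical and exist only for this illustration; the deck itself does not prescribe any schema.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with a hypothetical accounts table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "id INTEGER PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany(
        "INSERT INTO accounts (id, balance) VALUES (?, ?)",
        [(1, 100), (2, 50)],
    )
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one transaction; return True on commit."""
    try:
        # The connection context manager commits on success and rolls back
        # every statement in the block if an exception escapes it.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The debit violated the CHECK constraint, so the credit was undone too.
        return False


def balances(conn):
    """Return {account id: balance} for inspection."""
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))
```

The credit is deliberately issued before the debit: when the debit fails, the already-applied credit is rolled back with it, which is the all-or-nothing behaviour that atomicity describes.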
  • 5. Use Cases for RDBMS
      • Online transaction processing: OLTP applications are designed to handle high-volume transactions. Relational databases are ideal for OLTP systems because they allow users to insert, update, and delete small amounts of data, can accommodate a large number of users, and can handle frequent queries and updates with quick response times.
      • IoT solutions: IoT solutions require speed as well as the capacity to collect and interpret data from edge devices, which calls for a lightweight database. Relational databases can provide the small footprint that an IoT workload requires, can be embedded into gateway devices, and can manage the time series data produced by IoT sensors.
      • Data warehouses: Relational databases can be optimised for OLAP (online analytical processing) in a data warehousing context, where historical data is processed for business intelligence. A dimensional model is used to support queries over huge numbers of records and to summarise the data in numerous ways. Data in the data warehouse is typically derived from a variety of sources.
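A dimensional model of the kind mentioned for data warehouses can be sketched as one fact table joined to one dimension table and summarised with `GROUP BY`. This is a minimal sketch using Python's `sqlite3` module; the table names `fact_sales` and `dim_product`, the columns, and the sample rows are all assumptions made for the example.

```python
import sqlite3


def build_warehouse():
    """Create a tiny star schema: one fact table and one dimension table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dim_product (
            product_id INTEGER PRIMARY KEY,
            category   TEXT NOT NULL
        );
        CREATE TABLE fact_sales (
            sale_id    INTEGER PRIMARY KEY,
            product_id INTEGER NOT NULL REFERENCES dim_product(product_id),
            year       INTEGER NOT NULL,
            amount     INTEGER NOT NULL
        );
        """
    )
    conn.executemany(
        "INSERT INTO dim_product (product_id, category) VALUES (?, ?)",
        [(1, "books"), (2, "books"), (3, "toys")],
    )
    conn.executemany(
        "INSERT INTO fact_sales (product_id, year, amount) VALUES (?, ?, ?)",
        [(1, 2022, 10), (2, 2022, 15), (3, 2022, 7), (1, 2023, 20), (3, 2023, 5)],
    )
    conn.commit()
    return conn


def sales_by_category_and_year(conn):
    """Summarise the fact table along two dimensions: category and year."""
    return conn.execute(
        """
        SELECT p.category, s.year, SUM(s.amount)
        FROM fact_sales AS s
        JOIN dim_product AS p ON p.product_id = s.product_id
        GROUP BY p.category, s.year
        ORDER BY p.category, s.year
        """
    ).fetchall()
```

Changing the `GROUP BY` columns is all it takes to summarise the same facts another way, for example by year only.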
  • 6. HADOOP
    Apache Hadoop is an open-source, Java-based software platform that manages data processing and storage for big data applications. Hadoop works by distributing enormous data sets and analytics jobs across the nodes of a computing cluster, breaking them into smaller workloads that can be run in parallel. Hadoop can handle structured and unstructured data and scale up reliably from a single server to a large number of machines.
    The Hadoop Ecosystem: The term Hadoop generally refers to the overall Hadoop ecosystem, which encompasses both the core modules and related sub-modules.
    Core Modules of Hadoop
      1. Hadoop Distributed File System (HDFS): Primary data storage system. It manages large data sets on commodity hardware and provides high-throughput data access and high fault tolerance.
      2. Yet Another Resource Negotiator (YARN): Cluster resource manager that schedules tasks and allocates resources (e.g., CPU and memory) to applications.
      3. Hadoop MapReduce: Splits big data processing tasks into smaller ones, distributes the small tasks across different nodes, then runs each task.
      4. Hadoop Common (Hadoop Core): Set of common libraries and utilities that the other three modules depend on.
    Hadoop-related sub-modules
      1. Apache Hive: Data warehouse software that runs on Hadoop and enables users to work with data in HDFS using a SQL-like query language called HiveQL.
      2. Apache Impala: An open-source, native analytic database for Apache Hadoop.
      3. Apache Pig: A tool used with Hadoop as an abstraction over MapReduce to analyze large data sets represented as data flows. Pig supports operations such as join, filter, sort, and load.
      4. Apache ZooKeeper: A centralized service for enabling highly reliable distributed processing.
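The split, distribute, and combine idea behind MapReduce can be sketched in plain Python with the classic word count. This is a single-process illustration of the programming model only, not the Hadoop API; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are made up for the example, and each input string stands in for a data block that a real cluster would hand to a different node.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    """Run map over every chunk, shuffle the pairs, then reduce each group."""
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Because `map_phase` only looks at its own chunk and `reduce_phase` only looks at one key, both steps could run on separate machines without coordination, which is what makes the model easy to parallelise.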
  • 7. Use Cases for HADOOP
    Hadoop is most effective for scenarios that involve the following:
      • Processing big data sets in environments where data size exceeds available memory
      • Batch processing with tasks that exploit disk read and write operations
      • Building data analysis infrastructure with a limited budget
      • Completing jobs that are not time-sensitive
      • Historical and archive data analysis
    Real-world use cases:
      1. Hadoop in Finance: Finance and IT are among the top users of Apache Hadoop, since it helps banks evaluate customers and markets against legal requirements. Banks create risk models for customer portfolios using a cluster.
      2. Hadoop in Healthcare: Healthcare is another major user of the Hadoop framework. It supports disease research and helps predict and manage epidemics by tracking large-scale health indicators. The main use of Hadoop in healthcare, though, is keeping track of patient records.
      3. Hadoop in Telecom: Mobile companies have billions of customers, and the Hadoop framework enables these companies to keep track of all of them. Call data record management, telecom equipment servicing, infrastructure planning, network traffic analysis, and creating new products and services are the primary ways it is used in the telecom industry.
      4. Hadoop in Retail: Any large-scale retail company that has transactional data needs data management software. MapReduce can analyze historical data from various sources to predict sales and increase profit, drawing on the past transactions stored in the cluster.
  • 8. SPARK
    Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Like Hadoop, it is focused on processing data in parallel across a cluster, but the most significant difference is that it works in memory. Spark processes data in RAM using a concept known as an RDD (Resilient Distributed Dataset). Spark can run either in stand-alone mode, with a Hadoop cluster serving as the data source, or in conjunction with Mesos.
    Modules of Spark ecosystem
      • Spark Core: Underlying execution engine that schedules and dispatches tasks and coordinates input and output (I/O) operations.
      • Spark SQL: Gathers information about structured data to enable users to optimize structured data processing.
      • Spark Streaming and Structured Streaming: Both add stream processing capabilities. Spark Streaming takes data from different streaming sources and divides it into micro-batches to form a continuous stream. Structured Streaming, built on Spark SQL, reduces latency and simplifies programming.
      • Machine Learning Library (MLlib): A set of scalable machine learning algorithms plus tools for feature selection and building ML pipelines. The primary API for MLlib is DataFrame-based, which provides uniformity across programming languages such as Java, Scala and Python.
      • GraphX: User-friendly computation engine that enables interactive building, modification and analysis of scalable, graph-structured data.
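The RDD idea described above (partitioned data, transformations recorded rather than run immediately, optional in-memory reuse) can be sketched with a toy class in plain Python. `MiniRDD` is a hypothetical teaching aid, not the PySpark API: it runs in one process and only mimics the shape of `map`, `filter`, `cache`, and `collect`.

```python
class MiniRDD:
    """Toy stand-in for an RDD: partitions plus a recorded chain of transformations."""

    def __init__(self, partitions, lineage=()):
        self._partitions = [list(p) for p in partitions]
        self._lineage = tuple(lineage)  # recorded steps, not yet executed
        self._persist = False
        self._cached = None

    def map(self, fn):
        """Record a map step; nothing is computed yet (lazy transformation)."""
        return MiniRDD(self._partitions, self._lineage + (("map", fn),))

    def filter(self, predicate):
        """Record a filter step; nothing is computed yet."""
        return MiniRDD(self._partitions, self._lineage + (("filter", predicate),))

    def cache(self):
        """Ask for results to be kept in memory after the first collect()."""
        self._persist = True
        return self

    def compute_partition(self, index):
        """Rebuild one partition from its source data by replaying the lineage."""
        data = self._partitions[index]
        for kind, fn in self._lineage:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data

    def collect(self):
        """Action: run the recorded steps (or reuse cached results) and return a list."""
        if self._cached is not None:
            parts = self._cached
        else:
            parts = [self.compute_partition(i) for i in range(len(self._partitions))]
            if self._persist:
                self._cached = parts
        return [x for part in parts for x in part]
```

For example, `MiniRDD([[1, 2, 3], [4, 5, 6]]).map(lambda x: x * x).filter(lambda x: x % 2 == 0).collect()` does no work until `collect()` is called. Because each partition can be rebuilt from its source data and the recorded steps, a lost partition can be recomputed on its own, which is the intuition behind the "resilient" in the name.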
  • 9. USE CASES FOR SPARK
      • Data Enrichment: Spark Streaming can enrich live data by combining it with static data, allowing organizations to conduct more complete real-time data analysis. Online advertisers use data enrichment to combine historical customer data with live customer behavior data and deliver more personalized and targeted ads in real time.
      • Trigger Event Detection: Spark Streaming allows organizations to detect and respond quickly to rare or unusual behaviors ("trigger events") that could indicate a potentially serious problem within the system. Financial institutions use triggers to detect fraudulent transactions. Hospitals also use triggers to detect potentially dangerous health changes.
      • Machine Learning: Spark can perform advanced analytics that helps users run repeated queries on sets of data, which essentially amounts to processing machine learning algorithms. Its Machine Learning Library (MLlib) can work in areas such as clustering, classification, and dimensionality reduction. All this enables Spark to be used for predictive intelligence, customer segmentation, and sentiment analysis.
      • Interactive Analysis: Apache Spark can perform exploratory queries without sampling. Spark also interfaces with a number of development languages including SQL, R, and Python. By combining Spark with visualization tools, complex data sets can be processed and visualized interactively. Structured Streaming gives users the ability to perform interactive queries against live data.
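The data enrichment and trigger event detection use cases can be sketched together in plain Python: events arrive as a stream, are cut into micro-batches, joined with static profile data, and checked against a rule. This is not Spark Streaming code; the event fields (`user`, `amount`), the `profiles` lookup, the threshold, and every function name are assumptions chosen for the illustration.

```python
from itertools import islice


def micro_batches(events, batch_size):
    """Cut an incoming event stream into fixed-size micro-batches."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def enrich(batch, profiles):
    """Data enrichment: attach static profile data to each live event."""
    return [
        {**event, "segment": profiles.get(event["user"], "unknown")}
        for event in batch
    ]


def detect_triggers(batch, limit):
    """Trigger detection: keep only events whose amount exceeds the limit."""
    return [event for event in batch if event["amount"] > limit]


def process_stream(events, profiles, batch_size=2, limit=1000):
    """Process the stream batch by batch and collect every alert raised."""
    alerts = []
    for batch in micro_batches(events, batch_size):
        alerts.extend(detect_triggers(enrich(batch, profiles), limit))
    return alerts
```

In a real deployment the rule would usually be a statistical model rather than a fixed threshold, but the batch, enrich, and check pipeline keeps the same shape.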
  • 11. RDBMS vs HADOOP
      • Data and Querying
          RDBMS: Structured data; SQL
          Hadoop: Structured, semi-structured and unstructured data; HQL (HiveQL)
      • Speed and Latency
          RDBMS: Reads are fast; low latency
          Hadoop: Both reads and writes are fast; latency higher than RDBMS
      • Scalability and Hardware
          RDBMS: High-end servers and vertical scalability
          Hadoop: Commodity hardware and horizontal scalability
      • Data Integrity
          RDBMS: High (ACID)
          Hadoop: Low
      • Cost
          RDBMS: License
          Hadoop: Free
      • Primary use-cases
          RDBMS: OLTP (online transaction processing)
          Hadoop: Analytics, data discovery
  • 12. HADOOP vs SPARK
      • Performance
          Hadoop: Slower performance, uses disks for storage and depends on disk read and write speed.
          Spark: Fast in-memory performance with reduced disk reading and writing operations.
      • Cost
          Hadoop: An open-source platform, less expensive to run. Uses affordable consumer hardware. Easier to find trained Hadoop professionals.
          Spark: An open-source platform, but relies on memory for computation, which considerably increases running costs.
      • Scalability
          Hadoop: Easily scalable by adding nodes and disks for storage. Supports tens of thousands of nodes without a known limit.
          Spark: A bit more challenging to scale because it relies on RAM for computations. Supports thousands of nodes in a cluster.
      • Security
          Hadoop: Extremely secure. Supports LDAP, ACLs, Kerberos, SLAs, etc.
          Spark: Not secure. By default, the security is turned off. Relies on integration with Hadoop to achieve the necessary security level.
      • Machine Learning
          Hadoop: Slower than Spark. Data fragments can be too large and create bottlenecks. Mahout is the main library.
          Spark: Much faster with in-memory processing. Uses MLlib for computations.
      • Scheduling and Resource Management
          Hadoop: Uses external solutions. YARN is the most common option for resource management. Oozie is available for workflow scheduling.
          Spark: Has built-in tools for resource allocation, scheduling, and monitoring.
  • 13. MISCONCEPTIONS ABOUT HADOOP AND SPARK
    COMMON MISCONCEPTIONS ABOUT HADOOP
      • Hadoop is cheap: While it is open source and easy to configure, it can get expensive to keep the servers running. Big data can cost up to $5,000 to manage when using features such as in-memory computing and network storage.
      • Hadoop is a database: Although Hadoop is used to store, manage, and analyze distributed data, extracting data does not require any queries, so Hadoop is more of a data warehouse than a database.
      • Hadoop doesn't help SMEs: "Big Data" is not exclusive to "large companies". Hadoop has simple features like Excel reports that enable smaller businesses to take advantage of its power. A Hadoop cluster or two can greatly improve the performance of a small business.
      • Hadoop is difficult to configure: Although Hadoop is difficult to administer at higher levels, there are many graphical user interfaces (GUIs) that make MapReduce programming easier.
    COMMON MISCONCEPTIONS ABOUT SPARK
      • Spark is an in-memory technology: Though Spark makes effective use of the least recently used (LRU) algorithm to decide which cached data stays in memory, it is not, itself, a memory-based technology.
      • Spark always performs 100x faster than Hadoop: While Spark can run up to 100 times faster than Hadoop for small workloads, it typically only runs up to three times faster for large workloads, according to [...]
      • Spark introduces new technologies in data processing: Though Spark effectively utilizes the LRU algorithm and pipelines data processing, these capabilities previously existed in massively parallel processing (MPP) databases. What sets Spark apart from MPP databases is its open-source orientation.
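The least recently used (LRU) policy mentioned above can be sketched in a few lines of Python. This is a generic LRU cache built on `collections.OrderedDict`, offered as an illustration of the eviction rule rather than as Spark's actual memory manager; the class name `LRUCache` and the `evicted` log are assumptions made for the example.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry first."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest entry last
        self.evicted = []  # keys dropped so far, kept only for inspection

    def get(self, key):
        """Return the cached value (marking it as recently used) or None."""
        if key not in self._items:
            return None
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key, value):
        """Store a value; evict the least recently used entry when over capacity."""
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            old_key, _ = self._items.popitem(last=False)
            self.evicted.append(old_key)
```

With a capacity of two, storing `p0` and `p1`, reading `p0`, and then storing `p2` evicts `p1`, because `p0` was touched more recently. Data that no longer fits in memory has to be recomputed or read back from disk, which is why the slide cautions against calling Spark a purely in-memory technology.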