Comparison - RDBMS vs Hadoop vs Apache Spark

  1. RDBMS Vs Hadoop Vs Spark
  2. Relational Database Management System
     • An RDBMS, or relational database management system, is software that lets users update, query, and manage relational databases. Structured Query Language (SQL) is the most common language used to access such a database. The SQL standard has been extended to support storing, retrieving, and publishing JSON data within a relational database, which provides greater flexibility.
     • The most fundamental RDBMS functions are the create, read, update, and delete operations, collectively referred to as CRUD. They serve as the foundation for a well-organized system that promotes consistent treatment of data.
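The four CRUD operations can be sketched with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: the `customers` table, its columns, and the sample values are hypothetical, and the JSON document is stored as plain text and decoded in Python rather than through any database-specific JSON function.

```python
import json
import sqlite3

# In-memory database, so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL, profile TEXT)"
)

# Create: insert one row; the JSON document is stored as text.
conn.execute(
    "INSERT INTO customers (name, profile) VALUES (?, ?)",
    ("Ada", json.dumps({"tier": "gold"})),
)

# Read: fetch the row back and decode the JSON column.
row = conn.execute(
    "SELECT id, name, profile FROM customers WHERE name = ?", ("Ada",)
).fetchone()
customer_id, name, profile = row[0], row[1], json.loads(row[2])

# Update: change the name of that row.
conn.execute("UPDATE customers SET name = ? WHERE id = ?", ("Ada Lovelace", customer_id))
updated_name = conn.execute(
    "SELECT name FROM customers WHERE id = ?", (customer_id,)
).fetchone()[0]

# Delete: remove the row and count what is left.
conn.execute("DELETE FROM customers WHERE id = ?", (customer_id,))
remaining = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
conn.close()
```

Parameter placeholders (`?`) are used instead of string formatting so that values are passed to the database separately from the SQL text.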
  3. Hadoop
     • Apache Hadoop is a set of open-source software utilities for solving problems that involve massive amounts of data and computation by using a network of many computers. It provides a software framework for distributed big data storage and processing based on the MapReduce programming model.
     • The core of Apache Hadoop consists of a storage component, the Hadoop Distributed File System (HDFS), and a processing component that uses the MapReduce programming model. Hadoop divides files into large blocks and distributes them across cluster nodes. It then ships packaged code to the nodes so that the data can be processed in parallel. This approach takes advantage of data locality.
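The MapReduce model described above can be sketched in plain Python as a word count. This is a single-process illustration, not Hadoop code: the function names, the tiny `block_size`, and the sample lines are all assumptions made for the example. On a real cluster each block would be mapped on the node that stores it.

```python
from collections import defaultdict


def split_into_blocks(lines, block_size):
    """Split the input into fixed-size blocks, loosely mimicking how a file is divided."""
    return [lines[i:i + block_size] for i in range(0, len(lines), block_size)]


def map_phase(block):
    """Emit a (word, 1) pair for every word in one block."""
    return [(word.lower(), 1) for line in block for word in line.split()]


def shuffle(mapped_blocks):
    """Group the emitted values by key across all blocks."""
    groups = defaultdict(list)
    for pairs in mapped_blocks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines, block_size=2):
    blocks = split_into_blocks(lines, block_size)
    # Here the blocks are mapped one after another; a cluster would map them in parallel.
    mapped = [map_phase(block) for block in blocks]
    return reduce_phase(shuffle(mapped))


lines = ["big data big compute", "data locality", "big cluster"]
counts = word_count(lines)
```

Because each block is mapped independently, the map step needs no coordination between blocks; only the shuffle step brings values for the same key together.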
  4. Spark
     • Apache Spark is a free and open-source unified analytics engine for processing large amounts of data. Spark provides a programming interface for entire clusters with implicit data parallelism and fault tolerance.
     • Apache Spark requires a cluster manager and a distributed storage system. For cluster management, Spark supports standalone mode (a native Spark cluster, which you can launch either manually or with the launch scripts provided by the install package; these daemons can also be run on a single machine for testing), Hadoop YARN, Apache Mesos, or Kubernetes.
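One way to picture data parallelism combined with fault tolerance is a partitioned dataset that remembers the transformations applied to it, so any single partition can be rebuilt from its source. The toy class below is a hypothetical sketch in plain Python; `ToyDataset` and its methods are invented for illustration and are not Spark's API.

```python
class ToyDataset:
    """Hypothetical partitioned dataset that records its transformations (not Spark's API)."""

    def __init__(self, partitions, lineage=()):
        self._source = partitions  # list of lists, one inner list per partition
        self._lineage = lineage    # per-partition steps, applied in order

    def map(self, fn):
        step = lambda part: [fn(x) for x in part]
        return ToyDataset(self._source, self._lineage + (step,))

    def filter(self, pred):
        step = lambda part: [x for x in part if pred(x)]
        return ToyDataset(self._source, self._lineage + (step,))

    def compute_partition(self, index):
        """Rebuild one partition from its source by replaying the recorded steps."""
        part = list(self._source[index])
        for step in self._lineage:
            part = step(part)
        return part

    def collect(self):
        """Compute every partition and concatenate the results."""
        result = []
        for index in range(len(self._source)):
            result.extend(self.compute_partition(index))
        return result


numbers = ToyDataset([[1, 2, 3], [4, 5, 6]])
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
result = even_squares.collect()
# If the second partition were lost, it could be recomputed on its own.
recomputed = even_squares.compute_partition(1)
```

Nothing is computed when `map` or `filter` is called; work happens only in `collect` or `compute_partition`, and each partition is processed independently of the others.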
  5. RDBMS Vs Hadoop Vs Spark
     • Data Variety. RDBMS: used for structured data only. Hadoop: used for semi-structured, unstructured, and structured data. Spark: used for semi-structured, unstructured, and structured data.
     • Data Storage. RDBMS: used for average data sets (in GBs). Hadoop: used for large data sets (TBs and PBs). Spark: used for large data sets (TBs and PBs).
     • Querying. RDBMS: SQL. Hadoop: HQL (Hive Query Language). Spark: Spark SQL.
  6. RDBMS Vs Hadoop Vs Spark
     • Schema. RDBMS: required on write (static schema). Hadoop: required on read (dynamic schema). Spark: required on read (dynamic schema).
     • Speed. RDBMS: reads are fast. Hadoop: both reads and writes are fast. Spark: more than 100 times faster than Hadoop in some cases.
     • Cost. RDBMS: license. Hadoop: free. Spark: free.
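The difference between schema on write and schema on read can be sketched as follows. This is a minimal illustration under stated assumptions: the `events` table, the raw JSON records, and the `read_events` helper are hypothetical, with SQLite standing in for an RDBMS and a list of JSON strings standing in for raw files on distributed storage.

```python
import json
import sqlite3

# Schema on write: the table definition rejects a malformed row at insert time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER NOT NULL, action TEXT NOT NULL)")
conn.execute("INSERT INTO events VALUES (?, ?)", (1, "login"))
try:
    conn.execute("INSERT INTO events VALUES (?, ?)", (2, None))  # violates NOT NULL
    write_rejected = False
except sqlite3.IntegrityError:
    write_rejected = True
stored_rows = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
conn.close()

# Schema on read: raw records are kept as-is and structure is applied when reading.
raw_records = [
    '{"user_id": 1, "action": "login"}',
    '{"user_id": 2}',
    '{"user_id": 3, "action": "logout", "device": "phone"}',
]


def read_events(records):
    """Project each raw record onto the fields this particular reader cares about."""
    for line in records:
        record = json.loads(line)
        yield {
            "user_id": int(record["user_id"]),
            "action": record.get("action", "unknown"),
        }


events = list(read_events(raw_records))
```

In the first half the incomplete row never reaches storage; in the second half it is stored anyway, and the reader decides how to handle the missing field.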
  7. RDBMS Vs Hadoop Vs Spark
     • Data Objects. RDBMS: works on relational tables. Hadoop: works on key-value pairs. Spark: works on Resilient Distributed Datasets (RDDs).
     • Hardware Profile. RDBMS: high-end profiles. Hadoop: commodity/utility hardware. Spark: high-end profiles.
     • Use Cases. RDBMS: OLTP (online transaction processing). Hadoop: analytics (audio, video, logs, etc.), data discovery. Spark: streaming data, machine learning, fog computing, interactive analyses.
  8. RDBMS Vs Hadoop Vs Spark Benefits
     RDBMS
     • Maintainability: database administrators can easily maintain, control, and update data in the database
     • Flexibility: updating data in one place is enough, which saves a lot of time
     • Data structure: data is stored in a tabular format that is organized and easily understood by users
     • Privileges: database administrators can control which activities are allowed on the database
     • Data safety: data is protected by authorization codes and other security layers, and stays safe when a program crashes
     Hadoop
     • Scalable: it can store and distribute very large data sets
     • Cost-effective: raw data that would otherwise be deleted as too cost-prohibitive to keep can be stored affordably
     • Flexible: easy access to new data sources and the ability to tap into different types of data
     • Fast: its storage method is based on a distributed file system that ‘maps’ data to its location on the cluster
     • Resilient to failure: in the event of a failure, another copy of the data is available for use
     Spark
     • Speed: up to 100 times faster than Hadoop for large-scale data processing in some cases
     • Ease of use: easy-to-use APIs for operating on large datasets
     • Advanced analytics: supports machine learning (ML), graph algorithms, streaming data, SQL queries, and more
     • Dynamic: makes it easy to develop parallel applications
     • Multilingual: supports several programming languages, such as Python, Java, and Scala
     • Powerful: can handle many analytics challenges
  9. RDBMS Vs Hadoop Vs Spark Limitations
     RDBMS
     • The software is expensive
     • Complex software requires expensive hardware, which increases the overall cost of the RDBMS service
     • It requires skilled staff to implement
     • Certain applications are slow to process
     • Lost data is difficult to recover
     Hadoop
     • It struggles when it has to access a large number of small files
     • It is written in Java, which can make it easier for cyber-criminals to exploit
     • Its efficiency decreases when working with small data sets
     • It uses Kerberos for security, which is not easy to manage; Kerberos also lacks storage and network encryption
     Spark
     • It has no file management system of its own, so it must be integrated with other platforms
     • It does not fully support real-time data stream processing
     • Keeping data in memory is costly, which works against cost-efficient processing of big data
     • It has a problem with small files when used with Hadoop
     • Its latency is comparatively high, which results in lower throughput
  10. Thank You