Presented by Cuelogic Technologies
BIG DATA FRAMEWORKS
Introduction
Three V's are vital for classifying data as Big Data: Volume, Velocity and Variety.
Volume:
Data volumes are measured in terabytes, petabytes and beyond.
Velocity:
Velocity concerns the high speed of data movement, such as real-time data streamed at a rapid rate, within microseconds.
Variety:
Variety concerns the handling approach for both structured and unstructured data.
THINK ABOUT IT
Big Data infrastructure and technology have been implemented in various industries such as banking, retail, insurance, healthcare and media.
Big Data management functions like storage, sorting, processing and analysis for such colossal volumes cannot be handled by existing database systems or technologies.
There are many frameworks in this space. Some of the popular ones are Spark, Hadoop, Hive and Storm.
Some, like Presto, score high on the utility index, while frameworks like Flink have great potential.
Others that deserve mention include Samza, Impala and Apache Pig.
Some of these frameworks are briefly discussed below.
Apache Hadoop
Hadoop is a Java-based platform created by Mike Cafarella and Doug Cutting.
This open-source framework provides batch data processing as well as data storage services across a group of machines arranged in clusters.
Hadoop consists of multiple layers, such as HDFS, YARN and MapReduce, that work together to carry out data processing.
HDFS (Hadoop Distributed File System) is the storage layer that coordinates data replication and storage across the nodes of a cluster. If a cluster node fails, the data can still be made available for processing from its replicas.
YARN (Yet Another Resource Negotiator) is the layer responsible for resource management and job scheduling.
MapReduce is the software layer that functions as the batch processing engine.
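The batch model behind MapReduce can be illustrated with a small word count. This is only a sketch in plain Python, not Hadoop's actual API: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and everything runs in a single process rather than across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over a batch of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big frameworks"]))
```

In a real cluster the map and reduce calls would run on different nodes, with the shuffle step moving data between them over the network.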
Pros:
Cost-effective solution, high throughput, multi-language support, compatibility with most emerging technologies in Big Data services, high scalability, fault tolerance, better suited for R&D, and high availability through an excellent failure handling mechanism.
Cons:
Vulnerability to security breaches, no in-memory computation (hence processing overheads), not suited for stream processing or real-time processing, and issues in processing large numbers of small files.
Apache Spark
Spark is a batch processing framework with enhanced data stream processing capabilities.
With full in-memory computation and processing optimisation, it promises a lightning-fast cluster computing system.
The Spark framework is composed of five layers.
HDFS and HBase: They form the first layer, the data storage systems.
YARN and Mesos: They form the second layer, resource management.
Core engine: This forms the third layer.
Library: This forms the fourth layer, containing Spark SQL for SQL queries, Spark Streaming for stream processing, GraphX for processing graph data, SparkR utilities, and MLlib for machine learning algorithms.
API: The fifth layer contains application programming interfaces for languages such as Java or Scala.
Pros:
Scalability, lightning processing speeds through a reduced number of I/O operations to disk, fault tolerance, support for advanced analytics applications with superior AI implementation, and seamless integration with Hadoop.
Cons:
Complexity of setup and implementation, language support limitations, and not a genuine streaming engine.
Storm
Storm is a platform-independent application development framework that can be used with any programming language and guarantees delivery of data with the least latency.
The Storm architecture has two types of nodes: the Master Node and the Worker/Supervisor Node. The master node monitors machine failures and is responsible for task allocation. If a node in the cluster fails, its tasks are reassigned to another node.
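The master node's role of allocating tasks and reassigning them after a failure can be sketched as follows. This is a simplified illustration in plain Python, not Storm's API: the round-robin allocation, the least-loaded reassignment policy, and the names assign_tasks and handle_failure are all assumptions made for the example.

```python
def assign_tasks(tasks, workers):
    """Allocate tasks to worker nodes round-robin (a hypothetical policy)."""
    assignment = {worker: [] for worker in workers}
    for i, task in enumerate(tasks):
        assignment[workers[i % len(workers)]].append(task)
    return assignment


def handle_failure(assignment, failed_worker):
    """Move the failed worker's tasks to the least-loaded surviving workers."""
    orphaned = assignment.pop(failed_worker)
    if not assignment:
        raise RuntimeError("no surviving workers to take over the tasks")
    for task in orphaned:
        target = min(assignment, key=lambda w: len(assignment[w]))
        assignment[target].append(task)
    return assignment


if __name__ == "__main__":
    plan = assign_tasks(["t1", "t2", "t3", "t4"], ["w1", "w2", "w3"])
    print(handle_failure(plan, "w1"))
```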
Pros:
Ease in setup and operation, high scalability, good speed, fault tolerance, and support for a wide range of languages.
Cons:
Complex implementation, debugging issues, and not very learner-friendly.
Apache Flink
Apache Flink, an open-source framework, is equally good for both batch and stream data processing.
It is suited for cluster environments. It is based on the concept of streams and transformations.
It is sometimes called the 4G of Big Data and is claimed to be up to 100 times faster than Hadoop MapReduce.
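The streams-and-transformations idea behind Flink can be sketched with Python generators, where each record flows through a chain of transformations one entry at a time. This is not Flink's API; the function names and the temperature example are hypothetical and only illustrate the concept.

```python
def source(records):
    """Produce a stream of records one at a time."""
    for record in records:
        yield record


def map_stream(stream, fn):
    """Transformation: apply fn to every record as it arrives."""
    for record in stream:
        yield fn(record)


def filter_stream(stream, predicate):
    """Transformation: pass on only the records that satisfy predicate."""
    for record in stream:
        if predicate(record):
            yield record


def pipeline(fahrenheit_readings):
    """Chain transformations: convert to Celsius, keep readings above 30."""
    readings = source(fahrenheit_readings)
    celsius = map_stream(readings, lambda f: round((f - 32) * 5 / 9, 1))
    return filter_stream(celsius, lambda c: c > 30.0)


if __name__ == "__main__":
    for value in pipeline([50, 95, 104]):
        print(value)
```

Because generators are lazy, each reading passes through the whole chain before the next one is read, which mirrors entry-by-entry processing.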
The Flink system contains multiple layers:
Deploy Layer
Runtime Layer
Library Layer
Pros:
Low latency, high throughput, fault tolerance, entry-by-entry processing, ease of batch and stream data processing, and compatibility with Hadoop.
Cons:
A few scalability issues.
Hive
Apache Hive, designed by Facebook, is an ETL (Extract/Transform/Load) and data warehousing system. It is built on top of the Hadoop HDFS platform.
The key components of the Hive architecture include:
Hive Clients
Hive Services
Hive Storage and Computing
The Hive engine converts SQL queries or requests into MapReduce task chains. The engine comprises:
Parser: It goes through the incoming SQL requests and sorts them.
Optimizer: It goes through the sorted requests and optimises them.
Executor: It sends tasks to the MapReduce framework.
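The parser, optimizer and executor stages can be pictured as a small pipeline. The sketch below is a toy in plain Python and does not reflect Hive's real internals: the supported query shape, the plan format, the single optimisation rule (dropping a trivially true filter) and the task strings are all assumptions for illustration.

```python
import re

# Toy grammar: SELECT <cols> FROM <table> [WHERE <condition>]
QUERY_PATTERN = re.compile(
    r"\s*SELECT\s+(.+?)\s+FROM\s+(\w+)(?:\s+WHERE\s+(.+?))?\s*;?\s*",
    re.IGNORECASE,
)


def parse(query):
    """Parser: turn a query string into an ordered list of plan steps."""
    match = QUERY_PATTERN.fullmatch(query)
    if match is None:
        raise ValueError(f"unsupported query: {query!r}")
    cols, table, cond = match.groups()
    plan = [("scan", table)]
    if cond:
        plan.append(("filter", cond))
    plan.append(("project", ",".join(c.strip() for c in cols.split(","))))
    return plan


def optimize(plan):
    """Optimizer: drop filters whose condition is trivially true (1 = 1)."""
    return [
        (op, arg) for op, arg in plan
        if not (op == "filter" and arg.replace(" ", "") == "1=1")
    ]


def execute(plan):
    """Executor: describe the chain of map tasks that would be submitted."""
    return [f"map:{op}:{arg}" for op, arg in plan]


def run_query(query):
    return execute(optimize(parse(query)))


if __name__ == "__main__":
    print(run_query("SELECT name FROM users WHERE age > 30"))
```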
Presto
Presto is an open-source distributed SQL tool best suited for smaller datasets of up to 3 TB. The Presto engine includes a coordinator and multiple workers.
When a client submits queries, the coordinator parses and analyses them, plans their execution, and distributes them among the workers for processing.
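The coordinator and worker roles described for Presto can be sketched with a thread pool standing in for the workers. This is an illustration in plain Python, not Presto's API: the function names, the striped splitting of rows and the counting query are assumptions for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def plan_splits(rows, n_workers):
    """Coordinator step: divide the input into one split per worker."""
    return [rows[i::n_workers] for i in range(n_workers)]


def worker_task(split, predicate):
    """Worker step: process one split and return a partial count."""
    return sum(1 for row in split if predicate(row))


def coordinator_count(rows, predicate, n_workers=3):
    """Plan the splits, hand them to the workers, then merge the partials."""
    splits = plan_splits(rows, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(lambda s: worker_task(s, predicate), splits))
    return sum(partials)


if __name__ == "__main__":
    print(coordinator_count(list(range(10)), lambda x: x % 2 == 0))
```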
Pros:
Least query degradation even under an increased concurrent query workload, a query execution rate three times faster than Hive, ease in adding images and embedding links, and high user-friendliness.
Cons:
Reliability issues.
Impala
Impala is an open-source MPP (Massively Parallel Processing) query engine that runs on multiple systems in a Hadoop cluster.
It is written in C++ and Java.
It is not coupled with its storage engine. It includes three main components:
Impala Daemon (impalad): It runs on every node where Impala is installed.
Impala StateStore
Impala MetaStore
Impala has its own SQL-like query language.
Pros:
In-memory computation that accesses data directly on the Hadoop nodes without moving it, smooth integration with BI tools like Tableau, ZoomData, etc., and support for a wide range of file formats.
Cons:
No support for serialisation and deserialisation of data, inability to read custom binary files, and a table refresh needed for every record addition.
Contact Us
+1 347 374 8437
info@cuelogic.com
https://www.cuelogic.com/
Unit 610, 134 W 29th St,
New York, NY 10001
Content Source: Cuelogic Blog