Presented by Cuelogic Technologies
BIG DATA FRAMEWORKS
Introduction
There are three V's that are vital for classifying data as Big
Data: Volume, Velocity and Variety.
Volume:
Data volumes are measured in terabytes, petabytes and beyond.
Velocity:
Velocity concerns the speed of data movement, such as real-time
data streaming at rates measured in microseconds.
Variety:
Variety concerns the handling of both structured and unstructured
data.
THINK ABOUT IT
Implementation of Big Data infrastructure and technology
can be seen in various industries like banking,
retail, insurance, healthcare, media, etc.
Big Data management functions like storage, sorting,
processing and analysis for such colossal volumes cannot be
handled by the existing database systems or technologies.
Many frameworks exist in this space. Some of the popular ones are
Spark, Hadoop, Hive and Storm.
Some, like Presto, score high on the utility index, while frameworks
like Flink have great potential.
Others that deserve a mention include Samza, Impala, Apache Pig, etc.
Some of these frameworks are briefly discussed below.
Apache Hadoop
Hadoop is a Java-based platform created by Mike Cafarella and Doug
Cutting.
This open-source framework provides batch data processing as well
as data storage services across a group of hardware machines
arranged in clusters.
Hadoop consists of multiple layers, such as HDFS and YARN, that work
together to carry out data processing.
HDFS (Hadoop Distributed File System) is the storage layer that
coordinates data replication and storage activities across the
nodes of a cluster. In the event of a cluster node failure, the
data can still be made available for processing.
YARN (Yet Another Resource Negotiator) is the layer responsible
for resource management and job scheduling.
MapReduce is the software layer that functions as the batch
processing engine.
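The batch model behind MapReduce can be sketched in a few lines of plain Python. This is only an illustration of the map, shuffle and reduce steps on a single machine; it does not use Hadoop's API, and the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group all emitted values by key, between the map and reduce steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in sequence over a small in-memory data set."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data frameworks", "big data tools"]))
```

In a real cluster the map and reduce steps run in parallel on many nodes and the intermediate pairs are written to disk, which is one source of the processing overhead listed under Cons.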
Pros:
Cost-effective solution, high throughput, multi-language support,
compatibility with most emerging technologies in Big Data services,
high scalability, fault tolerance, better suited for R&D, high
availability through an excellent failure-handling mechanism.
Cons:
Vulnerability to security breaches, no in-memory computation and
hence processing overheads, not suited for stream processing and
real-time processing, issues in processing large numbers of small
files.
Apache Spark
Spark is a batch processing framework with enhanced data stream
processing.
With full in-memory computation and processing optimisation, it
promises a lightning-fast cluster computing system.
The Spark framework is composed of five layers.
HDFS and HBase: They form the first layer, the data storage
systems.
YARN and Mesos: They form the resource management layer.
Core engine: This forms the third layer.
Library: This forms the fourth layer, containing Spark SQL for SQL
queries, Spark Streaming for stream processing, GraphX for
processing graph data, SparkR utilities and MLlib for machine
learning algorithms.
API: The fifth layer contains an application program interface for
languages such as Java or Scala.
Pros:
Scalability, lightning processing speeds through a reduced number
of I/O operations to disk, fault tolerance, support for advanced
analytics applications with superior AI implementation, and
seamless integration with Hadoop.
Cons:
Complexity of setup and implementation, limited language support,
not a genuine streaming engine.
Storm
Storm is a platform-independent application development framework
that can be used with any programming language and guarantees
delivery of data with the least latency.
The Storm architecture has two kinds of nodes: the Master Node and
the Worker/Supervisor Node. The master node monitors machines for
failures and is responsible for task allocation. If a machine in
the cluster fails, its tasks are reassigned to another one.
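The master node's two duties, task allocation and reassignment after a failure, can be sketched as follows. This is a simplified model in plain Python, not Storm's API; the MasterNode class, its least-loaded allocation rule and the worker names are assumptions made for illustration.

```python
class MasterNode:
    """Hypothetical stand-in for the master node: tracks workers and their tasks."""

    def __init__(self, workers):
        # Map each live worker to the list of tasks currently assigned to it.
        self.assignments = {worker: [] for worker in workers}

    def assign(self, task):
        """Allocate a task to the live worker that currently has the fewest tasks."""
        worker = min(self.assignments, key=lambda w: len(self.assignments[w]))
        self.assignments[worker].append(task)
        return worker

    def handle_failure(self, failed_worker):
        """Drop a failed worker and reassign its tasks to the remaining workers."""
        orphaned = self.assignments.pop(failed_worker)
        if orphaned and not self.assignments:
            raise RuntimeError("no live workers left to take over the tasks")
        for task in orphaned:
            self.assign(task)


if __name__ == "__main__":
    master = MasterNode(["worker-1", "worker-2", "worker-3"])
    for name in ["t1", "t2", "t3", "t4", "t5", "t6"]:
        master.assign(name)
    master.handle_failure("worker-2")
    print(master.assignments)
```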
Pros:
Ease in setup and operation, high scalability, good speed, fault
tolerance, support for a wide range of languages.
Cons:
Complex implementation, debugging issues and not very
learner-friendly.
Apache Flink
Apache Flink, an open-source framework, is equally good for both
batch and stream data processing.
It is suited for cluster environments and is based on the concept
of transformations applied to streams.
It is sometimes called the 4G of Big Data and is claimed to be up
to 100 times faster than Hadoop MapReduce.
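The difference between batch processing and entry-by-entry stream processing, both of which Flink handles, can be illustrated in plain Python. This sketch does not use Flink's API; the function names and the running-average example are assumptions chosen for illustration.

```python
def batch_average(readings):
    """Batch style: the whole bounded data set is available before processing."""
    return sum(readings) / len(readings)


def streaming_average(readings):
    """Stream style: update state and emit a result for each entry as it arrives."""
    count = 0
    total = 0.0
    for value in readings:
        count += 1
        total += value
        yield total / count


if __name__ == "__main__":
    data = [2, 4, 6]
    print(batch_average(data))
    for running in streaming_average(iter(data)):
        print(running)
```

The streaming version never needs the full data set in memory and produces an up-to-date result after every entry, which is what makes low latency possible.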
The Flink system contains multiple layers:
Deploy Layer
Runtime Layer
Library Layer
Pros:
Low latency, high throughput, fault tolerance, entry-by-entry
processing, ease of batch and stream data processing,
compatibility with Hadoop.
Cons:
A few scalability issues.
Hive
Apache Hive, designed by Facebook, is an ETL (Extract/Transform/
Load) and data warehousing system. It is built on top of the Hadoop
HDFS platform.
The key components of the Hive architecture include
Hive Clients
Hive Services
Hive Storage and Computing
The Hive engine converts SQL queries or requests into MapReduce
task chains. The engine comprises:
Parser: It goes through the incoming SQL requests and sorts them.
Optimizer: It goes through the sorted requests and optimises them.
Executor: It sends tasks to the MapReduce framework.
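The three engine stages listed above can be pictured as a small pipeline. The sketch below only mirrors the stages as this deck describes them and is not Hive's internal implementation; every function name is hypothetical, duplicate removal is a stand-in for real query optimisation, and submit_task represents whatever hands a task to the batch framework.

```python
def parse(requests):
    """Parser stage: normalise whitespace in each request, then sort the requests."""
    cleaned = [" ".join(request.split()).rstrip(";") for request in requests]
    return sorted(cleaned)


def optimise(sorted_requests):
    """Optimizer stage (stand-in): drop exact duplicates while keeping the order."""
    return list(dict.fromkeys(sorted_requests))


def execute(optimised_requests, submit_task):
    """Executor stage: hand every request to the batch framework as a task."""
    return [submit_task(request) for request in optimised_requests]


def run_engine(requests, submit_task):
    """Chain the three stages: parser, then optimizer, then executor."""
    return execute(optimise(parse(requests)), submit_task)


if __name__ == "__main__":
    incoming = ["SELECT  b FROM t;", "SELECT a FROM t", "SELECT b FROM t"]
    # A hypothetical submit function that just labels the task it receives.
    print(run_engine(incoming, lambda query: "task for: " + query))
```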
Pros:
Low latency, high throughput, fault tolerance, entry-by-entry
processing, ease of batch and stream data processing,
compatibility with Hadoop.
Cons:
A few scalability issues.
Presto
Presto is an open-source distributed SQL tool best suited for
smaller datasets of up to 3 TB. The Presto engine includes a
coordinator and multiple workers.
When a client submits queries, the coordinator parses and analyses
them, plans their execution and distributes them among the workers
for processing.
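The coordinator and worker roles in Presto can be sketched with a small scatter-and-gather example. This is a plain Python model, not Presto's API: threads stand in for worker nodes, and the function names, the round-robin split and the default of three workers are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def worker_count_matches(rows, predicate):
    """Worker: process one split of the data and return a partial result."""
    return sum(1 for row in rows if predicate(row))


def coordinator_count(rows, predicate, num_workers=3):
    """Coordinator: plan the splits, distribute them and merge the partial results."""
    # Plan: divide the rows into one split per worker (round-robin).
    splits = [rows[i::num_workers] for i in range(num_workers)]
    # Distribute: each split is processed by a separate worker thread.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = pool.map(
            lambda split: worker_count_matches(split, predicate), splits
        )
        # Merge: combine the partial counts into the final answer.
        return sum(partials)


if __name__ == "__main__":
    # Roughly "SELECT COUNT(*) FROM numbers WHERE n % 2 = 0" over 0..9.
    print(coordinator_count(list(range(10)), lambda n: n % 2 == 0))
```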
Pros:
Least query degradation even in the event of an increased
concurrent query workload. It has a query execution rate that is
three times faster than Hive. Ease in adding images and embedding
links. Highly user-friendly.
Cons:
Reliability issues.
Impala
Impala is an open-source MPP (Massively Parallel Processing) query
engine that runs on multiple systems in a Hadoop cluster.
It is written in C++ and Java.
It is not coupled with its storage engine. It includes three main
components:
Impala Daemon (impalad): It is executed on every
node where Impala is installed.
Impala StateStore
Impala MetaStore
Impala has its own SQL-like query language.
Pros:
Supports in-memory computation and so accesses data directly on
the Hadoop nodes without moving it, smooth integration with BI
tools like Tableau, ZoomData, etc., supports a wide range of file
formats.
Cons:
No support for serialisation and deserialisation of data, inability
to read custom binary files, table refresh needed for every record
addition.
Contact Us
+1 347 374 8437
info@cuelogic.com
https://www.cuelogic.com/
Unit 610, 134 W 29th St,
New York, NY 10001
Content Source: Cuelogic Blog