Big Data Architecture
Dr. G. Jasmine Beulah
When Do I Need Big Data Architecture?
Exploiting Big Data requires a Big Data architecture, but not
everyone needs one.
Data on the order of hundreds of gigabytes generally does not require
a dedicated Big Data architecture.
Unless you consistently process data on the order of terabytes or
petabytes and may need to scale up in the future, you do not need a
Big Data architecture.
Additionally, a Big Data architecture makes sense when you want to
invest in a Big Data project and have multiple sources of Big Data.
Big Data Architecture
A big data architecture is designed to handle the ingestion, processing,
and analysis of data that is too large or complex for traditional
database systems.
Sources Layer
The Big Data sources govern the Big Data architecture: its design
depends heavily on them. Data arrives from numerous sources and in
different formats.
These include relational databases, company servers, sensors such as
IoT devices, third-party data providers, etc.
This data can be batch data as well as real-time data.
These sources accumulate a huge amount of data in very little time.
The Big Data architecture is designed to be capable of handling this
data.
Data Ingestion
This is the first layer, where the journey of Big Data arriving
from numerous sources begins.
This layer categorizes the data so that it flows smoothly into the
further layers of the architecture.
Its primary goal is to provide trouble-free transportation of data
into those layers.
Generally, Apache Kafka or REST APIs are used for ingestion.
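The categorizing step described above can be sketched in plain Python. This is only an illustration of routing by source type, not a Kafka or REST client; the record fields (`source`, `payload`), the route names, and the `quarantine` fallback are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical mapping from source type to downstream route
# (assumed for illustration, not taken from the slides).
ROUTES = {
    "rdbms": "batch",
    "server_log": "batch",
    "third_party": "batch",
    "iot_sensor": "stream",
}


def ingest(records):
    """Group incoming records by the route their source type maps to.

    Records with a missing or unknown source go to a 'quarantine' route
    so that they do not block the flow into later layers.
    """
    routed = defaultdict(list)
    for record in records:
        route = ROUTES.get(record.get("source"), "quarantine")
        routed[route].append(record)
    return dict(routed)


if __name__ == "__main__":
    incoming = [
        {"source": "rdbms", "payload": {"order_id": 1}},
        {"source": "iot_sensor", "payload": {"temp_c": 21.5}},
        {"source": "fax", "payload": "unreadable"},
    ]
    for route, items in ingest(incoming).items():
        print(route, len(items))
```

In a real deployment the `batch` and `stream` routes would correspond to actual destinations, such as a file drop or a message topic.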
Storage Layer
• This layer is at the receiving end of the Big Data. It receives data
from the various data sources and stores it in the most appropriate
manner.
• This layer can even change the format of the data according to the
requirements of the system.
• For example, data for batch processing is generally stored in a
distributed file storage system such as HDFS, which can store high
volumes of data in different formats.
• Structured data, on the other hand, can be stored in an RDBMS. It
all depends on the format of the data and the purpose we need it for.
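The two storage decisions above, picking a store by format and purpose and converting the data format, might look like the following sketch. The rule in `choose_store` and the names `rdbms` and `distributed_fs` are assumptions for illustration; real systems weigh many more factors.

```python
import json


def choose_store(data_format, purpose):
    """Pick a storage backend name from the data's format and purpose.

    Assumed rule for this sketch: structured data needed for
    transactional access goes to a relational database; everything
    else goes to a distributed file system such as HDFS.
    """
    if data_format == "structured" and purpose == "transactional":
        return "rdbms"
    return "distributed_fs"


def to_json_lines(rows):
    """Convert a list of dict rows to newline-delimited JSON text."""
    return "\n".join(json.dumps(row, sort_keys=True) for row in rows)


if __name__ == "__main__":
    print(choose_store("structured", "transactional"))
    print(choose_store("unstructured", "batch"))
    print(to_json_lines([{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]))
```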
Analysis Layer
• The main goal of companies employing Big Data is to gain insights
from it and thus make data-driven decisions.
• The analysis layer is therefore the most important layer in the
Big Data architecture: it is what lets users analyze Big Data.
• The analysis layer reads from the storage layer to derive valuable
insights.
• The architecture requires multiple tools to analyze Big Data.
• Structured data is easy to handle, whereas more advanced tools are
needed to analyze unstructured data.
Batch Processing
Because the data is so large, the architecture needs a batch
processing system to filter, aggregate, and process data for advanced
analytics.
These are long-running batch jobs.
Each job reads data from the storage layer, processes it, and finally
writes the output to new files.
Hadoop is one of the most commonly used solutions for this.
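The read, filter, aggregate, write cycle of a batch job can be illustrated on a single machine with the standard library. This is a toy stand-in for a Hadoop job, not Hadoop code; the column names `region` and `amount` and the `min_amount` threshold are assumptions for the example.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path


def run_batch_job(input_path, output_path, min_amount=0.0):
    """Read sales rows, drop small ones, total per region, write a new file."""
    totals = defaultdict(float)
    with open(input_path, newline="") as src:
        for row in csv.DictReader(src):
            amount = float(row["amount"])
            if amount < min_amount:  # filter step
                continue
            totals[row["region"]] += amount  # aggregate step
    with open(output_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(["region", "total"])
        for region in sorted(totals):
            writer.writerow([region, totals[region]])
    return dict(totals)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "sales.csv"
        result = Path(tmp) / "totals.csv"
        source.write_text("region,amount\nnorth,10\nsouth,5\nnorth,2.5\n")
        print(run_batch_job(source, result, min_amount=1.0))
        print(result.read_text())
```

A distributed framework applies the same pattern, but splits the input across many machines and merges the partial totals.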
Real-Time Processing
Processing data as it arrives in real time is one of the hottest
trends in the Big Data world.
The Big Data architecture, therefore, must include a system to capture
and store real-time data.
In the simplest case, this is done by ingesting the real-time data
into a data store for processing.
The architecture needs a robust system for dealing with real-time
data.
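A minimal sketch of capturing real-time data into a store and processing it on arrival is shown below. The class name `RealTimeStore`, the event field `type`, and the in-memory bounded buffer are assumptions for illustration; a production system would use a durable, distributed store.

```python
from collections import deque


class RealTimeStore:
    """Keep the most recent events in memory plus a running count per type."""

    def __init__(self, max_events=1000):
        # Bounded buffer: the oldest event is dropped once it is full.
        self.recent = deque(maxlen=max_events)
        self.counts = {}

    def capture(self, event):
        """Store one event and update the running count for its type."""
        self.recent.append(event)
        key = event["type"]
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key]


if __name__ == "__main__":
    store = RealTimeStore(max_events=2)
    for event_type in ["click", "view", "click"]:
        store.capture({"type": event_type})
    print(store.counts)
    print(list(store.recent))
```

The counts are updated per event rather than recomputed over the whole data set, which is the main difference from a batch job.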
BI Layer
This layer receives the final analysis output and delivers it to the
appropriate output system.
The outputs serve different consumers: human viewers, applications,
and business processes.
The whole process of building a Big Data solution includes ingesting
data from multiple sources, repeated data processing operations, and
presenting the results in a report or a dashboard.
Companies then use these reports to make data-driven decisions.
The Architecture of Flipkart Data Platform
FDP Ingestion System
A Big Data ingestion system is the first place where all the data
starts its journey into the data system.
Ingestion is the process of importing data and storing it in a
database.
• Data can be ingested either in batches or as real-time streams.
Simply put, a batch is a collection of data points grouped within a
specific time interval. Streaming, by contrast, deals with a
continuous flow of data.
• Batch data has higher latency than streaming data, whose latency is
typically sub-second.
• There are three ways in which ingestion can be performed:
• Specter – A Java library used for sending data to Kafka.
• Dart Service – A REST service that allows the payload to be sent
over HTTP.
• File Ingestor – A CLI tool used to dump data into HDFS.
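The difference between batches and streams described above can be made concrete with a small sketch. It does not use Specter, Dart Service, or File Ingestor, whose interfaces are not described here; the function names and the `(timestamp, value)` point shape are assumptions for illustration.

```python
from collections import defaultdict


def batch_by_interval(points, interval_seconds):
    """Group (timestamp, value) points into fixed-width time intervals.

    Each key is the start of an interval, in seconds; each value is the
    list of data points that fell inside that interval.
    """
    batches = defaultdict(list)
    for timestamp, value in points:
        start = (timestamp // interval_seconds) * interval_seconds
        batches[start].append(value)
    return dict(batches)


def stream(points):
    """Yield points one at a time, as a streaming consumer would see them."""
    for point in points:
        yield point


if __name__ == "__main__":
    points = [(0, "a"), (59, "b"), (60, "c"), (125, "d")]
    print(batch_by_interval(points, 60))
    for point in stream(points):
        print("received", point)
```

With batching, the value `"a"` is not available until its 60-second interval closes, which is where the extra latency comes from.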
Batch Compute
This part of the Big Data ecosystem computes and processes data that
arrives in batches.
Batch Compute is an efficient method for processing large-scale data
in the form of transactions collected over a period of time. These
batches can be computed at the end of the day, once the data has been
collected in large volumes, and each batch is processed only once.
Because of that, this is the stage at which to explore the data as
thoroughly as possible.
Streaming Platform
Streaming platforms process data as it is generated, with sub-second
latency.
Apache Flink is one of the most popular real-time streaming platforms
used to produce fast analytical results.
It provides distributed, fault-tolerant, and scalable data streaming
capabilities that industries can use to process millions of
transactions with very low latency.
Messaging Queue
A Messaging Queue acts as a buffer or temporary storage system for
messages when the destination is busy or not connected. A message can
be a plain message, a byte array with headers, or a prompt that
commands the messaging queue to process a task.
There are two components in the Messaging Queue architecture:
Producer and Consumer. A Producer generates messages and delivers
them to the messaging queue. A Consumer is the end destination of a
message, where the message is processed.
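The Producer and Consumer roles can be sketched with Python's standard `queue` and `threading` modules. This is an in-process stand-in for a real message broker; the `SENTINEL` end marker, the small `maxsize`, and the upper-casing "processing" step are assumptions made for the example.

```python
import queue
import threading

SENTINEL = None  # assumed end-of-stream marker for this sketch


def producer(q, messages):
    """Generate messages and deliver them to the queue."""
    for message in messages:
        q.put(message)  # blocks while the queue is full
    q.put(SENTINEL)


def consumer(q, processed):
    """Take messages off the queue and process them until the sentinel."""
    while True:
        message = q.get()
        if message is SENTINEL:
            break
        processed.append(message.upper())


def run(messages):
    """Run one producer and one consumer over a small bounded queue."""
    q = queue.Queue(maxsize=2)  # small buffer so the producer must wait
    processed = []
    worker = threading.Thread(target=consumer, args=(q, processed))
    worker.start()
    producer(q, messages)
    worker.join()
    return processed


if __name__ == "__main__":
    print(run(["order placed", "order shipped", "order delivered"]))
```

The bounded queue is what provides the buffering: when the consumer is busy, messages wait in the queue, and the producer pauses only once the buffer is full.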
