3. When Do I Need Big Data Architecture?
Exploiting Big Data requires a Big Data architecture, but not everyone needs one.
Data in the order of hundreds of gigabytes generally does not call for any
special architecture.
A Big Data architecture becomes necessary only when you consistently process
data in the order of terabytes or petabytes, or expect to scale up to such
volumes in the future.
Additionally, you use Big Data architecture when you want to invest in a Big
Data project and have multiple sources of Big Data.
4. Big Data Architecture
A big data architecture is designed to handle the ingestion, processing,
and analysis of data that is too large or complex for traditional
database systems.
5. Sources Layer
The Big Data sources govern the Big Data architecture: its design depends
heavily on where the data comes from.
Data arrives from numerous sources and in many different formats. These
include relational databases, company servers, sensors such as IoT devices,
third-party data providers, etc.
This data can be both batch data and real-time data.
These sources pile up a huge amount of data in very little time.
The Big Data architecture must therefore be designed to handle all of this
data.
6. Data Ingestion
This is the first layer, where the journey of Big Data arriving from
numerous sources begins.
This layer categorizes the incoming data so that it flows smoothly into the
subsequent layers of the architecture.
Its primary goal is to furnish trouble-free transportation of data into the
rest of the data architecture.
Generally, Kafka Streams or REST APIs are used for ingestion.
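As a minimal illustration of this layer (a sketch, not the Kafka Streams or any real ingestion API), an ingestion step can be modeled as a function that stamps and categorizes each incoming record before routing it to the batch or streaming path. The record shape and the `realtime` flag here are assumptions made for the example:

```python
from datetime import datetime, timezone

def ingest(record, batch_buffer, stream_handler):
    """Tag an incoming record and route it to the batch or streaming path.

    `record` is assumed to be a dict; records flagged 'realtime' are handed
    straight to the stream handler, everything else is buffered for a later
    batch load into the storage layer.
    """
    record["ingested_at"] = datetime.now(timezone.utc).isoformat()
    if record.get("realtime"):
        stream_handler(record)       # hand off immediately
    else:
        batch_buffer.append(record)  # accumulate for batch loading
    return record

# Usage: one buffered batch record, one real-time record.
batch, seen = [], []
ingest({"source": "erp", "payload": 1}, batch, seen.append)
ingest({"source": "iot", "payload": 2, "realtime": True}, batch, seen.append)
print(len(batch), len(seen))  # → 1 1
```

In a real deployment the buffer would be a Kafka topic or staging area rather than an in-memory list, but the categorize-then-route responsibility of the layer is the same.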
7. Storage Layer
• This layer is at the receiving end for the Big Data. It receives data
from the various data sources and stores it in the most appropriate
manner.
• This layer can even change the format of the data according to the
requirements of the system.
• For example, batch data is generally stored in a distributed file storage
system such as HDFS, which can hold high-volume data in a variety of
formats.
• Structured data, on the other hand, can be stored in an RDBMS. It all
depends on the format of the data and the purpose we need it for.
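The format-driven storage choice described above can be sketched as a simple routing rule. The store names and format categories below are illustrative assumptions, not a fixed standard:

```python
def choose_store(data_format):
    """Pick a storage backend from the data format (illustrative mapping).

    Structured rows with a fixed schema fit a relational store, while
    semi-structured and unstructured data usually lands in a distributed
    file system such as HDFS.
    """
    routing = {
        "structured": "rdbms",      # e.g. rows with a fixed schema
        "semi-structured": "hdfs",  # e.g. JSON, Avro
        "unstructured": "hdfs",     # e.g. logs, images
    }
    return routing.get(data_format, "hdfs")  # default to the file store

print(choose_store("structured"))    # → rdbms
print(choose_store("unstructured"))  # → hdfs
```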
8. Analysis Layer
• The chief goal of companies employing Big Data is to gain insights from
it and thus make data-driven decisions.
• To empower users to analyze Big Data, the most important layer in the
Big Data architecture is the analysis layer.
• This analysis layer interacts with the storage layer to gain valuable
insights.
• The architecture requires multiple tools to analyze Big Data.
• Structured data is easy to handle, whereas unstructured data requires
more advanced tools to analyze.
9. Batch Processing
Since the data is so huge in size, the architecture needs a batch
processing system to filter, aggregate, and process data for advanced
analytics.
These are long-running batch jobs.
This involves reading the data from the storage layer, processing it,
and finally writing the outputs to new files.
Hadoop is the most commonly used solution for this.
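The read-process-write cycle of such a batch job can be illustrated with a tiny in-memory equivalent of the MapReduce word count. This is only a sketch of the pattern Hadoop implements at scale, not Hadoop's actual API:

```python
from collections import Counter

def batch_job(input_lines):
    """A long-running batch job in miniature: read, aggregate, emit.

    Mirrors the map phase (split each line into words) and the reduce
    phase (count occurrences per word) of a MapReduce-style job run
    over files in the storage layer.
    """
    counts = Counter()
    for line in input_lines:         # "read from the storage layer"
        counts.update(line.split())  # map + combine
    return dict(counts)              # "write the outputs" to a new file

# Usage: two stored lines stand in for files in HDFS.
stored_file = ["big data needs batch jobs", "batch jobs aggregate data"]
print(batch_job(stored_file))
```

In Hadoop the map and reduce phases run in parallel across the cluster and the output lands back in HDFS, but the per-record logic is the same.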
10. Real-Time Processing
Processing the data arriving in real-time is the hottest trend in the Big
Data world.
The Big Data architecture, therefore, must include a robust system to
capture and store real-time data.
This can be as simple as ingesting the real-time data into a data store
for processing.
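The capture-and-store step can be sketched as follows; the in-memory buffer stands in for a real data store and is an assumption of this sketch:

```python
import time

class RealTimeStore:
    """Minimal capture-and-store buffer for real-time events."""

    def __init__(self):
        self.events = []

    def capture(self, event):
        # Stamp arrival time so downstream processing can window on it.
        self.events.append({"ts": time.time(), "event": event})

    def latest(self, n=1):
        """Return the n most recently captured events."""
        return [e["event"] for e in self.events[-n:]]

# Usage: capture a stream of sensor readings as they arrive.
store = RealTimeStore()
for reading in ("temp=21.5", "temp=21.7", "temp=21.9"):
    store.capture(reading)
print(store.latest(2))  # → ['temp=21.7', 'temp=21.9']
```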
11. BI Layer
This layer receives the final analysis output and replicates it to the
appropriate output system.
The different types of outputs are for human viewers, applications,
and business processes.
The whole process of delivering Big Data solutions includes ingesting
data from multiple sources, repeated data processing operations, and
presenting the results in a report or a dashboard.
These reports are then used for making data-driven decisions by the
companies.
13. FPD Ingestion System
A Big Data Ingestion System is the first place where data begins its
journey into the data system.
It is the process of importing data and storing it in a database.
• This data can be taken in either batches or real-time streams. Simply
put, a batch is a collection of data points grouped within a specific time
interval, whereas streaming deals with a continuous flow of data.
• Batch data has higher latency than streaming data, whose latency is in
the sub-second range. There are three ways in which ingestion can be
performed –
• Specter – a Java library used for sending data to Kafka.
• Dart Service – a REST service that allows the payload to be sent over
HTTP.
• File Ingestor – a CLI tool used to dump data into HDFS.
15. Batch Compute
This part of the Big Data ecosystem computes and processes data that is
present in batches.
Batch compute is an efficient method for processing large-scale data,
such as transactions collected over a period of time. These batches can
be computed at the end of the day, once the data has been collected in
large volumes, and processed only once.
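An end-of-day batch computation of the kind described above might look like this; the (account, amount) transaction shape is an assumption made for the example:

```python
from collections import defaultdict

def end_of_day_totals(transactions):
    """Aggregate a day's accumulated transactions in one batch pass.

    Each transaction is assumed to be an (account, amount) pair collected
    over the day and processed only once, at the end of it.
    """
    totals = defaultdict(float)
    for account, amount in transactions:
        totals[account] += amount
    return dict(totals)

# Usage: the day's accumulated transactions, computed once at day's end.
day = [("acct-1", 120.0), ("acct-2", 75.5), ("acct-1", -20.0)]
print(end_of_day_totals(day))  # → {'acct-1': 100.0, 'acct-2': 75.5}
```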
16. Streaming Platform
Streaming platforms process data that is generated at sub-second
intervals.
Apache Flink is one of the most popular real-time streaming platforms
used to produce fast-paced analytical results.
It provides distributed, fault-tolerant, and scalable data streaming
capabilities that industries can use to process millions of transactions
with very low latency.
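The core operation such platforms offer is windowing over an unbounded stream. The sketch below shows a tumbling (fixed, non-overlapping) window count in plain Python; it mirrors the tumbling-window operator found in engines such as Flink but uses none of Flink's actual machinery, and the (timestamp, value) event shape is an assumption:

```python
def tumbling_window_counts(events, window_size):
    """Group timestamped events into fixed, non-overlapping windows.

    `events` is an iterable of (timestamp, value) pairs with integer
    timestamps in seconds; returns the event count per window start.
    """
    windows = {}
    for ts, value in events:
        start = ts - (ts % window_size)  # window this event falls into
        windows.setdefault(start, []).append(value)
    return {start: len(vals) for start, vals in sorted(windows.items())}

# Usage: four events over ten seconds, counted in 5-second windows.
stream = [(0, "a"), (1, "b"), (5, "c"), (9, "d")]
print(tumbling_window_counts(stream, 5))  # → {0: 2, 5: 2}
```

A real streaming engine applies the same grouping continuously and in parallel, emitting each window's result as soon as the window closes.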
17. Messaging Queue
A Messaging Queue acts as a buffer or temporary storage system for
messages when the destination is busy or not connected. A message can be
a plain message, a byte array with headers, or a request that instructs
a consumer to perform a task.
There are two components in the Messaging Queue Architecture –
Producer and Consumer. A Producer generates the messages and
delivers them to the messaging queue. A Consumer is the end
destination of the message where the message is processed.
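The producer/consumer pattern described above can be demonstrated with Python's standard-library queue; `queue.Queue` stands in for a real broker here and buffers messages while the consumer (the destination) is busy:

```python
import queue
import threading

# The queue buffers messages between the two components, just as a
# broker would when the destination is busy or disconnected.
message_queue = queue.Queue()
processed = []

def producer():
    """Generate messages and deliver them to the messaging queue."""
    for i in range(3):
        message_queue.put(f"msg-{i}")
    message_queue.put(None)  # sentinel: no more messages

def consumer():
    """Drain the queue and process each message at the destination."""
    while True:
        msg = message_queue.get()
        if msg is None:  # sentinel reached, stop consuming
            break
        processed.append(msg.upper())

t_prod = threading.Thread(target=producer)
t_cons = threading.Thread(target=consumer)
t_prod.start(); t_cons.start()
t_prod.join(); t_cons.join()
print(processed)  # → ['MSG-0', 'MSG-1', 'MSG-2']
```

Because the queue decouples the two sides, the producer can keep delivering even while the consumer lags, which is exactly the buffering role the slide describes.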