3. Learning Objectives
In this session you will learn about:
Big Data Architecture
Connecting and extracting data from storage
Traditional process with a bank use case
Hadoop HDFS solution
How HDFS works
4. In the era of the Internet of Things and mobility, with a huge volume of data becoming
available at high velocity, there is a need for an efficient analytics system.
Data arrives in a variety of formats from various sources, such as sensors, logs, and
structured data from an RDBMS.
In the past few years, the generation of new data has drastically increased.
More applications are being built, and they are generating more data at a faster rate.
Earlier, data storage was costly, and there was no technology that could process the data
efficiently.
Now storage has become much cheaper, and the technology to transform Big Data is a reality.
5. Big Data Architecture & Patterns
A Big Data solution can be well understood using a Layered Architecture. The Layered
Architecture is divided into different layers, where each layer performs a particular function.
6. Data Ingestion Layer
This layer is the first step of the journey for data coming from various sources.
Data here is prioritized and categorized, which lets it flow smoothly through the later layers.
Data Collector Layer
In this layer, the focus is on transporting data from the ingestion layer to the rest of the
data pipeline. It is the layer where components are decoupled so that analytic capabilities
may begin.
Data Processing Layer
In this layer, the focus is on the data pipeline's processing system; in other words, the data
collected in the previous layer is processed in this layer.
7. Data Storage Layer
Storage becomes a challenge when the size of the data you are dealing with
becomes large. Several possible solutions can rescue you from such problems, so finding a
suitable storage solution is very important as your data grows. This layer focuses
on "where to store such large data efficiently."
Data Query Layer
This is the layer where active analytic processing takes place. Here, the primary focus
is to extract value from the data so that it is more useful to the next layer.
Data Visualization Layer
The visualization, or presentation, tier is probably the most prestigious tier, where the
data pipeline's users can feel the VALUE of the DATA. We need something that will grab people's
attention, pull them in, and make the findings well understood.
8. Connecting and extracting data from storage
Data Extraction
Data extraction is a process that involves the retrieval of data from various sources.
For example, you might want to perform calculations on the data (such as aggregating
sales data) and store those results in the data warehouse.
If you are extracting the data to store it in a data warehouse, you might want to add
additional metadata or enrich the data with timestamps or geolocation data.
Finally, you likely want to combine the data with other data in the target data store.
These processes, collectively, are called ETL, or Extraction, Transformation, and Loading.
Extraction is the first key step in this process.
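To make the extract-transform-load flow concrete, here is a minimal sketch in Java. The Sale record, the sample rows, and the print-based "load" step are illustrative stand-ins for a real source system and warehouse, not part of the use case above.

```java
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class EtlSketch {
    // Illustrative record type; a real pipeline would map rows from the source schema.
    record Sale(String region, double amount) {}

    public static void main(String[] args) {
        // Extract: in practice these rows would be read from a source system.
        List<Sale> sales = List.of(
                new Sale("north", 120.0),
                new Sale("north", 80.0),
                new Sale("south", 50.0));

        // Transform: aggregate sales per region and enrich with a load timestamp.
        Map<String, Double> totalsByRegion = sales.stream()
                .collect(Collectors.groupingBy(Sale::region,
                        Collectors.summingDouble(Sale::amount)));
        Instant loadedAt = Instant.now();

        // Load: printed here; a real job would write these rows to the data warehouse.
        totalsByRegion.forEach((region, total) ->
                System.out.println(region + "," + total + "," + loadedAt));
    }
}
```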
9. How Is Data Extracted?
Structured Data
If the data is structured, the data extraction process is generally performed
within the source system. It's common to perform data extraction using one of the following
methods:
Full extraction. Data is completely extracted from the source, and there is no need to
track changes. The logic is simpler, but the system load is greater.
Incremental extraction. Changes in the source data are tracked since the last
successful extraction, so that you do not go through the process of extracting all the data
each time there is a change.
The logic for incremental extraction is more complex, but the system load is reduced
(see the sketch after this list).
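A minimal sketch of incremental extraction using JDBC follows. The orders table, the updated_at column, the connection URL, and the hard-coded watermark are all hypothetical; a real job would persist the watermark after each successful run.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;
import java.time.Instant;

public class IncrementalExtract {
    public static void main(String[] args) throws Exception {
        // Watermark: the time of the last successful extraction (hard-coded for illustration).
        Instant lastRun = Instant.parse("2024-01-01T00:00:00Z");

        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/sales", "etl_user", "secret");
             PreparedStatement stmt = conn.prepareStatement(
                "SELECT id, amount, updated_at FROM orders WHERE updated_at > ?")) {
            stmt.setTimestamp(1, Timestamp.from(lastRun));
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    // Only rows changed since the last run are pulled; write them to staging here.
                    System.out.println(rs.getLong("id") + "," + rs.getBigDecimal("amount"));
                }
            }
        }
    }
}
```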
10. Unstructured Data
When you work with unstructured data, a large part of your task is to prepare the data in such
a way that it can be extracted.
You'll probably want to clean up "noise" from your data by doing things like removing
whitespace and symbols, removing duplicate results, and determining how to handle
missing values.
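As a rough illustration of that cleanup step, the sketch below processes a small batch of free-text records. The sample strings and the choice to simply drop empty values are assumptions made for the example.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class CleanRecords {
    public static void main(String[] args) {
        // Sample "noisy" records; real input would come from logs, scraped pages, etc.
        List<String> raw = List.of("  Alice#  ", "Bob", "Bob", "   ", "Carol!! ");

        Set<String> cleaned = new LinkedHashSet<>(); // a set removes duplicate results
        for (String record : raw) {
            String value = record.trim()                            // remove surrounding whitespace
                                 .replaceAll("[^\\p{Alnum} ]", "");  // strip symbols
            if (value.isEmpty()) {
                continue; // one way to handle missing values: drop them
            }
            cleaned.add(value);
        }
        cleaned.forEach(System.out::println); // Alice, Bob, Carol
    }
}
```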
14. Service at Home
Example: ICICI Bank use case
[Diagram: data sources in different formats, including a call log file (text), core banking data, CRM data, and the bank's Facebook page, along with JSON and XML feeds, flow through ETL into a data warehouse that a BI tool sits on top of; the diagram also carries a "no public access" note.]
15. Drawbacks –
Expensive system
Data is available in different places and in different formats
Does not provide scalability
Time consuming
Runs on a single machine, so there is a limit to the data that can be pulled into the data warehouse
No action happens in real time
16. Hadoop
Apache Hadoop is one of the most powerful Big Data tools.
The Hadoop ecosystem revolves around three main components:
• HDFS
• MapReduce
• YARN
Apart from these core components, there are other Hadoop ecosystem components
that also play an important role in boosting Hadoop's functionality.
18. Hadoop arrived in 2005.
It was created by Doug Cutting and Mike Cafarella, who took the idea from Google.
Google was already doing a lot of distributed computing.
They then took the project to Apache, which is open source,
so Apache Hadoop is open-source technology.
But when something is completely free, it often comes with drawbacks.
Example: Android vs. iPhone.
Hadoop is a platform, not a single piece of software.
Cloudera was the first company to create a commercial distribution of Hadoop and related tools.
The Hadoop available in Cloudera is the same as Apache Hadoop, but Cloudera provides full
support for installation and bug fixes as a paid service.
19. Other companies include:
IBM
MapR
Microsoft
Hadoop is a batch processing system; it does not work in real time.
A single machine can only store so much data.
The motherboard decides how many GB of RAM it can support.
External storage
Network storage
SAN (Storage Area Network): effectively unlimited storage, but no processing
20. HDFS (Hadoop Distributed File System)
A scalable distributed file system for applications dealing with large data sets.
Distributed: runs in a cluster.
Scalable: 10K nodes, 100K files, 10 PB of storage.
Storage space is seamless across the whole cluster.
Files are broken into blocks.
Typical block size: 128 MB.
Replication: each block is copied to multiple DataNodes.
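To show how an application connects to HDFS and reads a file, here is a minimal sketch using the standard Hadoop FileSystem API. The NameNode address and the file path are placeholders; in a real deployment fs.defaultFS would normally come from core-site.xml.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; normally picked up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(fs.open(new Path("/data/example.txt"))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // the client sees one seamless file, not individual blocks
            }
        }
    }
}
```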
21. Example: a 4-machine Hadoop cluster, so total storage = 6 TB.
Master machine: NameNode (Hadoop is installed here)
Slave machine: DataNode 1, 2 TB
Slave machine: DataNode 2, 2 TB
Slave machine: DataNode 3, 2 TB
22. You can add many slave machines to a Hadoop cluster; this is called the scaling-out concept.
A cluster means a group of machines.
A 50-machine cluster where each node provides 256 GB of RAM and 100 TB of storage gives
50 × 100 TB = 5 PB of storage.
In Hadoop, all machines are commodity hardware:
Assembled servers
They will also crash.
Cheaper compared to enterprise servers.
27. Blocks
As we know, the data in HDFS is scattered across the DataNodes as blocks.
Let's have a look at what a block is and how it is formed.
Blocks are nothing but the smallest contiguous locations on your hard drive where data is
stored.
In general, in any file system, you store data as a collection of blocks.
Similarly, HDFS stores each file as blocks, which are scattered throughout the Apache
Hadoop cluster.
The default size of each block is 128 MB in Apache Hadoop 2.x (64 MB in Apache Hadoop
1.x), which you can configure as per your requirement.
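Since the block size is configurable, the sketch below shows one way a client could override it when writing a file. The 256 MB value and the output path are only illustrative; cluster-wide defaults are normally set via dfs.blocksize in hdfs-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CustomBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Override the default 128 MB block size for files created by this client (illustrative).
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024);

        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/data/large-output.txt"))) {
            out.writeUTF("written with a 256 MB block size");
        }
    }
}
```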
28. Let's take an example where I have a file "example.txt" of size 514 MB, as shown in the
figure above.
Suppose we are using the default block size of 128 MB. Then how many blocks will be
created? Five. The first four blocks will be 128 MB each, but the last block will be only
2 MB.
It is not necessary that each file in HDFS is stored in an exact multiple of the configured
block size (128 MB, 256 MB, etc.); the last block only occupies as much space as the
remaining data needs.
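The arithmetic behind the 514 MB example can be checked with a few lines of Java; the file size and block size are the ones from the slide.

```java
public class BlockCount {
    public static void main(String[] args) {
        long fileSizeMb = 514;   // size of example.txt
        long blockSizeMb = 128;  // default block size in Apache Hadoop 2.x

        long fullBlocks = fileSizeMb / blockSizeMb;         // 4 blocks of 128 MB
        long lastBlockMb = fileSizeMb % blockSizeMb;        // 2 MB left over
        long totalBlocks = fullBlocks + (lastBlockMb > 0 ? 1 : 0);

        System.out.println("Total blocks: " + totalBlocks);            // 5
        System.out.println("Last block size: " + lastBlockMb + " MB"); // 2 MB
    }
}
```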