Faculty Name: PALLAVI BAGDE
Year/Branch: 3rd/CSE
Subject Code: CS-503(A)
Subject Name: Data Analytics
Learning Objectives
In this session you will learn about:
Big Data architecture
Connecting to and extracting data from storage
The traditional process, with a bank use case
The Hadoop HDFS solution
How HDFS works
In the era of the Internet of Things and mobility, with huge volumes of data
arriving at high velocity, there is a clear need for an efficient analytics system.
This variety of data comes from many sources and in many formats: sensors, logs,
structured data from an RDBMS, and so on.
In the past few years, the generation of new data has increased drastically.
More applications are being built, and they generate data at ever faster rates.
Data storage used to be costly, and there was no technology that could process
the data efficiently.
Now storage has become cheap, and the technology to process Big Data is readily
available.
Big Data Architecture & Patterns
A Big Data solution is best understood as a layered architecture, divided into
layers that each perform a particular function.
Data Ingestion Layer
This layer is the first step in the journey of data arriving from various sources.
Data is prioritized and categorized here, which lets it flow smoothly through the later layers.
Data Collector Layer
This layer focuses on transporting data from the ingestion layer to the rest of
the data pipeline. Its components are decoupled so that analytic capabilities
can begin.
Data Processing Layer
This layer is dedicated to the data pipeline's processing system: the data
collected in the previous layer is processed here.
Data Storage Layer
Storage becomes a challenge when the data you are dealing with grows large.
Several solutions can address this, and choosing one matters precisely at that
scale. This layer focuses on where to store such large data efficiently.
Data Query Layer
This is the layer where active analytic processing takes place. The primary focus
here is to extract value from the data so that it is more useful to the next layer.
Data Visualization Layer
The visualization, or presentation, tier is probably the most visible tier: it is
where users of the data pipeline experience the VALUE of DATA. It needs something
that grabs people's attention, pulls them in, and makes the findings well understood.
Connecting and extracting data from storage
Data extraction is a process that involves the retrieval of data from various sources.
Data Extraction
For example, you might want to perform calculations on the data — such as aggregating
sales data — and store those results in the data warehouse.
If you are extracting the data to store it in a data warehouse, you might want to
add extra metadata or enrich the data with timestamps or geolocation data.
Finally, you likely want to combine the data with other data in the target data store.
These processes, collectively, are called ETL, or Extraction, Transformation, and Loading.
Extraction is the first key step in this process.
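The ETL flow described above can be sketched in a few lines. This is a minimal, illustrative sketch only: the source rows, the per-region aggregation, and the dict standing in for the warehouse are all hypothetical stand-ins, not a real ETL tool.

```python
# Minimal ETL sketch: extract rows, aggregate sales per region, load results.
from collections import defaultdict

def extract():
    # Pretend these rows came from a source system.
    return [
        {"region": "north", "sales": 120.0},
        {"region": "south", "sales": 80.0},
        {"region": "north", "sales": 50.0},
    ]

def transform(rows):
    # Transformation step: aggregate sales per region.
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["sales"]
    return dict(totals)

def load(totals, warehouse):
    # Loading step: a dict stands in for the data warehouse.
    warehouse.update(totals)

warehouse = {}
load(transform(extract()), warehouse)
print(warehouse)  # {'north': 170.0, 'south': 80.0}
```

Real pipelines replace each of these three functions with connectors to actual source systems and warehouses, but the shape stays the same.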
How Is Data Extracted?
Structured Data
If the data is structured, the extraction process is generally performed within
the source system. It is common to use one of the following methods:
Full extraction. Data is completely extracted from the source, and there is no
need to track changes. The logic is simpler, but the system load is greater.
Incremental extraction. Changes in the source data are tracked since the last
successful extraction, so that you do not have to extract all the data each time
something changes. The logic for incremental extraction is more complex, but the
system load is reduced.
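The difference between full and incremental extraction can be sketched with a hypothetical in-memory source; the rows and their `updated_at` timestamps are made up for illustration.

```python
# Full vs. incremental extraction against a toy in-memory "source table".
source = [
    {"id": 1, "updated_at": "2024-01-01"},
    {"id": 2, "updated_at": "2024-03-15"},
    {"id": 3, "updated_at": "2024-06-01"},
]

def full_extract():
    # Full extraction: pull everything, no change tracking needed.
    return list(source)

def incremental_extract(last_run):
    # Incremental extraction: only rows changed since the last successful run.
    # ISO date strings compare correctly as plain strings.
    return [row for row in source if row["updated_at"] > last_run]

changed = incremental_extract("2024-02-01")
print([row["id"] for row in changed])  # [2, 3]
```

The incremental variant needs somewhere to persist `last_run` between jobs, which is exactly the extra complexity the text mentions.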
Unstructured Data
When you work with unstructured data, a large part of your task is to prepare the data in such
a way that it can be extracted.
You'll probably want to clean up "noise" from your data by doing things like removing
whitespace and symbols, removing duplicate results, and determining how to handle
missing values.
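The cleanup steps above (stripping symbols and whitespace, dropping duplicates, deciding what to do with empty results) can be sketched like this; the rules are a simple illustration, not a complete text-cleaning pipeline.

```python
import re

def clean(texts):
    # Remove symbols and extra whitespace, drop duplicates, skip empties.
    seen, out = set(), []
    for t in texts:
        t = re.sub(r"[^\w\s]", "", t)       # strip punctuation/symbols
        t = re.sub(r"\s+", " ", t).strip()  # collapse whitespace
        if not t:
            continue           # treat empty results as missing values
        if t.lower() in seen:  # drop duplicate results
            continue
        seen.add(t.lower())
        out.append(t)
    return out

print(clean(["  Hello,  world! ", "hello world", "", "###", "new entry"]))
# ['Hello world', 'new entry']
```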
Comparisons of Storage Media
Traditional Approach of Storing and Processing Big data
As data grows, storing and processing it becomes a challenging task.
Example: ICICI bank, "service at home" use case.
Data comes from many sources and in many formats: call-log files (plain text),
core banking data (JSON), CRM data (XML), and the bank's Facebook page. In the
traditional approach, all of these feed through ETL into a data warehouse, which
BI tools query; the warehouse itself has no public access.
Drawbacks –
 Expensive system
 Data is spread across different places and stored in different formats
 Does not provide scalability
 Time consuming
 Runs on a single machine, so there is a limit to how much data can be pulled into the data warehouse
 None of the actions happen in real time
Hadoop
 Apache Hadoop is one of the most powerful tools for Big Data.
 The Hadoop ecosystem revolves around three main components:
• HDFS
• MapReduce
• YARN
Apart from these core components, there are other ecosystem components that play
an important role in extending Hadoop's functionality.
Hadoop appeared in 2005, created by Doug Cutting and Mike Cafarella, who took
their ideas from Google; Google was already doing a great deal of distributed
computing. The project then moved under Apache, so Apache Hadoop is an
open-source technology.
But when something is completely free, it often comes with drawbacks.
Example: Android vs. iPhone.
Hadoop is a platform, not a single piece of software.
Cloudera was the first company to create a commercial distribution of Hadoop and
related tools. The Hadoop in Cloudera is the same as Apache Hadoop, but Cloudera
provides full support for installation and bug fixes as a paid service.
Other companies offering Hadoop distributions include IBM, MapR, and Microsoft.
Hadoop is a batch-processing system; it does not work in real time.
Consider a single machine: how much data can it store? The motherboard decides
how many GB of RAM it can support. Beyond that come external storage, network
storage, and a SAN (storage area network), which offers very large storage but
no processing.
HDFS (Hadoop Distributed File System)
A scalable distributed file system for applications dealing with large data sets.
Distributed: runs on a cluster.
Scalable: on the order of 10K nodes, 100K files, and 10 PB of storage.
Storage space is seamless across the whole cluster.
 Files are broken into blocks.
 Typical block size: 128 MB.
Replication: each block is copied to multiple DataNodes.
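The replication idea can be sketched with a toy placement function. This is purely illustrative: real HDFS placement is rack-aware and handled by the NameNode, while the round-robin assignment and DataNode names below are invented for the example.

```python
from itertools import cycle

def place_blocks(num_blocks, datanodes, replication=2):
    # Toy placement: assign each block's replicas to DataNodes round-robin,
    # so every block ends up stored on several different nodes.
    nodes = cycle(datanodes)
    return {b: [next(nodes) for _ in range(replication)]
            for b in range(num_blocks)}

layout = place_blocks(4, ["dn1", "dn2", "dn3"], replication=2)
print(layout)
# {0: ['dn1', 'dn2'], 1: ['dn3', 'dn1'], 2: ['dn2', 'dn3'], 3: ['dn1', 'dn2']}
```

The point of the exercise: if any one DataNode fails, every block still has a surviving copy elsewhere.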
Example: a 4-machine Hadoop cluster, with one master machine running the Name
Node (this is where Hadoop is installed) and three slave machines running Data
Nodes 1-3, each contributing 2 TB.
So total storage = 3 × 2 TB = 6 TB.
You can add many slave machines to a Hadoop cluster; this is called scaling out.
A cluster simply means a group of machines.
Example: a 50-machine cluster where each node provides 256 GB of RAM and 100 TB
of storage gives 50 × 100 TB = 5,000 TB = 5 PB of storage.
In Hadoop, all the machines are commodity hardware:
 Assembled servers
 They will crash from time to time
 Cheaper compared to high-end servers
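The scaling-out arithmetic can be checked in a couple of lines; the machine count and per-node storage are just the example's numbers.

```python
def cluster_capacity_tb(machines, tb_per_machine):
    # Total raw capacity grows linearly as you scale out.
    return machines * tb_per_machine

total_tb = cluster_capacity_tb(50, 100)
print(total_tb, "TB =", total_tb / 1000, "PB")  # 5000 TB = 5.0 PB
```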
Blocks
As we know, data in HDFS is scattered across the DataNodes as blocks. Let's look
at what a block is and how it is formed.
A block is nothing but the smallest contiguous location on your hard drive where
data is stored.
In general, any file system stores data as a collection of blocks. Similarly,
HDFS stores each file as blocks, which are scattered throughout the Apache
Hadoop cluster.
The default size of each block is 128 MB in Apache Hadoop 2.x (64 MB in Apache
Hadoop 1.x), and you can configure it as per your requirements.
Let's take an example: a file "example.txt" of size 514 MB, as shown in the
figure above.
Suppose we are using the default block size of 128 MB. How many blocks will be
created? Five: the first four blocks will be 128 MB each, but the last block
will be only 2 MB.
So in HDFS a file need not occupy an exact multiple of the configured block size
(128 MB, 256 MB, etc.).
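The 514 MB example generalizes to a small helper that lists the block sizes for any file; the function is an illustration of the arithmetic, not an HDFS API.

```python
def split_into_blocks(file_mb, block_mb=128):
    # Full 128 MB blocks, plus one final partial block for the remainder.
    sizes = [block_mb] * (file_mb // block_mb)
    if file_mb % block_mb:
        sizes.append(file_mb % block_mb)
    return sizes

print(split_into_blocks(514))  # [128, 128, 128, 128, 2]
```

Note that a file whose size is an exact multiple of the block size (say 512 MB) produces only full blocks, with no partial block at the end.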
Thanks!
