Faculty Name: PALLAVI BAGDE
Year/Branch: 3rd/CSE
Subject Code: CS-503(A)
Subject Name: Data Analytics
Learning Objectives
In this session you will learn about:
Big Data architecture
Connecting to and extracting data from storage
Traditional process with a bank use case
Hadoop-HDFS solution
HDFS working
In the era of the Internet of Things and mobility, a huge volume of data is becoming
available at high velocity, which creates the need for an efficient analytics system.
The data also comes in great variety: it arrives from many sources in different formats,
such as sensors, logs, and structured data from an RDBMS.
In the past few years, the generation of new data has increased drastically.
More applications are being built, and they are generating more data at a faster rate.
Earlier, data storage was costly, and there was no technology that could process the
data efficiently.
Now storage has become cheaper, and technology that can transform Big Data is
available.
Big Data Architecture & Patterns
A Big Data solution is easiest to understand as a layered architecture, in which each
layer performs a particular function.
Data Ingestion Layer
This layer is the first step in the journey of data coming from various sources.
Data is prioritized and categorized here, which lets it flow smoothly through the later layers.
Data Collector Layer
This layer focuses on transporting data from the ingestion layer to the rest of the
data pipeline. Components are decoupled here so that analytic processing can begin.
Data Processing Layer
This is the primary layer of the pipeline: the data collected in the previous layer
is processed here.
Data Storage Layer
Storage becomes a challenge when the data you are dealing with grows large, so finding
a suitable storage solution is very important. This layer focuses on “where to store
such large data efficiently.”
Data Query Layer
This is the layer where active analytic processing takes place. The primary focus
is to gather value from the data so that it is more useful to the next layer.
Data Visualization Layer
The visualization, or presentation, tier is probably the most prestigious tier: it is where
users of the data pipeline see the value of the data. It needs something that grabs people’s
attention, pulls them in, and makes your findings well understood.
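The six layers can be pictured as a chain of small functions, each handing its output to the next. The sketch below is only an illustration of that layering: the function names, the in-memory list standing in for storage, and the sample sensor and log records are all assumptions made for the example, not part of any real framework.

```python
from collections import Counter

def ingest(sources):
    # Ingestion: pull raw records from every source and tag each with its origin.
    return [{"source": name, "raw": rec} for name, recs in sources.items() for rec in recs]

def collect(records):
    # Collector: transport records to the rest of the pipeline (here, a plain copy).
    return list(records)

def process(records):
    # Processing: normalise the raw text of each record.
    return [{**r, "text": r["raw"].strip().lower()} for r in records]

def store(records, storage):
    # Storage: append to an in-memory list that stands in for a real data store.
    storage.extend(records)
    return storage

def query(storage):
    # Query: count stored records per source.
    return Counter(r["source"] for r in storage)

def visualize(counts):
    # Visualization: one text bar per source, sorted by source name.
    return [f"{src}: {'#' * n}" for src, n in sorted(counts.items())]

# Hypothetical sample input: two sensor readings and one log line.
sources = {"sensor": [" 21.5C ", "22.0C"], "log": ["ERROR disk full"]}
storage = []
store(process(collect(ingest(sources))), storage)
report = visualize(query(storage))
print("\n".join(report))
```

Because each layer only depends on the output of the one before it, any single layer could be swapped for a real component without touching the others.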
Connecting and extracting data from storage
Data Extraction
Data extraction is the process of retrieving data from various sources.
Extracted data is usually transformed before it is stored. For example, you might want to
perform calculations on the data, such as aggregating sales data, and store those results
in the data warehouse.
If you are extracting the data to store it in a data warehouse, you might want to add
metadata or enrich the data with timestamps or geolocation data.
Finally, you likely want to combine the data with other data in the target data store.
These processes are collectively called ETL: Extraction, Transformation, and Loading.
Extraction is the first key step in this process.
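A minimal sketch of the three ETL steps, assuming an in-memory list as the source and another list as the warehouse. The field names (`region`, `amount`, `loaded_at`) and the sample sales rows are hypothetical and chosen only to mirror the sales-aggregation and timestamp examples above.

```python
from collections import defaultdict

def extract(source_rows):
    # Extraction: read rows from the source (here, an in-memory list).
    return list(source_rows)

def transform(rows, loaded_at):
    # Transformation: aggregate sales per region and enrich with a load timestamp.
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return [
        {"region": region, "total_sales": total, "loaded_at": loaded_at}
        for region, total in sorted(totals.items())
    ]

def load(rows, warehouse):
    # Loading: combine the new rows with what is already in the target store.
    warehouse.extend(rows)
    return warehouse

# Hypothetical source data.
sales = [
    {"region": "north", "amount": 100.0},
    {"region": "south", "amount": 40.0},
    {"region": "north", "amount": 60.0},
]
warehouse = []
load(transform(extract(sales), loaded_at="2024-01-01T00:00:00"), warehouse)
print(warehouse)
```

The timestamp is passed in as a parameter rather than read from the clock so that the run is repeatable.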
How Is Data Extracted?
Structured Data
If the data is structured, extraction is generally performed within the source system,
commonly using one of the following methods:
Full extraction. Data is completely extracted from the source, and there is no need to
track changes. The logic is simpler, but the system load is greater.
Incremental extraction. Changes in the source data since the last successful extraction
are tracked, so you do not have to extract all the data each time something changes.
The logic is more complex, but the system load is reduced.
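The difference between full and incremental extraction can be sketched with Python's built-in `sqlite3` module. The `orders` table, its `updated_at` column, and the "watermark" variable used to remember the last successful extraction are assumptions made for this example; real source systems track changes in various ways.

```python
import sqlite3

def full_extract(conn):
    # Full extraction: read every row; no change tracking is needed.
    return conn.execute(
        "SELECT id, amount, updated_at FROM orders ORDER BY id"
    ).fetchall()

def incremental_extract(conn, last_watermark):
    # Incremental extraction: read only rows changed since the last successful run.
    rows = conn.execute(
        "SELECT id, amount, updated_at FROM orders WHERE updated_at > ? ORDER BY id",
        (last_watermark,),
    ).fetchall()
    # The new watermark is the latest change seen; keep the old one if nothing changed.
    new_watermark = max((r[2] for r in rows), default=last_watermark)
    return rows, new_watermark

# Hypothetical source table with ISO dates, which compare correctly as text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "2024-01-01"), (2, 20.0, "2024-01-02"), (3, 30.0, "2024-01-03")],
)

all_rows = full_extract(conn)
changed, watermark = incremental_extract(conn, "2024-01-01")
print(len(all_rows), len(changed), watermark)
```

The extra bookkeeping around the watermark is the "more complex logic" of the incremental method; the smaller result set is the reduced load.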
Unstructured Data
When you work with unstructured data, a large part of your task is to prepare the data in such
a way that it can be extracted.
You'll probably want to clean up "noise" from your data by doing things like removing
whitespace and symbols, removing duplicate results, and determining how to handle
missing values.
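A minimal sketch of the cleaning steps just listed, assuming the unstructured input is a list of text snippets. The placeholder value `"unknown"` for missing entries is one possible policy among many and is an assumption of this example.

```python
import re

def clean(records, missing="unknown"):
    cleaned, seen = [], set()
    for rec in records:
        if rec is None or not rec.strip():
            # Missing value: replace with a placeholder (one possible policy).
            text = missing
        else:
            text = re.sub(r"[^\w\s]", "", rec)                 # remove symbols
            text = re.sub(r"\s+", " ", text).strip().lower()   # collapse whitespace
            text = text or missing                             # only symbols were present
        if text not in seen:                                   # remove duplicate results
            seen.add(text)
            cleaned.append(text)
    return cleaned

# Hypothetical noisy input.
raw = ["  Hello,   World!! ", "hello world", None, "", "Big   Data #1"]
result = clean(raw)
print(result)
```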
Comparisons of Storage Media
Traditional Approach of Storing and Processing Big data
As data grows, storing and processing it becomes a challenging task.
Example: ICICI bank use case
[Figure: data sources (call log file, text file, core banking data, JSON file, XML, CRM data, Facebook page) pass through ETL into a data warehouse, which feeds BI. Other labels: "Service at HOME", "No public access".]
Drawbacks:
• Expensive system
• Data is spread across different places and in different formats
• Does not scale
• Time-consuming
• Runs on a single machine, so there is a limit on how much data can be pulled into the data warehouse
• Nothing happens in real time
Hadoop
• Apache Hadoop is one of the most widely used Big Data frameworks.
• The Hadoop ecosystem revolves around three main components:
• HDFS
• MapReduce
• YARN
Apart from these, other Hadoop ecosystem components also play an important
role in extending Hadoop's functionality.
Hadoop appeared in 2005.
It was created by Doug Cutting and Mike Cafarella, who took the idea from Google.
Google was already doing a lot of distributed computing.
They then worked with Apache, which is open source,
so Apache Hadoop is an open-source technology.
But when something is completely free, it often comes with drawbacks.
Example: Android vs. iPhone
Hadoop is a platform, not a single piece of software.
Cloudera was the first company to create a commercial distribution of Hadoop and related tools.
The Hadoop in Cloudera's distribution is the same as Apache Hadoop, but Cloudera adds full
support for installation and bug fixes as a paid service.
Other vendors include:
IBM
MapR
Microsoft
Hadoop is a batch processing system; it does not work in real time.
Suppose I have a single machine. How much data can it store?
The motherboard decides how many GB of RAM it can support.
External storage
Network storage
SAN (storage area network): unlimited storage but no processing
HDFS (Hadoop Distributed File System)
A scalable distributed file system for applications dealing with large data sets.
Distributed: runs in a cluster.
Scalable: 10K nodes, 100K files, 10 PB storage.
Storage space is seamless for the whole cluster.
• Files are broken into blocks.
• Typical block size: 128 MB.
Replication: each block is copied to multiple data nodes.
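To make "each block copied to multiple data nodes" concrete, here is a toy placement sketch. It assigns replicas round-robin across the nodes, which is an assumption made purely for illustration: real HDFS uses its own rack-aware placement policy, and the node names and the replication factor of 2 in the usage example are hypothetical.

```python
import math

def place_blocks(file_size_mb, data_nodes, block_size_mb=128, replication=3):
    # Toy round-robin placement, NOT the real HDFS policy (which is rack-aware).
    if replication > len(data_nodes):
        raise ValueError("replication factor cannot exceed the number of data nodes")
    n_blocks = math.ceil(file_size_mb / block_size_mb)
    placement = {}
    for b in range(n_blocks):
        # Replicas of one block land on consecutive, distinct nodes.
        placement[b] = [data_nodes[(b + r) % len(data_nodes)] for r in range(replication)]
    return placement

# Hypothetical cluster of four data nodes and a 300 MB file.
nodes = ["dn1", "dn2", "dn3", "dn4"]
placement = place_blocks(300, nodes, replication=2)
for block, replicas in placement.items():
    print(block, replicas)
```

Because every block lives on more than one node, losing a single node does not lose any block.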
Example: we have a 4-machine Hadoop cluster.
Master machine: Name Node (Hadoop installed here)
Slave machine: Data Node 1, 2 TB
Slave machine: Data Node 2, 2 TB
Slave machine: Data Node 3, 2 TB
So total storage = 6 TB
You can add many slave machines to a Hadoop cluster; this is called scaling out.
A cluster is a group of machines.
50-machine cluster: each node provides 256 GB RAM + 100 TB storage, so 50 × 100 TB = 5,000 TB = 5 PB of storage.
In Hadoop, all machines are commodity hardware:
Assembled servers
They will crash too.
Cheaper than conventional servers.
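The storage arithmetic for both cluster examples can be checked with a few lines. The sketch assumes decimal units (1 PB = 1,000 TB). The replication factor of 3 used for the "usable" figure is the HDFS default rather than a number given in these slides, so treat that part as an assumption.

```python
def cluster_capacity_tb(nodes, tb_per_node):
    # Raw capacity: every data node contributes its full disk space.
    return nodes * tb_per_node

def usable_capacity_tb(raw_tb, replication=3):
    # Each block is stored `replication` times, so usable space shrinks accordingly.
    # replication=3 is the HDFS default, assumed here; it is not stated in the slides.
    return raw_tb / replication

small = cluster_capacity_tb(3, 2)       # three data nodes with 2 TB each
large = cluster_capacity_tb(50, 100)    # fifty nodes with 100 TB each
print(small, "TB raw")
print(large, "TB raw =", large / 1000, "PB")
print(round(usable_capacity_tb(large) / 1000, 2), "PB usable at replication 3")
```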
Blocks
As we know, the data in HDFS is scattered across the DataNodes as blocks.
Let’s look at what a block is and how it is formed.
A block is simply the smallest contiguous location on your hard drive where data is
stored.
In general, any file system stores data as a collection of blocks.
Similarly, HDFS stores each file as blocks, which are scattered throughout the Apache
Hadoop cluster.
The default size of each block is 128 MB in Apache Hadoop 2.x (64 MB in Apache Hadoop
1.x), which you can configure as per your requirement.
Let’s take an example where I have a file “example.txt” of size 514 MB, as shown in the
figure above.
Suppose we are using the default block size of 128 MB. How many blocks will be created?
Five: the first four blocks will be 128 MB each, but the last block will be only 2 MB.
In HDFS, a file does not need to be stored in an exact multiple of the configured
block size (128 MB, 256 MB, etc.); the last block takes only the space it needs.
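The block count for the 514 MB example can be reproduced with a short function. This is only the arithmetic of splitting a file size into blocks, with sizes in whole megabytes; the function name is hypothetical and nothing here calls HDFS itself.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    # Full blocks first; the remainder, if any, becomes a smaller last block.
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    blocks = [block_size_mb] * full_blocks
    if remainder:
        blocks.append(remainder)
    return blocks

blocks = split_into_blocks(514)
print(len(blocks), blocks)
```

A file whose size is an exact multiple of the block size, such as 256 MB, produces only full blocks and no short last block.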
Thanks!
