Big Data Analytics
Dr. Danish Mahmood
Big Data Platforms
• Hadoop
• Architecture
• Storage
• Resource Negotiator (YARN)
• Computations
• Ecosystems
• HBASE
• HIVE
• ZOOKEEPER
• MOSES
• … Etc.
• Spark
• Architecture
• Concept of RDDs
• Spark Streaming
• Spark MLlib
• Spark SQL
• Eco systems
In this presentation:
Hadoop Architecture, Storage,
Resource Negotiator (YARN), and
Computations
Yet to come
Introduction
• In the “distributed data” world, the terms Spark, Hadoop, and Kafka should sound familiar.
• However, with numerous big data solutions available, it may be unclear exactly what they are, how they differ, and when each is the better choice.
• This lecture identifies what kinds of applications, such as machine learning, distributed streaming, and data storage, you can expect to make effective and efficient by using Hadoop, Spark, and Kafka.
What is Hadoop?
Hadoop is an open-source software framework that stores massive amounts of data and coordinates large numbers of commodity-grade computers to tackle tasks that are too large for a single computer to process on its own.
Store and compute:
Hadoop can be used to write software that stores data or runs computations across hundreds or thousands of machines without needing to know the details of what each machine can do or how the machines communicate.
Error Handling:
Hardware and software failures are expected and are handled within the framework itself, which significantly reduces the amount of error handling necessary within your solution.
Key components
At its most basic level of functionality, Hadoop includes the common utilities shared by its modules, the Hadoop Distributed File System (HDFS), YARN (Yet Another Resource Negotiator), and its implementation of MapReduce.
The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets
across clusters of computers using simple programming models. It is designed to scale up from single servers to
thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver
high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a
highly-available service on top of a cluster of computers, each of which may be prone to failures.
The project includes these modules:
• Hadoop Common: The common utilities that support the other Hadoop modules.
• Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
• Hadoop YARN: A framework for job scheduling and cluster resource management.
• Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
Hadoop-related Apache Projects
• Ambari™: A web-based tool for provisioning, managing, and monitoring Hadoop clusters. It also
provides a dashboard for viewing cluster health and ability to view MapReduce, Pig and Hive
applications visually.
• Avro™: A data serialization system.
• Cassandra™: A scalable multi-master database with no single points of failure.
• Chukwa™: A data collection system for managing large distributed systems.
• HBase™: A scalable, distributed database that supports structured data storage for large tables.
• Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
• Mahout™: A scalable machine learning and data mining library.
Hadoop-related Apache Projects
• Pig™: A high-level data-flow language and execution framework for parallel computation.
• Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and
expressive programming model that supports a wide range of applications, including ETL,
machine learning, stream processing, and graph computation.
• Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a
powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch
and interactive use-cases.
• ZooKeeper™: A high-performance coordination service for distributed applications.
Distinctive Layers of Hadoop: YARN (diagram)
Distinctive Layers of Hadoop: HDFS (diagram)
Common Use Cases for Big Data in Hadoop
• Log Data Analysis
most common; fits the HDFS scenario perfectly: write once, read often.
• Data Warehouse Modernization
• Fraud Detection
• Risk Modeling
• Social Sentiment Analysis
• Image Classification
• Graph Analysis
• Beyond
Data Storage Operations on HDFS
• Hadoop is designed to work best with a modest number of
extremely large files.
• Average file sizes ➔ larger than 500MB.
• Write Once, Read Often model.
• Content of individual files cannot be modified, other than
appending new data at the end of the file.
What we can do:
• Create a new file
• Append content to the end of a file
• Delete a file
• Rename a file
• Modify file attributes like owner
HDFS Daemons
NameNode
Keeps the metadata of all files/blocks in the file system, and tracks where across the cluster the file
data is kept.
It does not store the data of these files itself; it acts as a block lookup dictionary (an index or address book of blocks).
Client applications talk to the NameNode whenever they wish to locate a file, or when they want to
add/copy/move/delete a file.
The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives.
HDFS Daemons
DataNode
A DataNode stores data in the Hadoop file system.
A functional filesystem has more than one DataNode, with data replicated across them.
On startup, a DataNode connects to the NameNode and waits until that service comes up.
It then responds to requests from the NameNode for filesystem operations.
Client applications can talk directly to a DataNode once the NameNode has provided the location of the data.
HDFS Daemons
Secondary NameNode
Not a failover NameNode
The only purpose of the Secondary NameNode is to perform periodic checkpoints. The Secondary NameNode periodically downloads the current NameNode image and edit log files, merges them into a new image, and uploads the new image back to the (primary and only) NameNode.
Default checkpoint time is one hour. It can be set to one minute on highly busy clusters where lots
of write operations are being performed.
HDFS blocks
• A file is divided into blocks (default 64 MB; 128 MB in Hadoop 2.0) and each block is duplicated in multiple places.
• Dividing files into blocks is normal for a file system; e.g., the default block size in Linux is 4 KB.
• What sets HDFS apart is the scale: Hadoop was designed to operate at the petabyte scale.
• Every data block stored in HDFS has its own metadata and needs to be tracked by a central server (the NameNode).
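A quick sketch of the block arithmetic, assuming the Hadoop 2.0 default block size of 128 MB (the function name is illustrative only):

```python
import math

def num_blocks(file_size_mb, block_size_mb=128):
    """Number of HDFS blocks needed to store a file of the given size."""
    return math.ceil(file_size_mb / block_size_mb)

print(num_blocks(500))                     # 4 blocks; the last block is only partially filled
print(num_blocks(500, block_size_mb=64))   # 8 blocks with the older 64 MB default
```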
HDFS Blocks
When HDFS stores the replicas of the original blocks across the Hadoop cluster, it tries to ensure that the block replicas are stored at different points of failure (different nodes and racks).
Name Node
Data Node
• Data Replication:
• HDFS is designed to handle large-scale data in a distributed environment
• Hardware or software failures and network partitions do occur
• Replication is therefore needed for fault tolerance
• Replica placement:
• Replicating every block to all machines would take far too long
• An approximate solution: keep only 3 replicas (see the sketch after this list)
One replica resides on the current node
One replica resides on another node in the same rack
One replica resides on a node in another rack
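A toy sketch of the 3-replica placement described above. The cluster map and node names are purely illustrative assumptions, not part of HDFS itself; the real NameNode applies this policy internally.

```python
import random

# Hypothetical cluster layout: rack name -> list of node names (illustrative only)
CLUSTER = {
    "rack1": ["node11", "node12", "node13"],
    "rack2": ["node21", "node22", "node23"],
}

def place_replicas(writer_node, writer_rack, cluster=CLUSTER):
    """Toy version of the 3-replica placement policy described above."""
    first = writer_node                                    # replica 1: the writer's own node
    second = random.choice(                                # replica 2: another node, same rack
        [n for n in cluster[writer_rack] if n != writer_node])
    other_rack = random.choice([r for r in cluster if r != writer_rack])
    third = random.choice(cluster[other_rack])             # replica 3: a node in a different rack
    return [first, second, third]

print(place_replicas("node11", "rack1"))   # e.g. ['node11', 'node13', 'node22']
```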
Data Replication: Rack Awareness (diagram)
Data Replication: Re-replicating Missing Replicas
• Missing heartbeats signify lost nodes
• The NameNode consults its metadata and finds the affected data
• The NameNode consults the rack awareness script
• The NameNode tells a DataNode to re-replicate
Name Node Failure
• The NameNode is the single point of failure in the cluster (Hadoop 1.0)
• If the NameNode is down due to a software glitch, restart the machine
• If the original NameNode can be restored, the secondary can re-establish the most recent metadata snapshot
• If the machine does not come up, the metadata for the cluster is irretrievable. In this situation, create a new NameNode, use the secondary to copy metadata to the new primary, and restart the whole cluster
• Before Hadoop 2.0, the NameNode was a single point of failure and an operational limitation.
• Before Hadoop 2, Hadoop clusters usually could not scale beyond 3,000 or 4,000 nodes.
• Multiple NameNodes can be used in Hadoop 2.x (HDFS High Availability feature: one is in an Active state, the other in a Standby state).
High Availability of the NameNodes
Standby NameNode – keeps the state of the block locations and block metadata in memory.
JournalNode – if a failure occurs, the Standby NameNode reads all completed journal entries to ensure that the new Active NameNode is fully consistent with the state of the cluster.
ZooKeeper – provides coordination and configuration services for distributed systems.
Several useful commands for HDFS
All hadoop commands are invoked by the bin/hadoop script.
% hadoop fsck / -files -blocks
➔ lists the blocks that make up each file in HDFS.
For HDFS, the scheme name is hdfs, and for the local file system, the scheme name is file.
A file or directory in HDFS can be specified in a fully qualified way, such as:
hdfs://namenodehost/parent/child or hdfs://namenodehost
The HDFS file system shell is similar to the Linux file commands, with the following general
syntax: hdfs dfs -file_cmd
For instance, mkdir runs as:
$ hdfs dfs -mkdir /user/directory_name
Several useful commands for HDFS
Calculating HDFS nodes storage
• Key players in computing HDFS node storage
• H = HDFS Storage size
• C = Compression ratio. It depends on the type of compression used and size of the data. When no
compression is used, C=1.
• R = Replication factor. It is usually 3 in a production cluster.
• S = Initial size of data need to be moved to Hadoop. This could be a combination of historical data and
incremental data.
• i = intermediate data factor. It is usually 1/3 or 1/4. This is Hadoop's intermediate working space, dedicated to storing intermediate results of Map tasks or any temporary storage used in Pig or Hive. This is a common guideline for many production applications; Cloudera, for example, has recommended 25% for intermediate results.
Calculating Initial Data
• This could be a combination of historical data and incremental data.
• In this, we need to consider the growth rate of the initial data as well, at least for the next 3-6 months.
• For example, if we have 500 TB of data now, 50 TB is expected to be ingested in the next three months, and output files from MR jobs may add at least 10 % of the initial data, then we need to consider 600 TB as the initial data size.
• i.e., 500 TB + 50 TB + 500*10/100 = 600 TB initial size
• Now, if each node provides 8 TB of storage, how many nodes will be needed? Number of data nodes (n): n = H/d (d = disk space available per node) = 600/8 = 75, without considering replication and intermediate data factors or any compression techniques that may be employed (see the sketch after this list).
• Question: Is it feasible to use 100% disk space?
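A small sketch restating the arithmetic above; the variable names are illustrative only:

```python
historical = 500                 # TB of data on hand today
incremental = 50                 # TB expected over the next three months
mr_output = historical * 0.10    # MR job output, ~10 % of the initial data

initial_size = historical + incremental + mr_output
print(initial_size)              # 600.0 TB

disk_per_node = 8                # TB available per node
print(initial_size / disk_per_node)   # 75.0 nodes, before replication, intermediate data, compression
```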
Estimating size for Hadoop storage based on
initial data
• Suppose you have to upload X GB of data into HDFS (Hadoop 2.0) with no compression, a replication factor of 3, and an intermediate factor of 0.25 (= ¼). Compute how many times Hadoop's storage requirement grows with respect to the initial data, i.e., X GB.
• H = (R + i) * S / C = (3 + 1/4) * X = 3.25 * X
With the assumptions above, the Hadoop storage is estimated to be 3.25 times the size of the initial data.
H = HDFS storage size
C = Compression ratio. When no compression is used, C = 1.
R = Replication factor. It is usually 3 in a production cluster.
S = Initial size of data that needs to be moved to Hadoop. This could be a combination of historical data and incremental data.
i = intermediate data factor. It is usually 1/3 or 1/4. When no information is given, assume it is zero.
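A minimal sketch of this estimate, assuming the storage relation H = S * (R + i) / C implied by the worked examples in these slides (the function name is illustrative):

```python
def hdfs_storage(S, R=3, i=0.25, C=1.0):
    """Estimated HDFS storage H = S * (R + i) / C, in the same unit as S."""
    return S * (R + i) / C

X = 100   # e.g., 100 GB of initial data
print(hdfs_storage(X))        # 325.0 GB
print(hdfs_storage(X) / X)    # 3.25 times the initial data
```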
If 8 TB is the available disk space per node (10 disks of 1 TB each, with 2 disks excluded for the operating system etc.), and the initial data size is 600 TB, how will you estimate the number of data nodes (n)?
• Estimating the hardware requirement is always challenging in a Hadoop environment because we never know when a business's data storage demand will increase.
• We must understand the following factors in detail to decide how many nodes to add to the cluster in the current scenario:
• The actual size of data to store – 600 TB
• At what pace the data will increase in the future (per day/week/month/quarter/year) – Data trending
analysis or business requirement justification (prediction)
• We are in Hadoop world, so replication factor plays an important role – default 3x replicas
• Hardware machine overhead (OS, logs etc.) – 2 disks were considered
• Intermediate mapper and reducer data output on hard disk - 1x
• Space utilization between 60 % and 70 % – finally, as careful designers we never want our hard drives to run at full capacity.
• Compression ratio
Calculation to find the number of data nodes
required to store 600 TB of data
• Rough Calculation
• Data Size – 600 TB
• Replication factor – 3
• Intermediate data – 1
• Total Storage requirement – (3+1) * 600 = 2400 TB
• Available disk size for storage – 8 TB
• Total number of required data nodes (approx.): n = H/d
• 2400/8 = 300 machines
The mapper output (intermediate data) is stored on the local file system (NOT HDFS) of each individual mapper node. This is typically a temporary directory location which can be set up in the configuration by the Hadoop administrator. The intermediate data is cleaned up after the Hadoop job completes.
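The rough calculation above, restated as a quick sketch with illustrative variable names:

```python
import math

data_size = 600         # TB
replication = 3
intermediate = 1        # 1x intermediate mapper/reducer output
disk_per_node = 8       # TB of raw disk per node

total_storage = (replication + intermediate) * data_size            # 2400 TB
print(total_storage, math.ceil(total_storage / disk_per_node))      # 2400 300
```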
Calculation to find the number of data nodes
required to store 600 TB of data
• Actual Calculation:
• Disk space utilization – 65 % (differs from business to business)
• Compression ratio – 2.3
• Total Storage requirement – 2400/2.3 = 1043.5 TB
• Available disk size for storage – 8*0.65 = 5.2 TB
• Total number of required data nodes (approx.): 1043.5/5.2 = 201 machines
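The refined calculation, folding in the compression ratio and disk utilization from the slide above (again only a sketch of the same arithmetic):

```python
import math

total_storage = 2400         # TB, from the rough calculation above
compression_ratio = 2.3
disk_utilization = 0.65      # keep disks at ~65 % of raw capacity
raw_disk_per_node = 8        # TB

storage_needed = total_storage / compression_ratio         # ~1043.5 TB
usable_per_node = raw_disk_per_node * disk_utilization     # 5.2 TB
print(round(storage_needed, 1), math.ceil(storage_needed / usable_per_node))
# 1043.5 201
```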
Case: The business has predicted a 20 % data increase per quarter, and we need to predict the new machines to be added over a year.
• Data increase – 20 % over a quarter
• 1st quarter: 1043.5 * 0.2 = 208.7 TB
• 2nd quarter: 1043.5 * 1.2 * 0.2 = 250.44 TB
• 3rd quarter: 1043.5 * (1.2)^2 * 0.2 = 300.5 TB
• 4th quarter: 1043.5 * (1.2)^3 * 0.2 = 360.6 TB
• Additional data node requirement (approx.):
• 1st quarter: 208.7/5.2 = 41 machines
• 2nd quarter: 250.44/5.2 = 49 machines
• 3rd quarter: 300.5/5.2 = 58 machines
• 4th quarter: 360.6/5.2 = 70 machines
Compound interest formula: data added in quarter n, A = P (1 + R/100)^(n-1) * (R/100).
Here, P = 1043.5 TB, R = 20 % quarterly, and n = 1, 2, 3, 4 is the quarter number.
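A short sketch of this quarterly projection; the variable names are illustrative and the node counts are rounded up as in the slide:

```python
import math

P = 1043.5              # TB already stored
rate = 0.20             # 20 % growth per quarter
usable_per_node = 5.2   # TB per node at 65 % utilization

for quarter in range(1, 5):
    added_tb = P * (1 + rate) ** (quarter - 1) * rate
    added_nodes = math.ceil(added_tb / usable_per_node)
    print(quarter, round(added_tb, 1), added_nodes)
# 1 208.7 41
# 2 250.4 49
# 3 300.5 58
# 4 360.6 70
```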
Thought Question
• Imagine that you are uploading a file of 1664MB into HDFS (Hadoop 2.0).
• 8 blocks have been successfully uploaded into HDFS. Find how many blocks are remaining.
• Another client wants to work with or read the uploaded data while the upload is still in progress, i.e., the data already written to the 8 blocks. What will happen in this scenario: will the 8 blocks of data that have been uploaded be visible and available for use?