CONFIGURATION PARAMETERS
DFS.BLOCK.SIZE

Author: Amit Anand
Date Created: 7/15/2013

About Me
I am an Oracle Certified Database Administrator and Cloudera Certified Apache Hadoop Administrator. I can be
contacted at anandkamith@gmail.com

Introduction
I have read in many places (blogs, books, etc.) about the precedence order of Hadoop configuration files and how the
configuration parameter “dfs.block.size” is applied when it is defined at multiple levels with different values. These levels are:
• Master: properties defined at the name node / master node level
• Slave: properties defined at the data node / slave node level
• Client: a client can submit two types of commands:
  o A client utility like “hadoop fs -put”
  o A MapReduce job submitted by the client
I always wanted to see the precedence order in action, so I decided to experiment with it and note down my findings,
which encouraged me to write this document. I will try to explain this to the best of my knowledge.

When files are created using MapReduce
The parameter “dfs.block.size” is defined in “hdfs-site.xml” and can have different values on the name node and the data
nodes.
Remember that “dfs.block.size” is a client-side setting and has no effect on the name node (NN) or data node (DN). The only
time the NN or DN configuration comes into play is when files are created using MapReduce.
Example 1: In hdfs-site.xml (used by MapReduce; a client uses this only if it is defined)
<configuration>
  <property>
    <name>dfs.block.size</name>
    <value>67108864</value>
  </property>
</configuration>

The value is given in bytes: 67108864 bytes defines a block size of 64MB.
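The values used throughout this document are byte counts. As a quick check of the arithmetic, the following minimal Python sketch converts between megabytes and bytes, assuming that "MB" here means 1024 * 1024 bytes. The helper names are hypothetical and are not part of Hadoop.

```python
# Bytes in one megabyte as used for dfs.block.size in this document (1024 * 1024).
MIB = 1024 * 1024


def mb_to_bytes(megabytes: int) -> int:
    """Convert a block size in MB to the byte value written in hdfs-site.xml."""
    return megabytes * MIB


def bytes_to_mb(num_bytes: int) -> int:
    """Convert a byte value back to whole MB; reject values that are not whole MB."""
    if num_bytes % MIB != 0:
        raise ValueError(f"{num_bytes} is not a whole number of MB")
    return num_bytes // MIB


if __name__ == "__main__":
    for size_mb in (32, 64, 128):
        print(f"{size_mb}MB = {mb_to_bytes(size_mb)} bytes")
```

The three sizes printed are the ones that appear in the scenarios: 33554432, 67108864 and 134217728 bytes.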

Hadoop environment used
For the purpose of this document I am using the Cloudera distribution CDH3u6 on CentOS 6.4 with Java 1.6 update 26.

Scenarios
Now let's go through each scenario in which the configuration files have different values on the master and slave nodes, and see
the impact on the files that are created in HDFS.

Scenario 1: Configuration file has the same value on master and slave
hdfs-site.xml on master node
<configuration>
  <property>
    <name>dfs.block.size</name>
    <value>67108864</value>
  </property>
</configuration>

hdfs-site.xml on slave node
<configuration>
  <property>
    <name>dfs.block.size</name>
    <value>67108864</value>
  </property>
</configuration>

Outcome: All files are created with a block size of 64MB.

Scenario 2: Configuration file has a different value on the master node
hdfs-site.xml on master node
<configuration>
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value>
  </property>
</configuration>

hdfs-site.xml on slave node
<configuration>
  <property>
    <name>dfs.block.size</name>
    <value>67108864</value>
  </property>
</configuration>

Outcome: All files are created with a block size of 128MB, as defined by hdfs-site.xml on the master node. The master
node has higher precedence than the slave node.

Scenario 3: Configuration file has a different value on the slave node
hdfs-site.xml on master node
<configuration>
  <property>
    <name>dfs.block.size</name>
    <value>67108864</value>
  </property>
</configuration>

hdfs-site.xml on slave node
<configuration>
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value>
  </property>
</configuration>

Outcome: All files are created with a block size of 64MB, as defined by hdfs-site.xml on the master node. The master
node has higher precedence than the slave node.

So far so good. We have seen that the master node takes higher precedence. Let's make this a little more
interesting by adding “<final>true</final>” to the configuration. Remember that a property set with final=true has the
highest precedence and overrides the values defined at all other levels.
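To see which value and final flag a given hdfs-site.xml actually carries, the file can be parsed directly. The following is a minimal Python sketch using only the standard library; the function name and the sample text are illustrative assumptions, and this is not how Hadoop itself loads its configuration.

```python
import xml.etree.ElementTree as ET

# Sample hdfs-site.xml content with the property marked final (illustrative only).
SAMPLE_HDFS_SITE = """
<configuration>
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value>
    <final>true</final>
  </property>
</configuration>
"""


def read_property(xml_text, name):
    """Return (value, is_final) for the named property, or None if it is absent."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if (prop.findtext("name") or "").strip() == name:
            value = (prop.findtext("value") or "").strip()
            # A missing <final> element is treated as final=false.
            is_final = (prop.findtext("final") or "").strip().lower() == "true"
            return value, is_final
    return None


if __name__ == "__main__":
    print(read_property(SAMPLE_HDFS_SITE, "dfs.block.size"))
```

Running the same function against the file on each node is one way to confirm what is configured before comparing outcomes.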
Scenario 4: Configuration file has a different value on the slave node with final=true
hdfs-site.xml on master node
<configuration>
  <property>
    <name>dfs.block.size</name>
    <value>67108864</value>
  </property>
</configuration>

hdfs-site.xml on slave node
<configuration>
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value>
    <final>true</final>
  </property>
</configuration>

Outcome: All files are created with a block size of 128MB, as defined by hdfs-site.xml on the slave node. The slave node
has higher precedence than the master node because the slave node has final=true.

Scenario 5: Configuration file has different values on the master and slave nodes, both with final=true
hdfs-site.xml on master node
<configuration>
  <property>
    <name>dfs.block.size</name>
    <value>67108864</value>
    <final>true</final>
  </property>
</configuration>

hdfs-site.xml on slave node
<configuration>
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value>
    <final>true</final>
  </property>
</configuration>

Outcome: All files are created with a block size of 128MB, as defined by hdfs-site.xml on the slave node. The slave node
has higher precedence than the master node because the slave node has final=true. The configuration on the master node is
ignored in this case.

Scenario 6: Configuration file has different values on multiple slave nodes, with final=true on some of the nodes
hdfs-site.xml on master node
<configuration>
  <property>
    <name>dfs.block.size</name>
    <value>67108864</value>
  </property>
</configuration>

hdfs-site.xml on some of the slave nodes
<configuration>
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value>
    <final>true</final>
  </property>
</configuration>

hdfs-site.xml on some other slave nodes
<configuration>
  <property>
    <name>dfs.block.size</name>
    <value>33554432</value>
  </property>
</configuration>

Outcome:
• Data nodes with final=true create 128MB blocks.
• Data nodes without final=true take the value from the master node and create 64MB blocks.
• In particular, the data nodes configured with a 32MB block size (without final=true) create 64MB blocks, as specified
by the master node.

Scenario 7: Configuration file has different values on multiple slave nodes, with final=true on all the nodes
hdfs-site.xml on master node
<configuration>
  <property>
    <name>dfs.block.size</name>
    <value>67108864</value>
    <final>true</final>
  </property>
</configuration>

hdfs-site.xml on some of the slave nodes
<configuration>
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value>
    <final>true</final>
  </property>
</configuration>

hdfs-site.xml on some other slave nodes
<configuration>
  <property>
    <name>dfs.block.size</name>
    <value>33554432</value>
    <final>true</final>
  </property>
</configuration>

Outcome:
• Data nodes with final=true and a configured block size of 128MB create 128MB blocks.
• Data nodes with final=true and a configured block size of 32MB create 32MB blocks.
• Any data nodes that do not have final=true create 64MB blocks, as defined by the master node.
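The outcomes of Scenarios 1 to 7 can be summarized as one rule: a slave node's value is used only when that slave marks it final=true; otherwise the master node's value is used. The Python sketch below is only a model of those reported outcomes, with hypothetical names, and does not reproduce Hadoop's internal logic.

```python
from typing import NamedTuple


class Setting(NamedTuple):
    """A dfs.block.size entry as it appears in one node's hdfs-site.xml."""
    value: int           # block size in bytes
    final: bool = False  # True when <final>true</final> is present


MB = 1024 * 1024


def observed_block_size(master: Setting, slave: Setting) -> int:
    """Model of the outcomes reported above for files created by MapReduce.

    The slave value wins only if the slave marks it final; the master's own
    final flag did not change the result in any of the scenarios.
    """
    return slave.value if slave.final else master.value


if __name__ == "__main__":
    # Scenario 3: slave differs but is not final, so the master's 64MB applies.
    print(observed_block_size(Setting(64 * MB), Setting(128 * MB)) // MB)
    # Scenario 5: both are final, and the slave's 128MB applies.
    print(observed_block_size(Setting(64 * MB, True), Setting(128 * MB, True)) // MB)
```

Note that the master's final flag never appears in the rule, which matches Scenario 5, where the master's final=true setting was ignored.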

When files are created using a client-side utility
The configuration parameter “dfs.block.size” defined in hdfs-site.xml on the name node and data nodes is completely
ignored when files are created using a client utility like the one given below (Example 2). The client-side hdfs-site.xml
takes precedence over all others. Setting “dfs.block.size” with final=true in hdfs-site.xml on the name node and data nodes
is ignored as well. If no value is defined for “dfs.block.size” in the client-side hdfs-site.xml, then the Hadoop
default of 64MB is used as the block size.
Example 2: Hadoop command line
hadoop fs -D dfs.block.size=67108864 -put somelargedatafile.txt /user/aanand
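The client-side lookup described in this section can be sketched as a simple fallback chain. This Python model is an illustration with hypothetical names. It assumes that a value passed with -D is preferred over the client-side hdfs-site.xml, which is consistent with the command-line scenarios in this document but is not stated for every combination.

```python
# Hadoop default block size reported in this document for the tested release (64MB).
DEFAULT_BLOCK_SIZE = 64 * 1024 * 1024


def client_block_size(command_line_value=None, client_site_value=None):
    """Model of how a client utility such as "hadoop fs -put" picks a block size.

    command_line_value: value passed as -D dfs.block.size=..., if any.
    client_site_value: value from the client-side hdfs-site.xml, if any.
    Name node and data node settings are deliberately not parameters,
    because the document reports that they are ignored here.
    """
    if command_line_value is not None:
        return command_line_value
    if client_site_value is not None:
        return client_site_value
    return DEFAULT_BLOCK_SIZE


if __name__ == "__main__":
    print(client_block_size())                              # falls back to the default
    print(client_block_size(client_site_value=134217728))   # client-side file value
    print(client_block_size(command_line_value=33554432))   # -D value
```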

Scenario 8: Configuration file on the master node has final=true and the data nodes do not have final=true. The file is
transferred with the block size passed as a parameter on the command line
hdfs-site.xml on master node
<configuration>
  <property>
    <name>dfs.block.size</name>
    <value>67108864</value>
    <final>true</final>
  </property>
</configuration>

hdfs-site.xml on the slave nodes
<configuration>
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value>
  </property>
</configuration>

Command executed:
hadoop fs -D dfs.block.size=33554432 -put /tmp/somelargefile.txt /user/aanand

Outcome:
• The file is created with a block size of 32MB even though “dfs.block.size” is defined with final=true on the name
node.
• The NN / DN configuration files have no impact on the client side. The client reads the value from its own hdfs-site.xml
if it is defined there; the Hadoop default of 64MB is used if the client-side hdfs-site.xml does not define a value.
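The practical effect of the chosen block size is the number of blocks a file is split into: HDFS fills blocks up to the block size and the last block holds whatever remains. The Python sketch below works through that arithmetic for a hypothetical 200MB file. The file size is an assumption for illustration, since the document does not state the size of somelargefile.txt.

```python
MB = 1024 * 1024


def block_layout(file_size: int, block_size: int) -> list:
    """Return the size in bytes of each block a file of file_size would occupy."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block is smaller than the block size
    return sizes


if __name__ == "__main__":
    file_size = 200 * MB  # hypothetical file size, not taken from the document
    for block_mb in (32, 64, 128):
        layout = block_layout(file_size, block_mb * MB)
        print(f"{block_mb}MB block size: {len(layout)} blocks, "
              f"last block {layout[-1] // MB}MB")
```

For this hypothetical file, a 32MB block size gives six full blocks plus an 8MB block, while a 128MB block size gives one full block plus a 72MB block.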

Scenario 9: Configuration file on the master node has final=true and the data nodes also have final=true. The file is
transferred with the block size passed as a parameter on the command line
hdfs-site.xml on master node
<configuration>
  <property>
    <name>dfs.block.size</name>
    <value>67108864</value>
    <final>true</final>
  </property>
</configuration>

hdfs-site.xml on the slave nodes
<configuration>
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value>
    <final>true</final>
  </property>
</configuration>

Command executed:
hadoop fs -D dfs.block.size=33554432 -put /tmp/somelargefile.txt /user/aanand

Outcome: The file is created with a block size of 32MB even though “dfs.block.size” is defined with final=true on the
name node and the data nodes.

Conclusion
• For a client-side utility such as “hadoop fs”, the client reads the hdfs-site.xml defined on the client side and uses
its value of “dfs.block.size”.
• If no value is defined on the client side, the Hadoop default of 64MB is used.
• The Hadoop default can be overridden by specifying the parameter with the -D option.
• For a MapReduce job, hdfs-site.xml follows the precedence order explained above.

6

More Related Content

What's hot

How to configure the cluster based on Multi-site (WAN) configuration
How to configure the clusterbased on Multi-site (WAN) configurationHow to configure the clusterbased on Multi-site (WAN) configuration
How to configure the cluster based on Multi-site (WAN) configuration
Akihiro Kitada
 
Hadoop single node installation on ubuntu 14
Hadoop single node installation on ubuntu 14Hadoop single node installation on ubuntu 14
Hadoop single node installation on ubuntu 14
jijukjoseph
 
Introduction to HDFS
Introduction to HDFSIntroduction to HDFS
Introduction to HDFS
Siddharth Mathur
 
plProxy, pgBouncer, pgBalancer
plProxy, pgBouncer, pgBalancerplProxy, pgBouncer, pgBalancer
plProxy, pgBouncer, pgBalancer
elliando dias
 
PostGreSQL Performance Tuning
PostGreSQL Performance TuningPostGreSQL Performance Tuning
PostGreSQL Performance Tuning
Maven Logix
 
DevOpsDays Warsaw 2015: Running High Performance And Fault Tolerant Elasticse...
DevOpsDays Warsaw 2015: Running High Performance And Fault Tolerant Elasticse...DevOpsDays Warsaw 2015: Running High Performance And Fault Tolerant Elasticse...
DevOpsDays Warsaw 2015: Running High Performance And Fault Tolerant Elasticse...
PROIDEA
 
Cassandra 2.1 boot camp, Protocol, Queries, CQL
Cassandra 2.1 boot camp, Protocol, Queries, CQLCassandra 2.1 boot camp, Protocol, Queries, CQL
Cassandra 2.1 boot camp, Protocol, Queries, CQL
Joshua McKenzie
 
Thijs Feryn - Leverage HTTP to deliver cacheable websites - Codemotion Berlin...
Thijs Feryn - Leverage HTTP to deliver cacheable websites - Codemotion Berlin...Thijs Feryn - Leverage HTTP to deliver cacheable websites - Codemotion Berlin...
Thijs Feryn - Leverage HTTP to deliver cacheable websites - Codemotion Berlin...
Codemotion
 
Out of the box replication in postgres 9.4(pg confus)
Out of the box replication in postgres 9.4(pg confus)Out of the box replication in postgres 9.4(pg confus)
Out of the box replication in postgres 9.4(pg confus)
Denish Patel
 
Mongodb replication
Mongodb replicationMongodb replication
Mongodb replication
PoguttuezhiniVP
 
Postgres the hardway
Postgres the hardwayPostgres the hardway
Postgres the hardway
Dave Pitts
 
SQL Server vs Postgres
SQL Server vs PostgresSQL Server vs Postgres
Percona Toolkit for Effective MySQL Administration
Percona Toolkit for Effective MySQL AdministrationPercona Toolkit for Effective MySQL Administration
Percona Toolkit for Effective MySQL Administration
Mydbops
 
Dns presentation
Dns presentationDns presentation
Dns presentation
gaurav_c
 
PostgreSQL and RAM usage
PostgreSQL and RAM usagePostgreSQL and RAM usage
PostgreSQL and RAM usage
Alexey Bashtanov
 
phptek13 - Caching and tuning fun tutorial
phptek13 - Caching and tuning fun tutorialphptek13 - Caching and tuning fun tutorial
phptek13 - Caching and tuning fun tutorial
Wim Godden
 
MariaDB for developers
MariaDB for developersMariaDB for developers
MariaDB for developers
Colin Charles
 
MariaDB, MySQL and Ansible: automating database infrastructures
MariaDB, MySQL and Ansible: automating database infrastructuresMariaDB, MySQL and Ansible: automating database infrastructures
MariaDB, MySQL and Ansible: automating database infrastructures
Federico Razzoli
 
Out of the box replication in postgres 9.4
Out of the box replication in postgres 9.4Out of the box replication in postgres 9.4
Out of the box replication in postgres 9.4
Denish Patel
 

What's hot (19)

How to configure the cluster based on Multi-site (WAN) configuration
How to configure the clusterbased on Multi-site (WAN) configurationHow to configure the clusterbased on Multi-site (WAN) configuration
How to configure the cluster based on Multi-site (WAN) configuration
 
Hadoop single node installation on ubuntu 14
Hadoop single node installation on ubuntu 14Hadoop single node installation on ubuntu 14
Hadoop single node installation on ubuntu 14
 
Introduction to HDFS
Introduction to HDFSIntroduction to HDFS
Introduction to HDFS
 
plProxy, pgBouncer, pgBalancer
plProxy, pgBouncer, pgBalancerplProxy, pgBouncer, pgBalancer
plProxy, pgBouncer, pgBalancer
 
PostGreSQL Performance Tuning
PostGreSQL Performance TuningPostGreSQL Performance Tuning
PostGreSQL Performance Tuning
 
DevOpsDays Warsaw 2015: Running High Performance And Fault Tolerant Elasticse...
DevOpsDays Warsaw 2015: Running High Performance And Fault Tolerant Elasticse...DevOpsDays Warsaw 2015: Running High Performance And Fault Tolerant Elasticse...
DevOpsDays Warsaw 2015: Running High Performance And Fault Tolerant Elasticse...
 
Cassandra 2.1 boot camp, Protocol, Queries, CQL
Cassandra 2.1 boot camp, Protocol, Queries, CQLCassandra 2.1 boot camp, Protocol, Queries, CQL
Cassandra 2.1 boot camp, Protocol, Queries, CQL
 
Thijs Feryn - Leverage HTTP to deliver cacheable websites - Codemotion Berlin...
Thijs Feryn - Leverage HTTP to deliver cacheable websites - Codemotion Berlin...Thijs Feryn - Leverage HTTP to deliver cacheable websites - Codemotion Berlin...
Thijs Feryn - Leverage HTTP to deliver cacheable websites - Codemotion Berlin...
 
Out of the box replication in postgres 9.4(pg confus)
Out of the box replication in postgres 9.4(pg confus)Out of the box replication in postgres 9.4(pg confus)
Out of the box replication in postgres 9.4(pg confus)
 
Mongodb replication
Mongodb replicationMongodb replication
Mongodb replication
 
Postgres the hardway
Postgres the hardwayPostgres the hardway
Postgres the hardway
 
SQL Server vs Postgres
SQL Server vs PostgresSQL Server vs Postgres
SQL Server vs Postgres
 
Percona Toolkit for Effective MySQL Administration
Percona Toolkit for Effective MySQL AdministrationPercona Toolkit for Effective MySQL Administration
Percona Toolkit for Effective MySQL Administration
 
Dns presentation
Dns presentationDns presentation
Dns presentation
 
PostgreSQL and RAM usage
PostgreSQL and RAM usagePostgreSQL and RAM usage
PostgreSQL and RAM usage
 
phptek13 - Caching and tuning fun tutorial
phptek13 - Caching and tuning fun tutorialphptek13 - Caching and tuning fun tutorial
phptek13 - Caching and tuning fun tutorial
 
MariaDB for developers
MariaDB for developersMariaDB for developers
MariaDB for developers
 
MariaDB, MySQL and Ansible: automating database infrastructures
MariaDB, MySQL and Ansible: automating database infrastructuresMariaDB, MySQL and Ansible: automating database infrastructures
MariaDB, MySQL and Ansible: automating database infrastructures
 
Out of the box replication in postgres 9.4
Out of the box replication in postgres 9.4Out of the box replication in postgres 9.4
Out of the box replication in postgres 9.4
 

Similar to hadoop

ACADGILD:: HADOOP LESSON
ACADGILD:: HADOOP LESSON ACADGILD:: HADOOP LESSON
ACADGILD:: HADOOP LESSON
Padma shree. T
 
Power Hadoop Cluster with AWS Cloud
Power Hadoop Cluster with AWS CloudPower Hadoop Cluster with AWS Cloud
Power Hadoop Cluster with AWS Cloud
Edureka!
 
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Simplilearn
 
Learn to setup a Hadoop Multi Node Cluster
Learn to setup a Hadoop Multi Node ClusterLearn to setup a Hadoop Multi Node Cluster
Learn to setup a Hadoop Multi Node Cluster
Edureka!
 
Hadoop operations basic
Hadoop operations basicHadoop operations basic
Hadoop operations basic
Hafizur Rahman
 
Hadoop HDFS Concepts
Hadoop HDFS ConceptsHadoop HDFS Concepts
Hadoop HDFS Concepts
ProTechSkills Training
 
Hadoop Interacting with HDFS
Hadoop Interacting with HDFSHadoop Interacting with HDFS
Hadoop Interacting with HDFS
Apache Apex
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for Beginners
Rahul Jain
 
Hadoop HDFS Concepts
Hadoop HDFS ConceptsHadoop HDFS Concepts
Hadoop HDFS Concepts
tutorialvillage
 
Configure h base hadoop and hbase client
Configure h base hadoop and hbase clientConfigure h base hadoop and hbase client
Configure h base hadoop and hbase client
Shashwat Shriparv
 
Introduction to hadoop administration jk
Introduction to hadoop administration   jkIntroduction to hadoop administration   jk
Introduction to hadoop administration jk
Edureka!
 
Top 5 Hadoop Admin Tasks
Top 5 Hadoop Admin TasksTop 5 Hadoop Admin Tasks
Top 5 Hadoop Admin Tasks
Edureka!
 
Webinar: Top 5 Hadoop Admin Tasks
Webinar: Top 5 Hadoop Admin TasksWebinar: Top 5 Hadoop Admin Tasks
Webinar: Top 5 Hadoop Admin Tasks
Edureka!
 
Secure Hadoop Cluster With Kerberos
Secure Hadoop Cluster With KerberosSecure Hadoop Cluster With Kerberos
Secure Hadoop Cluster With Kerberos
Edureka!
 
Hadoop
HadoopHadoop
Hadoop
Dinakar nk
 
Hadoop training institute in bangalore
Hadoop training institute in bangaloreHadoop training institute in bangalore
Hadoop training institute in bangalore
Kelly Technologies
 
Hadoop training institute in hyderabad
Hadoop training institute in hyderabadHadoop training institute in hyderabad
Hadoop training institute in hyderabad
Kelly Technologies
 
Hadoop HDFS Architeture and Design
Hadoop HDFS Architeture and DesignHadoop HDFS Architeture and Design
Hadoop HDFS Architeture and Design
sudhakara st
 
Learn Hadoop Administration
Learn Hadoop AdministrationLearn Hadoop Administration
Learn Hadoop Administration
Edureka!
 
DataStax | Building a Spark Streaming App with DSE File System (Rocco Varela)...
DataStax | Building a Spark Streaming App with DSE File System (Rocco Varela)...DataStax | Building a Spark Streaming App with DSE File System (Rocco Varela)...
DataStax | Building a Spark Streaming App with DSE File System (Rocco Varela)...
DataStax
 

Similar to hadoop (20)

ACADGILD:: HADOOP LESSON
ACADGILD:: HADOOP LESSON ACADGILD:: HADOOP LESSON
ACADGILD:: HADOOP LESSON
 
Power Hadoop Cluster with AWS Cloud
Power Hadoop Cluster with AWS CloudPower Hadoop Cluster with AWS Cloud
Power Hadoop Cluster with AWS Cloud
 
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
 
Learn to setup a Hadoop Multi Node Cluster
Learn to setup a Hadoop Multi Node ClusterLearn to setup a Hadoop Multi Node Cluster
Learn to setup a Hadoop Multi Node Cluster
 
Hadoop operations basic
Hadoop operations basicHadoop operations basic
Hadoop operations basic
 
Hadoop HDFS Concepts
Hadoop HDFS ConceptsHadoop HDFS Concepts
Hadoop HDFS Concepts
 
Hadoop Interacting with HDFS
Hadoop Interacting with HDFSHadoop Interacting with HDFS
Hadoop Interacting with HDFS
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for Beginners
 
Hadoop HDFS Concepts
Hadoop HDFS ConceptsHadoop HDFS Concepts
Hadoop HDFS Concepts
 
Configure h base hadoop and hbase client
Configure h base hadoop and hbase clientConfigure h base hadoop and hbase client
Configure h base hadoop and hbase client
 
Introduction to hadoop administration jk
Introduction to hadoop administration   jkIntroduction to hadoop administration   jk
Introduction to hadoop administration jk
 
Top 5 Hadoop Admin Tasks
Top 5 Hadoop Admin TasksTop 5 Hadoop Admin Tasks
Top 5 Hadoop Admin Tasks
 
Webinar: Top 5 Hadoop Admin Tasks
Webinar: Top 5 Hadoop Admin TasksWebinar: Top 5 Hadoop Admin Tasks
Webinar: Top 5 Hadoop Admin Tasks
 
Secure Hadoop Cluster With Kerberos
Secure Hadoop Cluster With KerberosSecure Hadoop Cluster With Kerberos
Secure Hadoop Cluster With Kerberos
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop training institute in bangalore
Hadoop training institute in bangaloreHadoop training institute in bangalore
Hadoop training institute in bangalore
 
Hadoop training institute in hyderabad
Hadoop training institute in hyderabadHadoop training institute in hyderabad
Hadoop training institute in hyderabad
 
Hadoop HDFS Architeture and Design
Hadoop HDFS Architeture and DesignHadoop HDFS Architeture and Design
Hadoop HDFS Architeture and Design
 
Learn Hadoop Administration
Learn Hadoop AdministrationLearn Hadoop Administration
Learn Hadoop Administration
 
DataStax | Building a Spark Streaming App with DSE File System (Rocco Varela)...
DataStax | Building a Spark Streaming App with DSE File System (Rocco Varela)...DataStax | Building a Spark Streaming App with DSE File System (Rocco Varela)...
DataStax | Building a Spark Streaming App with DSE File System (Rocco Varela)...
 

Recently uploaded

Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
ssuserfac0301
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
 
Recommendation System using RAG Architecture
Recommendation System using RAG ArchitectureRecommendation System using RAG Architecture
Recommendation System using RAG Architecture
fredae14
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
Brandon Minnick, MBA
 
AWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptxAWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptx
HarisZaheer8
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
kumardaparthi1024
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
Tatiana Kojar
 
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Tatiana Kojar
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
 
Finale of the Year: Apply for Next One!
Finale of the Year: Apply for Next One!Finale of the Year: Apply for Next One!
Finale of the Year: Apply for Next One!
GDSC PJATK
 
UI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentationUI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentation
Wouter Lemaire
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
saastr
 
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
alexjohnson7307
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
DanBrown980551
 
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Jeffrey Haguewood
 
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdf
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdfNunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdf
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdf
flufftailshop
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
Ivanti
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
Jakub Marek
 

Recently uploaded (20)

Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
Recommendation System using RAG Architecture
Recommendation System using RAG ArchitectureRecommendation System using RAG Architecture
Recommendation System using RAG Architecture
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
 
AWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptxAWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptx
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
 
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 
Finale of the Year: Apply for Next One!
Finale of the Year: Apply for Next One!Finale of the Year: Apply for Next One!
Finale of the Year: Apply for Next One!
 
UI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentationUI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentation
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
 
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
 
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
 
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdf
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdfNunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdf
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdf
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
 

hadoop

  • 2. About Me I am an Oracle Certified Database Administrator and Cloudera Certified Apache Hadoop Administrator. I can be contacted at anandkamith@gmail.com Introduction I have read at many places (blogs, books etc.) about the precedence order of HADOOP configuration files and how the configuration parameter “dfs.block.size” is used if defined at multiple levels with different values. These levels are defined as:  Master – When properties are defined at name node / master node level  Slave – When properties are defined at data node / slave node level  Client – There are two type of commands that can be submitted by client o A client utility like “hadoop fs –put” o A MapReduce job submitted by the client I always wanted to see the precedence order in action and hence decide to play with it a little and note down my findings that also encouraged me to write this document. I will try to explain this to the best of my knowledge. When files are created using MapReduce The parameter “dfs.block.size” is defined in “hdfs-site.xml” and can have different values between name node and data nodes. Remember that “dfs.block.size” is client specific and has no effect on NN or DN. The only time NN or DN configuration comes into play is when files are created using MapReduce. Example 1: In hdfs-site.xml – Used by MapReduce. Client uses this only if it is defined <configuration> <property> <name>dfs.block.size</name> <value>67108864</value> </property> </configuration> Defined as 64MB Hadoop environment used For the purpose of this document I am using Cloudera distribution cdh3u6 on Centos 6.4 with Java 1.6 update 26 Scenarios Now let's go through each case scenario where the configuration files have different values between master/slave and see the impact of it on the files that are created in HDFS. 2
  • 3. Scenario 1: Configuration file has same value on master and slave

hdfs-site.xml on master node:
<configuration>
<property>
<name>dfs.block.size</name>
<value>67108864</value>
</property>
</configuration>

hdfs-site.xml on slave node:
<configuration>
<property>
<name>dfs.block.size</name>
<value>67108864</value>
</property>
</configuration>

Outcome: All the files will be created with a 64MB block size.

Scenario 2: Configuration file has different value on master node

hdfs-site.xml on master node:
<configuration>
<property>
<name>dfs.block.size</name>
<value>134217728</value>
</property>
</configuration>

hdfs-site.xml on slave node:
<configuration>
<property>
<name>dfs.block.size</name>
<value>67108864</value>
</property>
</configuration>

Outcome: All the files will be created with a 128MB block size, as defined by hdfs-site.xml on the master node. The master node has higher precedence than the slave node.

Scenario 3: Configuration file has different value on slave node

hdfs-site.xml on master node:
<configuration>
<property>
<name>dfs.block.size</name>
<value>67108864</value>
</property>
</configuration>

hdfs-site.xml on slave node:
<configuration>
<property>
<name>dfs.block.size</name>
<value>134217728</value>
</property>
</configuration>

Outcome: All the files will be created with a 64MB block size, as defined by hdfs-site.xml on the master node. The master node has higher precedence than the slave node.

So far so good. We have seen that the master node takes higher precedence. Let's make this a little more interesting by adding “<final>true</final>” to the configuration. Remember that a value set with final=true has the highest precedence and overrides all values defined at other levels.
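One way to check outcomes like those above is to read back the block size that HDFS recorded for a file. The sketch below wraps `hadoop fs -stat %o`, which prints a file's block size in bytes. The function names are made up for illustration, and it assumes the `hadoop` command is on the PATH of the machine where it runs.

```shell
#!/bin/sh
# Hypothetical helper: print the block size (in bytes) recorded for a file in HDFS.
# %o is the block-size format specifier of "hadoop fs -stat".
hdfs_block_size() {
    hadoop fs -stat %o "$1"
}

# Hypothetical helper: succeed only if the file's block size
# equals the expected number of bytes.
#   $1 = HDFS path, $2 = expected block size in bytes
assert_block_size() {
    actual=$(hdfs_block_size "$1") || return 1
    if [ "$actual" = "$2" ]; then
        echo "OK: $1 uses ${actual}-byte blocks"
    else
        echo "MISMATCH: $1 uses ${actual}-byte blocks, expected $2" >&2
        return 1
    fi
}

# For a per-block listing of the same file, one can also run:
#   hadoop fsck <path> -files -blocks
```

For example, `assert_block_size /path/to/output/part-00000 67108864` would check a MapReduce output file against the 64MB expected in Scenario 1; the path here is a placeholder, not one taken from the tests in this document.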
Scenario 4: Configuration file has different value on slave node with final=true

hdfs-site.xml on master node:
<configuration>
<property>
<name>dfs.block.size</name>
<value>67108864</value>
</property>
</configuration>

hdfs-site.xml on slave node:
<configuration>
<property>
<name>dfs.block.size</name>
<value>134217728</value>
<final>true</final>
</property>
</configuration>
  • 4. Outcome: All the files will be created with a 128MB block size, as defined by hdfs-site.xml on the slave node. The slave node has higher precedence than the master node because the slave node has final=true.

Scenario 5: Configuration file has different value on master and slave node with final=true

hdfs-site.xml on master node:
<configuration>
<property>
<name>dfs.block.size</name>
<value>67108864</value>
<final>true</final>
</property>
</configuration>

hdfs-site.xml on slave node:
<configuration>
<property>
<name>dfs.block.size</name>
<value>134217728</value>
<final>true</final>
</property>
</configuration>

Outcome: All the files will be created with a 128MB block size, as defined by hdfs-site.xml on the slave node. The slave node has higher precedence than the master node because the slave node has final=true. The configuration on the master node is ignored in this case.

Scenario 6: Configuration file has different value on multiple slave nodes with final=true on some of the nodes

hdfs-site.xml on master node:
<configuration>
<property>
<name>dfs.block.size</name>
<value>67108864</value>
</property>
</configuration>

hdfs-site.xml on some of the slave nodes:
<configuration>
<property>
<name>dfs.block.size</name>
<value>134217728</value>
<final>true</final>
</property>
</configuration>

hdfs-site.xml on some other slave nodes:
<configuration>
<property>
<name>dfs.block.size</name>
<value>33554432</value>
</property>
</configuration>

Outcome:
- Data nodes with final=true will create blocks of 128MB
- Data nodes that do not have final=true will take the value from the master node and will create blocks of 64MB
- Data nodes that have a 32MB block size configured will create blocks of 64MB, as specified by the master node
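The outcomes reported in Scenarios 1 to 6 follow one rule of thumb: a slave node's value wins only when it is marked final=true, otherwise the master node's value is used. The sketch below is a toy model of that reported behaviour, not Hadoop's own resolution code. The function name `effective_block_size` and its arguments are made up for illustration.

```shell
#!/bin/sh
# Toy model of the outcomes reported in Scenarios 1-6; this is NOT Hadoop code.
#   $1 = dfs.block.size on the master node
#   $2 = dfs.block.size on the slave node
#   $3 = "true" if the slave node marks the property final, else "false"
effective_block_size() {
    master_value=$1
    slave_value=$2
    slave_final=$3
    if [ "$slave_final" = "true" ]; then
        echo "$slave_value"     # slave value is final, so it wins
    else
        echo "$master_value"    # otherwise the master value is used
    fi
}

effective_block_size 134217728 67108864 false   # Scenario 2 -> 134217728
effective_block_size 67108864 134217728 true    # Scenario 4 -> 134217728
effective_block_size 67108864 33554432 false    # Scenario 6, slaves without final -> 67108864
```

The model ignores whether the master marks the property final, because in Scenario 5 the slave's final value was used even though the master also set final=true.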
  • 5. Scenario 7: Configuration file has different value on multiple slave nodes with final=true on all the nodes

hdfs-site.xml on master node:
<configuration>
<property>
<name>dfs.block.size</name>
<value>67108864</value>
<final>true</final>
</property>
</configuration>

hdfs-site.xml on some of the slave nodes:
<configuration>
<property>
<name>dfs.block.size</name>
<value>134217728</value>
<final>true</final>
</property>
</configuration>

hdfs-site.xml on some other slave nodes:
<configuration>
<property>
<name>dfs.block.size</name>
<value>33554432</value>
<final>true</final>
</property>
</configuration>

Outcome:
- Data nodes with final=true and a configured block size of 128MB will create blocks of 128MB
- Data nodes with final=true and a configured block size of 32MB will create blocks of 32MB
- Data nodes that do not have final=true will create blocks of 64MB, as defined by the master node

When files are created using client side utility
The configuration parameter “dfs.block.size” defined in hdfs-site.xml on the name node and data nodes is completely ignored when files are created using a client utility like the one given below (Example 2). The client side hdfs-site.xml takes precedence over all others. Setting “dfs.block.size” with final=true in hdfs-site.xml on the name node or data nodes is ignored as well. If no value is defined for “dfs.block.size” in the client side hdfs-site.xml, the Hadoop default of 64MB is used as the block size.

Example 2: Hadoop command line
hadoop fs -D dfs.block.size=67108864 -put somelargedatafile.txt /user/aanand

Scenario 8: Configuration file on master node has final=true and data nodes do not have final=true. The file is being transferred with the block size defined as a parameter on the command line.

hdfs-site.xml on master node:
<configuration>
<property>
<name>dfs.block.size</name>
<value>67108864</value>
<final>true</final>
</property>
</configuration>

hdfs-site.xml on the slave nodes:
<configuration>
<property>
<name>dfs.block.size</name>
<value>134217728</value>
</property>
</configuration>

Command executed:
hadoop fs -D dfs.block.size=33554432 -put /tmp/somelargefile.txt /user/aanand
  • 6. Outcome:
- The file is created with a 32MB block size even though “dfs.block.size” is defined with final=true on the name node.
- The NN / DN configuration files have no impact on the client side. The client reads the value from its own hdfs-site.xml if one is defined there. The Hadoop default of 64MB is used if the client side hdfs-site.xml does not define any value.

Scenario 9: Configuration file on master node has final=true and data nodes also have final=true. The file is being transferred with the block size defined as a parameter on the command line.

hdfs-site.xml on master node:
<configuration>
<property>
<name>dfs.block.size</name>
<value>67108864</value>
<final>true</final>
</property>
</configuration>

hdfs-site.xml on the slave nodes:
<configuration>
<property>
<name>dfs.block.size</name>
<value>134217728</value>
<final>true</final>
</property>
</configuration>

Command executed:
hadoop fs -D dfs.block.size=33554432 -put /tmp/somelargefile.txt /user/aanand

Outcome: The file is created with a 32MB block size even though “dfs.block.size” is defined with final=true on both the name node and the data nodes.

Conclusion
- In the case of a client side utility such as the “hadoop fs” command, the client reads the hdfs-site.xml defined on the client side and uses its value of “dfs.block.size”.
- If no value is defined on the client side, the Hadoop default of 64MB is used.
- The Hadoop default can be overridden by specifying the parameter with the -D option.
- In the case of a MapReduce job, hdfs-site.xml follows the precedence order explained above.
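The client side part of the conclusion can be summarised as a lookup order. The sketch below is a toy model of it, not Hadoop code. It assumes that a value given with -D is checked first, then the client side hdfs-site.xml, then the 64MB default. That a -D value also beats a value in the client side hdfs-site.xml is an assumption here, since the scenarios above only tested -D against the name node and data node settings. The function name `resolve_client_block_size` is made up for illustration.

```shell
#!/bin/sh
# Toy model of the client side lookup order described in the conclusion.
#   $1 = value passed with -D on the command line (empty if none)
#   $2 = value from the client side hdfs-site.xml (empty if none)
# Assumption: -D is checked before the client side file.
resolve_client_block_size() {
    cli_value=$1
    client_conf_value=$2
    default_value=67108864   # Hadoop default of 64MB, as stated in this document
    if [ -n "$cli_value" ]; then
        echo "$cli_value"
    elif [ -n "$client_conf_value" ]; then
        echo "$client_conf_value"
    else
        echo "$default_value"
    fi
}

resolve_client_block_size 33554432 ""    # -D given, as in Scenarios 8 and 9 -> 33554432
resolve_client_block_size "" 134217728   # only client side hdfs-site.xml -> 134217728
resolve_client_block_size "" ""          # nothing set on the client -> 67108864
```

The name node and data node values do not appear as arguments at all, which mirrors the finding that they have no effect on files written by a client utility.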