Performance Analysis
Big Data Environment
Shivkumar Babshetty
Sr. Big Data Administrator
1) Replication factor
In the existing environment the replication factor is set to 3, with 5 data nodes and 3 namespace nodes: 1 primary NameNode and 1 secondary NameNode.
Data replication happens asynchronously; HDFS gives no guarantee on the minimum number of replicas unless one is configured.
Set dfs.replication.min=1 in hdfs-site.xml (currently not set). If we set it, at least 1 copy of the data must be on disk before a write is acknowledged; the remaining replicas follow asynchronously.
dfs.replication=3 is already set in the environment.
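A minimal hdfs-site.xml sketch of both settings (dfs.replication.min is the Hadoop 1.x name; on Hadoop 2.x+ it maps to dfs.namenode.replication.min):

<property>
  <name>dfs.replication</name>
  <value>3</value> <!-- replicas kept per block -->
</property>
<property>
  <name>dfs.replication.min</name>
  <value>1</value> <!-- a write succeeds once 1 replica is on disk -->
</property>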
2) YARN Recommendation
The YARN scheduler minimum allocation is 3584 MB and the maximum is 51200 MB; vcores range from 1 (minimum) to 32 (maximum).
The YARN scheduler increment-allocation values for vcores and MB are not set in the existing environment. These parameters round each container request up to an aligned increment, making sure containers get properly sized memory allocations.
Increase the minimum to 4 vcores, set the vcore increment to 2, and set the memory increment to 1792 MB, as sketched below.
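A yarn-site.xml sketch combining the existing min/max values with the recommended increments (the increment-allocation properties are honored by the Fair Scheduler):

<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>3584</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>51200</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-vcores</name>
  <value>4</value> <!-- recommended increase from 1 -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-vcores</name>
  <value>32</value>
</property>
<property>
  <name>yarn.scheduler.increment-allocation-mb</name>
  <value>1792</value> <!-- memory requests rounded up to multiples of 1792 MB -->
</property>
<property>
  <name>yarn.scheduler.increment-allocation-vcores</name>
  <value>2</value> <!-- vcore requests rounded up in steps of 2 -->
</property>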
3) MapReduce Application Master
The YARN application master memory and cores are not explicitly allocated. Allocating them in the existing environment can improve the performance of the AM for each job.
Set yarn.app.mapreduce.am.resource.cpu-vcores=1 and yarn.app.mapreduce.am.resource.mb=1024.
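A mapred-site.xml sketch of these ApplicationMaster settings:

<property>
  <name>yarn.app.mapreduce.am.resource.mb</name>
  <value>1024</value> <!-- memory for the MapReduce ApplicationMaster container -->
</property>
<property>
  <name>yarn.app.mapreduce.am.resource.cpu-vcores</name>
  <value>1</value> <!-- vcores for the ApplicationMaster container -->
</property>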
5) JVM reuse
The mapred.job.reuse.jvm.num.tasks property sets the maximum number of tasks from a single job that will be executed in a single JVM. The default value is 1.
If we set the value to -1 in mapred-site.xml, there is no limit on the number of tasks a JVM can run.
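A mapred-site.xml sketch (note this is an MRv1 property; MRv2 on YARN does not support JVM reuse across tasks):

<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value> <!-- -1 = unlimited tasks per JVM within one job -->
</property>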
6) Split size of MAP input
Raising the minimum split size of the MAP input will improve performance; if splits are small, the job spawns many short map tasks and the reduce phase carries the overhead of fetching many small map output files.
This recommendation assumes block_size=256 MB.
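A sketch of the corresponding mapred-site.xml setting (MRv2 property name; the older equivalent is mapred.min.split.size), assuming the 256 MB target:

<property>
  <name>mapreduce.input.fileinputformat.split.minsize</name>
  <value>268435456</value> <!-- 256 MB minimum input split -->
</property>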
7) Block Size
If we increase the block size from the existing 128 MB to 256 MB, job performance will improve: each MAP task gets a larger split, so fewer map tasks are created.
We run long jobs in the cluster for 4-6 hours on big input data, around 15 to 40 GB of compressed data.
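An hdfs-site.xml sketch for the 256 MB block size (it applies only to files written after the change; existing files keep their original block size):

<property>
  <name>dfs.blocksize</name>
  <value>268435456</value> <!-- 256 MB HDFS block size for new files -->
</property>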
8) Partition your data in the Big SQL database
In the existing environment, the table data loaded daily into the Big SQL database from Netezza arrives non-partitioned.
If we create partitions on the table, performance will improve, for example:
CREATE HADOOP TABLE LINEITEM (
  L_ORDERKEY BIGINT NOT NULL,
  L_TAX FLOAT NOT NULL,
  L_SHIPINSTRUCT VARCHAR(25) NOT NULL,
  L_SHIPMODE VARCHAR(10) NOT NULL,
  L_COMMENT VARCHAR(44) NOT NULL)
PARTITIONED BY (L_SHIPDATE VARCHAR(10)) -- partition column is declared here, not in the column list
STORED AS PARQUETFILE;
9) Rack Awareness
The environment was built with only 1 rack.
Hadoop recommends 2 racks: on the 2nd rack there should be at least 1 data node and the Secondary or Standby NameNode, so that a replica and a NameNode live outside the primary rack.
If the single rack itself fails, the result is complete data loss. With two racks we can tolerate a data node (or whole rack) failure, because the data will also be on a node in the other rack.
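A core-site.xml sketch enabling rack awareness; the script path is a hypothetical example and must point to an executable that maps each host or IP to a rack path such as /rack1:

<property>
  <name>net.topology.script.file.name</name>
  <!-- hypothetical path: the script takes host/IP arguments and prints the rack, e.g. /rack1 -->
  <value>/etc/hadoop/conf/topology.sh</value>
</property>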
10) MAP task compression
For jobs running in Hadoop, compression is not enabled on the MAP output, so the reduce tasks must fetch uncompressed MAP output during the shuffle.
If we compress the MAP output, performance will improve.
Set the values as below; the existing setting in mapred-site.xml leaves map output compression off.
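A mapred-site.xml sketch (MRv2 property names; the choice of Snappy is an assumption, picked for its low CPU overhead):

<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value> <!-- compress intermediate map output before the shuffle -->
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value> <!-- assumed codec -->
</property>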
