Performance Analysis
Big Data Environment
Shivkumar Babshetty
Sr. Big Data Administrator
1) Replication factor
In the existing environment the replication factor is set to 3, with 5 data nodes and 3 namespace nodes: 1 primary NameNode and 1 secondary NameNode.
Data replication happens asynchronously; HDFS gives no guarantee on the minimum number of replicas unless one is configured.
Set dfs.replication.min=1 in hdfs-site.xml (currently not set). If we set it, at least 1 copy of the data must be on disk before a write is acknowledged; the remaining replicas follow asynchronously.
dfs.replication=3 is already set in the environment.
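A minimal hdfs-site.xml sketch of both settings (dfs.replication.min is the Hadoop 1.x name; on Hadoop 2.x+ it maps to dfs.namenode.replication.min):

<property>
  <name>dfs.replication</name>
  <value>3</value> <!-- replicas kept per block -->
</property>
<property>
  <name>dfs.replication.min</name>
  <value>1</value> <!-- a write succeeds once 1 replica is on disk -->
</property>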
2) YARN Recommendation
The YARN scheduler minimum allocation is 3584 MB and the maximum is 51200 MB; vcores range from 1 (minimum) to 32 (maximum).
The YARN scheduler increment-allocation values for vcores and MB are not set in the existing environment. These parameters round each container request up to an aligned increment, making sure containers get properly sized memory allocations.
Increase the minimum to 4 vcores, set the vcore increment to 2, and set the memory increment to 1792 MB, as sketched below.
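A yarn-site.xml sketch combining the existing min/max values with the recommended increments (the increment-allocation properties are honored by the Fair Scheduler):

<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>3584</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>51200</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-vcores</name>
  <value>4</value> <!-- recommended increase from 1 -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-vcores</name>
  <value>32</value>
</property>
<property>
  <name>yarn.scheduler.increment-allocation-mb</name>
  <value>1792</value> <!-- memory requests rounded up to multiples of 1792 MB -->
</property>
<property>
  <name>yarn.scheduler.increment-allocation-vcores</name>
  <value>2</value> <!-- vcore requests rounded up in steps of 2 -->
</property>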
3) MapReduce Application Master
The YARN application master memory and cores are not explicitly allocated. Allocating them in the existing environment can improve the performance of the AM for each job.
Set yarn.app.mapreduce.am.resource.cpu-vcores=1 and yarn.app.mapreduce.am.resource.mb=1024.
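A mapred-site.xml sketch of these ApplicationMaster settings:

<property>
  <name>yarn.app.mapreduce.am.resource.mb</name>
  <value>1024</value> <!-- memory for the MapReduce ApplicationMaster container -->
</property>
<property>
  <name>yarn.app.mapreduce.am.resource.cpu-vcores</name>
  <value>1</value> <!-- vcores for the ApplicationMaster container -->
</property>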
5) JVM reuse
The mapred.job.reuse.jvm.num.tasks property sets the maximum number of tasks from a single job that will be executed in a single JVM. The default value is 1.
If we set the value to -1 in mapred-site.xml, there is no limit on the number of tasks a JVM can run.
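A mapred-site.xml sketch (note this is an MRv1 property; MRv2 on YARN does not support JVM reuse across tasks):

<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value> <!-- -1 = unlimited tasks per JVM within one job -->
</property>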
6) Split size of MAP input
Raising the minimum split size of the MAP input will improve performance; if splits are small, the job spawns many short map tasks and the reduce phase carries the overhead of fetching many small map output files.
This recommendation assumes block_size=256 MB.
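A sketch of the corresponding mapred-site.xml setting (MRv2 property name; the older equivalent is mapred.min.split.size), assuming the 256 MB target:

<property>
  <name>mapreduce.input.fileinputformat.split.minsize</name>
  <value>268435456</value> <!-- 256 MB minimum input split -->
</property>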
7) Block Size
If we increase the block size from the existing 128 MB to 256 MB, job performance will improve: each MAP task gets a larger split, so fewer map tasks are created.
We run long jobs in the cluster for 4-6 hours on big input data, around 15 to 40 GB of compressed data.
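An hdfs-site.xml sketch for the 256 MB block size (it applies only to files written after the change; existing files keep their original block size):

<property>
  <name>dfs.blocksize</name>
  <value>268435456</value> <!-- 256 MB HDFS block size for new files -->
</property>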
8) Partition your data in the Big SQL database
In the existing environment, the table data loaded daily into the Big SQL database from Netezza arrives non-partitioned.
If we create partitions on the table, performance will improve, for example:
CREATE HADOOP TABLE LINEITEM (
  L_ORDERKEY BIGINT NOT NULL,
  L_TAX FLOAT NOT NULL,
  L_SHIPINSTRUCT VARCHAR(25) NOT NULL,
  L_SHIPMODE VARCHAR(10) NOT NULL,
  L_COMMENT VARCHAR(44) NOT NULL)
PARTITIONED BY (L_SHIPDATE VARCHAR(10)) -- partition column is declared here, not in the column list
STORED AS PARQUETFILE;
9) Rack Awareness
The environment was built with only 1 rack.
Hadoop recommends 2 racks: on the 2nd rack there should be at least 1 data node and the Secondary or Standby NameNode, so that a replica and a NameNode live outside the primary rack.
If the single rack itself fails, the result is complete data loss. With two racks we can tolerate a data node (or whole rack) failure, because the data will also be on a node in the other rack.
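A core-site.xml sketch enabling rack awareness; the script path is a hypothetical example and must point to an executable that maps each host or IP to a rack path such as /rack1:

<property>
  <name>net.topology.script.file.name</name>
  <!-- hypothetical path: the script takes host/IP arguments and prints the rack, e.g. /rack1 -->
  <value>/etc/hadoop/conf/topology.sh</value>
</property>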
10) MAP task compression
For jobs running in Hadoop, compression is not enabled on the MAP output, so the reduce tasks must fetch uncompressed MAP output during the shuffle.
If we compress the MAP output, performance will improve.
Set the values as below; the existing setting in mapred-site.xml leaves map output compression off.
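A mapred-site.xml sketch (MRv2 property names; the choice of Snappy is an assumption, picked for its low CPU overhead):

<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value> <!-- compress intermediate map output before the shuffle -->
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value> <!-- assumed codec -->
</property>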
