Sergey Kovalev (Altoros): Practical Steps to Improve Apache Hive Performance

Practical Steps to Improve Hive Query Performance
Sergey Kovalev
Software Engineer at Altoros

1. Use partitions whenever possible
/folder1/video_data/file1
id, title, channelId, description, uploadYear
1, title1, channelId1, description1, 2012
/folder1/video_data/2012/file1
SELECT * FROM video WHERE uploadYear='2013-04-08';

1. Use partitions whenever possible
create table video (
id STRING,
title STRING,
description STRING,
viewCount BIGINT
) PARTITIONED BY (uploadYear date)
STORED AS ORC;
insert into table video PARTITION (uploadYear) select * from video_external;
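The insert above relies on dynamic partitioning, which usually has to be switched on for the session. A minimal sketch follows, assuming video_external exposes the columns id, title, description, viewCount and uploadYear (the column list is an assumption; the partition column must come last in the SELECT):

```sql
-- Allow Hive to derive partition values from the data being inserted.
SET hive.exec.dynamic.partition=true;
-- nonstrict mode lets every partition column be dynamic.
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The last selected column feeds the partition column uploadYear.
INSERT INTO TABLE video PARTITION (uploadYear)
SELECT id, title, description, viewCount, uploadYear
FROM video_external;

-- A filter on the partition column lets Hive read only the matching
-- partition directory instead of scanning the whole table.
SELECT id, title
FROM video
WHERE uploadYear = '2013-04-08';
```

Listing the columns explicitly instead of using select * makes the column-to-partition mapping visible and guards against a staging table whose column order differs.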

2. Use bucketing
create table video (
id STRING,
channelId STRING,
title STRING,
description STRING
) CLUSTERED BY(channelId)
INTO 2 BUCKETS
STORED AS ORC;
create table channel (
id STRING,
title STRING,
description STRING,
viewCount BIGINT
) CLUSTERED BY(id)
INTO 2 BUCKETS
STORED AS ORC;
SELECT v.title FROM video v JOIN channel ch ON v.channelId = ch.id WHERE
ch.viewCount>1000;
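Bucketed tables only help the join if the data was written through Hive, so that rows are hashed into the declared number of buckets. A hedged sketch of loading both tables; channel_external is a hypothetical staging table, and video_external is assumed to carry the columns shown in the sample data:

```sql
-- Hive 1.x only: make inserts honour the declared bucket count.
-- Hive 2.0+ always enforces bucketing and dropped this property,
-- so omit the line there.
SET hive.enforce.bucketing=true;

-- Rows are distributed into 2 bucket files by hash(id).
INSERT OVERWRITE TABLE channel
SELECT id, title, description, viewCount
FROM channel_external;

-- Rows are distributed into 2 bucket files by hash(channelId), so a
-- given channel id lands in the same bucket number in both tables.
INSERT OVERWRITE TABLE video
SELECT id, channelId, title, description
FROM video_external;
```

Because both tables use the same bucket count on the join key, Hive can match bucket to bucket instead of shuffling every row.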

2. Use bucketing
id, title, channelId, description, uploadYear

2. Use bucketing
/folder1/channel_data/file1
id, title, description, viewCount
channelId1, title1, description1, viewCount1

3. Partitions + bucketing
create table video (
id STRING,
channelId STRING,
title STRING,
description STRING,
viewCount BIGINT
) PARTITIONED BY (uploadYear date)
CLUSTERED BY(channelId)
INTO 2 BUCKETS
STORED AS ORC;

3. Partitions + bucketing
id, title, channelId, viewCount, uploadYear
1, title1, channelId1, viewCount1, 2012
2, title2, channelId2, viewCount2, 2012
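A query can benefit from both layouts at once: the partition filter limits which directories are read, and the bucketed join key limits how much data each join task handles. A sketch under the assumption that the channel table from the bucketing slide exists; the date literal is a hypothetical partition value:

```sql
SELECT v.title
FROM video v
JOIN channel ch ON v.channelId = ch.id
WHERE v.uploadYear = '2012-01-01'  -- hypothetical value; prunes to one partition
  AND ch.viewCount > 1000;
```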

4. Use join optimizations
Shuffle join/Common join:

Map-side join:

Sort-merge-bucket (SMB) join:
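The three join strategies are mostly selected through session settings. The following is a sketch of settings commonly used for this, not a tuned configuration; the size threshold is only an illustrative value:

```sql
-- Shuffle (common) join is the fallback and needs no setting:
-- both inputs are shuffled to reducers by join key.

-- Map-side join: let Hive convert a join automatically when one
-- side is small enough to be loaded into memory on every mapper.
SET hive.auto.convert.join=true;
-- Size in bytes below which a table counts as "small" (illustrative).
SET hive.mapjoin.smalltable.filesize=25000000;

-- Sort-merge-bucket (SMB) join: both tables must be bucketed AND
-- sorted on the join key, with compatible bucket counts.
SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;
SET hive.auto.convert.sortmerge.join=true;
```

For an SMB join the table definitions would also need a SORTED BY clause on the join key next to CLUSTERED BY, which the bucketed tables on the earlier slides do not declare.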

5. Choose the right input format
Row data vs. column store
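As an illustration of moving from a row-oriented text table to a columnar format, the sketch below copies a staging table into ORC. video_orc is a hypothetical table name, and the compression codec is an assumption rather than a recommendation from the slides:

```sql
-- Create a columnar copy of the raw data in one statement (CTAS).
CREATE TABLE video_orc
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY")
AS
SELECT * FROM video_external;

-- A query that touches only a few columns reads only those column
-- streams from the ORC files.
SELECT title FROM video_orc;
```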

6. Other optimizations
Avoid highly normalized table structures
Compress map/reduce output
For map output compression, execute SET mapred.compress.map.output=true;
For job output compression, execute SET mapred.output.compress=true;
Use parallel execution
SET hive.exec.parallel=true;
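Put together, the compression and parallelism settings could look like the session preamble below. This is a sketch: the Hive-level properties and the newer Hadoop property names are shown as alternatives to the mapred.* names above, and the thread count is only an example value.

```sql
-- Compress data passed between the stages of a multi-stage query.
SET hive.exec.compress.intermediate=true;
-- Compress the final query output.
SET hive.exec.compress.output=true;

-- Hadoop 2 names for the two mapred.* properties on this slide.
SET mapreduce.map.output.compress=true;
SET mapreduce.output.fileoutputformat.compress=true;

-- Run independent stages of one query at the same time.
SET hive.exec.parallel=true;
-- Upper bound on stages running concurrently (example value).
SET hive.exec.parallel.thread.number=8;
```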

7. Use the 'explain' keyword to improve the query execution plan
EXPLAIN query...
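For example, the join from the bucketing slide can be inspected as follows (a sketch; the exact plan text depends on the Hive version and execution engine):

```sql
-- Print the stage graph and the operator tree for each stage.
EXPLAIN
SELECT v.title
FROM video v
JOIN channel ch ON v.channelId = ch.id
WHERE ch.viewCount > 1000;

-- EXTENDED adds detail such as input paths and serialization info.
EXPLAIN EXTENDED
SELECT v.title
FROM video v
JOIN channel ch ON v.channelId = ch.id
WHERE ch.viewCount > 1000;
```

Comparing the plan before and after changing a setting is a quick way to check whether, for instance, the join runs on the map side or in a reduce stage, and how many stages the query needs.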

8. Stinger Initiative
Use cost-based optimization
Use vectorization
Transactions with ACID semantics
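Cost-based optimization and vectorization are enabled per session and depend on table statistics being present. A minimal sketch, using the channel table from the earlier slides:

```sql
-- Cost-based optimizer: needs statistics to make useful choices.
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;
SET hive.stats.fetch.partition.stats=true;

-- Gather table-level and column-level statistics.
ANALYZE TABLE channel COMPUTE STATISTICS;
ANALYZE TABLE channel COMPUTE STATISTICS FOR COLUMNS;

-- Vectorized execution processes rows in batches instead of one at
-- a time; it works with columnar formats such as ORC.
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;
```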

8. Sub-Second Queries with Hive LLAP
A new approach based on a hybrid execution engine that combines Tez with a new component called LLAP (Live Long and Process)
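On a cluster where LLAP daemons are already running (Hive 2.0 or later), routing a session's queries through them is a matter of session settings. The values below are a sketch under that assumption, not a verified setup for any particular distribution:

```sql
-- LLAP runs on top of the Tez execution engine.
SET hive.execution.engine=tez;
-- Send all eligible query fragments to the LLAP daemons.
SET hive.llap.execution.mode=all;
```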
