Practical Steps to Improve Hive Query Performance
Sergey Kovalev
Software Engineer at Altoros
How Hive works
1. Use partitions whenever possible
/folder1/video_data/file1
id, title, channelId, description, uploadYear
1, title1, channelId1, description1, 2012
2, title2, channelId2, description2, 2012
3, title3, channelId3, description3, 2013
4, title4, channelId4, description4, 2013
/folder1/video_data/2012/file1
1, title1, channelId1, description1, 2012
2, title2, channelId2, description2, 2012
/folder1/video_data/2013/file1
3, title3, channelId3, description3, 2013
4, title4, channelId4, description4, 2013
SELECT * FROM video WHERE uploadYear = '2013-04-08';
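The effect of the partitioned layout can be sketched in a few lines of Python. This is only a simulation under stated assumptions: the `partitions` dictionary and the `scan` helper are hypothetical stand-ins for the directory layout shown above, not Hive internals. It shows that a filter on the partition column lets the reader skip every directory that cannot contain matching rows.

```python
# Simulated partition layout: directory path -> rows stored in that directory.
# Paths and rows mirror the slide example; this is not how Hive keeps metadata.
partitions = {
    "/folder1/video_data/2012": [
        (1, "title1", "channelId1", "description1"),
        (2, "title2", "channelId2", "description2"),
    ],
    "/folder1/video_data/2013": [
        (3, "title3", "channelId3", "description3"),
        (4, "title4", "channelId4", "description4"),
    ],
}


def scan(upload_year):
    """Return (directories read, rows found) for a filter on the partition column."""
    # Partition pruning: pick directories by name, without opening the others.
    wanted = [path for path in partitions if path.endswith("/" + str(upload_year))]
    rows = [row for path in wanted for row in partitions[path]]
    return wanted, rows


dirs_read, rows = scan(2013)
print(dirs_read)
print([row[0] for row in rows])
```

Without partitions, the same filter would have to read every row of the single file and discard the ones that do not match.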
create table video (
id STRING,
title STRING,
description STRING,
viewCount BIGINT
) PARTITIONED BY (uploadYear date)
STORED AS ORC;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE video PARTITION (uploadYear) SELECT * FROM video_external;
2. Use bucketing
create table video (
id STRING,
channelId STRING,
title STRING,
description STRING
) CLUSTERED BY(channelId)
INTO 2 BUCKETS
STORED AS ORC;
create table channel (
id STRING,
title STRING,
description STRING,
viewCount BIGINT
) CLUSTERED BY(id)
INTO 2 BUCKETS
STORED AS ORC;
SELECT v.title FROM video v JOIN channel ch ON v.channelId = ch.id
WHERE ch.viewCount > 1000;
/folder1/video_data/file1
id, title, channelId, description, uploadYear
1, title1, channelId1, description1, 2012
2, title2, channelId2, description2, 2012
3, title3, channelId3, description3, 2012
4, title4, channelId4, description4, 2012
5, title5, channelId5, description5, 2013
6, title6, channelId6, description6, 2013
7, title7, channelId7, description7, 2013
8, title8, channelId8, description8, 2013
/folder1/video_data/file1
2, title2, channelId2, description2, 2012
4, title4, channelId4, description4, 2012
6, title6, channelId6, description6, 2013
8, title8, channelId8, description8, 2013
/folder1/video_data/file2
1, title1, channelId1, description1, 2012
3, title3, channelId3, description3, 2012
5, title5, channelId5, description5, 2013
7, title7, channelId7, description7, 2013
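How rows end up in one bucket file or the other can be sketched as a hash of the clustering column taken modulo the number of buckets. The Python below is a minimal illustration, not Hive's implementation: `java_style_hash` and `bucket_for` are hypothetical helpers, and the hash function Hive really applies depends on the Hive version and the column type.

```python
def java_style_hash(text):
    """Deterministic 32-bit string hash (31*h + c), used here as a stand-in."""
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h


def bucket_for(key, num_buckets):
    """Map a clustering-column value to a bucket index in [0, num_buckets)."""
    return (java_style_hash(key) & 0x7FFFFFFF) % num_buckets


# Eight videos, one per channel, as in the layout above.
videos = [(i, f"title{i}", f"channelId{i}") for i in range(1, 9)]

# Two buckets, matching INTO 2 BUCKETS; each list holds the video ids it receives.
buckets = {0: [], 1: []}
for video_id, _title, channel_id in videos:
    buckets[bucket_for(channel_id, 2)].append(video_id)

print(buckets)
```

Because the channel table is bucketed on the same key with the same bucket count, a given channelId lands in the same bucket index in both tables, which is what lets a join compare bucket against bucket instead of table against table.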
/folder1/channel_data/file1
id, title, description, viewCount
channelId1, title1, description1, viewCount1
channelId2, title2, description2, viewCount2
channelId3, title3, description3, viewCount3
channelId4, title4, description4, viewCount4
channelId5, title5, description5, viewCount5
channelId6, title6, description6, viewCount6
channelId7, title7, description7, viewCount7
channelId8, title8, description8, viewCount8
/folder1/channel_data/file1
channelId2, title2, description2, viewCount2
channelId4, title4, description4, viewCount4
channelId6, title6, description6, viewCount6
channelId8, title8, description8, viewCount8
/folder1/channel_data/file2
channelId1, title1, description1, viewCount1
channelId3, title3, description3, viewCount3
channelId5, title5, description5, viewCount5
channelId7, title7, description7, viewCount7
3. Partitions + bucketing
create table video (
id STRING,
channelId STRING,
title STRING,
description STRING,
viewCount BIGINT
) PARTITIONED BY (uploadYear date)
CLUSTERED BY(channelId)
INTO 2 BUCKETS
STORED AS ORC;
/folder1/video_data/file1
id, title, channelId, viewCount, uploadYear
1, title1, channelId1, viewCount1, 2012
2, title2, channelId2, viewCount2, 2012
3, title3, channelId3, viewCount3, 2012
4, title4, channelId4, viewCount4, 2012
5, title5, channelId5, viewCount5, 2013
6, title6, channelId6, viewCount6, 2013
7, title7, channelId7, viewCount7, 2013
8, title8, channelId8, viewCount8, 2013
/folder1/video_data/2012/file1
2, title2, channelId2, viewCount2, 2012
4, title4, channelId4, viewCount4, 2012
/folder1/video_data/2012/file2
1, title1, channelId1, viewCount1, 2012
3, title3, channelId3, viewCount3, 2012
/folder1/video_data/2013/file1
6, title6, channelId6, viewCount6, 2013
8, title8, channelId8, viewCount8, 2013
/folder1/video_data/2013/file2
5, title5, channelId5, viewCount5, 2013
7, title7, channelId7, viewCount7, 2013
4. Use join optimizations
Shuffle join/Common join:
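A shuffle (common) join can be pictured as three phases: mappers tag each row with its source table and emit it under the join key, the shuffle groups rows by key, and reducers combine the two sides. The sketch below simulates this in plain Python with made-up sample rows; it illustrates the idea and is not what Hive executes.

```python
from collections import defaultdict

# Hypothetical sample data: (video id, channelId, title) and (channel id, viewCount).
videos = [
    ("v1", "channelId1", "title1"),
    ("v2", "channelId2", "title2"),
    ("v3", "channelId1", "title3"),
]
channels = [("channelId1", 5000), ("channelId2", 10)]

# Map phase: emit (join key, tagged row) for both tables.
mapped = [(v[1], ("video", v)) for v in videos]
mapped += [(c[0], ("channel", c)) for c in channels]

# Shuffle phase: all rows with the same join key are brought together.
shuffled = defaultdict(list)
for key, tagged_row in mapped:
    shuffled[key].append(tagged_row)

# Reduce phase: pair video rows with channel rows per key, then apply the filter.
joined = []
for key in sorted(shuffled):
    video_rows = [row for tag, row in shuffled[key] if tag == "video"]
    channel_rows = [row for tag, row in shuffled[key] if tag == "channel"]
    for video in video_rows:
        for channel in channel_rows:
            if channel[1] > 1000:
                joined.append(video[2])

print(joined)
```

The expensive part in a real cluster is the shuffle, since every row of both tables is moved across the network to the reducer responsible for its key.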
Map-side join:
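A map-side join avoids the shuffle when one table is small enough to fit in memory: the small table is loaded into a hash table, and each mapper probes it while streaming through the large table. The following Python sketch uses hypothetical sample rows and a hypothetical `map_side_join` helper to show the shape of the technique.

```python
# Hypothetical sample data: the small table (channels) and the large one (videos).
channels = [("channelId1", 5000), ("channelId2", 10)]
videos = [
    ("v1", "channelId1", "title1"),
    ("v2", "channelId2", "title2"),
    ("v3", "channelId1", "title3"),
    ("v4", "channelId9", "title4"),  # no matching channel
]

# Build step: load the small table into an in-memory hash table keyed by join key.
channel_views = {channel_id: views for channel_id, views in channels}


def map_side_join(video_rows, min_views):
    """Stream the large table and probe the hash table; no shuffle or reducer."""
    for _video_id, channel_id, title in video_rows:
        views = channel_views.get(channel_id)
        if views is not None and views > min_views:
            yield title


titles = list(map_side_join(videos, 1000))
print(titles)
```

The trade-off is memory: the approach only works while the smaller table fits in each mapper's memory.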
Sort-merge-bucket (SMB) join:
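When both tables are bucketed on the join key with the same number of buckets, and each bucket is sorted by that key, matching buckets can be joined with a single merge pass. The sketch below shows that merge step for one pair of buckets; `merge_join` and the sample rows are hypothetical and only illustrate the technique.

```python
def merge_join(left, right):
    """Join two lists of (key, value) pairs that are already sorted by key."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Emit every right-side row with this key, then advance the left side.
            k = j
            while k < len(right) and right[k][0] == left_key:
                result.append((left_key, left[i][1], right[k][1]))
                k += 1
            i += 1
    return result


# One bucket from each table, both sorted by channel id.
video_bucket = [
    ("channelId2", "title2"),
    ("channelId4", "title4"),
    ("channelId4", "title4b"),
]
channel_bucket = [("channelId2", 10), ("channelId4", 5000), ("channelId6", 7)]

joined = merge_join(video_bucket, channel_bucket)
print(joined)
```

Neither side has to be held in memory as a hash table, which is why this strategy suits joins between two large tables.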
5. Choose the right input format
Row data vs. column store
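The difference between a row layout and a column store can be shown with a small Python sketch. The sample records are made up, and real columnar formats such as ORC add encoding, compression and indexes that are not modeled here.

```python
# Row layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "title": "title1", "viewCount": 100},
    {"id": 2, "title": "title2", "viewCount": 2500},
    {"id": 3, "title": "title3", "viewCount": 40},
]

# Column layout: one contiguous list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Aggregating one field in the row layout walks through every full record.
row_total = sum(row["viewCount"] for row in rows)

# The same aggregate in the column layout reads a single column only.
column_total = sum(columns["viewCount"])

print(row_total, column_total)
```

Queries that touch only a few columns of a wide table benefit most from the column layout, because the remaining columns are never read.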
6. Other optimizations
Avoid highly normalized table structures
Compress map/reduce output
For map output compression, run: SET mapred.compress.map.output=true;
For job output compression, run: SET mapred.output.compress=true;
Use parallel execution
SET hive.exec.parallel=true;
7. Use the 'explain' keyword to improve the query
execution plan
EXPLAIN query...
8. Stinger Initiative
Use cost-based optimization
Use vectorization
Transactions with ACID semantics
8. Hive on Tez
8. Sub-Second Queries with Hive LLAP
A new approach that uses a hybrid engine, combining Tez with a new component called LLAP (Live Long and Process)
Questions?
