SlideShare a Scribd company logo
1 of 13
Page1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Hive: Data Organization for Performance
Gopal Vijayaraghavan
Page2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
In this episode
All BigData problems are primarily lookup problems
All Lookup problems are really Storage problems
All Storage problems turn into ETL problems
ETL problems are all about the Data
Data navigation?
Data organization?
Data ingestion?
It’s Big?
Page3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Good idea: Do things that scale!
There are many problems like this, but this one is mine
Page4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Partitions
If you have a database on cars and you
partition on VIN#
If you have a database on sales and you
partition on customer_id
Rule of thumb: Average partition is
>=1Gb and total # of partitions per
query <1000
Page5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Buckets
If you have more files than rows, you’ve
definitely got bucketing wrong
“clustered by” != “cluster by”
Bucketing on a skewed column slows
down ETL a *lot* (for no win)
If you have partitions, sort-merge
bucket-mapjoin can be slower than a
shuffle!!
Page6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Buckets - II
Histograms!
select explode(histogram_numeric(
hash(<col>)% <n-bucket>, <n-bucket>
)) as h from table;
The Curse of 31 & the last byte
If you have buckets & partitions, always
remember to ETL with
set hive.optimize.sort.dynamic.partition=true;
Page7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Denormalization
Denormalization can turn a compute
problem into an IO/lookup problem.
But if you then optimize that with
compression, you get a compute
problem again.
If you think JOINs are bad, you
probably haven’t moved out of
MapReduce.
Broadcast joins are good & dynamically
partitioned broadcast joins can scale
that ~1000x
Page8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Indexes?
Indexes in hive barely help in a
columnar world – incremental rebuild
isn’t really there
ORC maintains internal bloom filter
indexes (PARQUET-41 too)
You can store your indexes as ORC files,
if you want, so that you can have an
index in your index, to speedup indexes
Page9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Schema & Predicate Push Down
Never store a Number as a string,
because guess what “11” < “9” and
“11.0” != “11” – transform, then load
Predicate push-down cannot fight the
type system (♫ … breaking blocks in the
hot sun ♫)
UDFs applied on the data column is
always a bad idea for fast filtering.
If you need case-insensitive lookups,
always store as UPPER/lower.
If you need LIKE “%.twimg.com”, store
like DNS does “.com.twimg…”
Page10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Temporary tables
In-memory temp-tables
set hive.exec.temporary.table.storage=memory;
Easiest way to reorganize data
temporarily or to produce a “distinct
slice”
“create temporary table if not exists stored as
orc as select …”
Can be used for pagination queries to
good effect, for display tools
Page11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Complex types & Nesting
There’s pretty much no advantage to
using structs – they’re nearly
columns, without any of the good
stuff
Maps – not so bad, but handle with
care
Maps are way better than 4000
columns, most of them null
Arrays – ignore mostly
(JDBC!!)
Page12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Schema Evolution
Add columns, never remove them
Schemas are per-partition
Remember, the partitions don’t change their
schema after they’re created
All new inserts have new schema
After schema update, inserting data into old
partitions is a recipe for disaster
Type changes for a column also complicate
things (except for simple stuff like Int ->
BigInt or Float -> Double)
Page13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Questions?
?

More Related Content

What's hot

Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive QueriesOwen O'Malley
 
Major advancements in Apache Hive towards full support of SQL compliance
Major advancements in Apache Hive towards full support of SQL complianceMajor advancements in Apache Hive towards full support of SQL compliance
Major advancements in Apache Hive towards full support of SQL complianceDataWorks Summit/Hadoop Summit
 
Tune up Yarn and Hive
Tune up Yarn and HiveTune up Yarn and Hive
Tune up Yarn and Hiverxu
 
Hive acid and_2.x new_features
Hive acid and_2.x new_featuresHive acid and_2.x new_features
Hive acid and_2.x new_featuresAlberto Romero
 
LLAP Nov Meetup
LLAP Nov MeetupLLAP Nov Meetup
LLAP Nov Meetupt3rmin4t0r
 
LLAP: long-lived execution in Hive
LLAP: long-lived execution in HiveLLAP: long-lived execution in Hive
LLAP: long-lived execution in HiveDataWorks Summit
 
Hive Data Modeling and Query Optimization
Hive Data Modeling and Query OptimizationHive Data Modeling and Query Optimization
Hive Data Modeling and Query OptimizationEyad Garelnabi
 
Hive acid-updates-summit-sjc-2014
Hive acid-updates-summit-sjc-2014Hive acid-updates-summit-sjc-2014
Hive acid-updates-summit-sjc-2014alanfgates
 
Sub-second-sql-on-hadoop-at-scale
Sub-second-sql-on-hadoop-at-scaleSub-second-sql-on-hadoop-at-scale
Sub-second-sql-on-hadoop-at-scaleYifeng Jiang
 
Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for ...
Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for ...Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for ...
Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for ...DataWorks Summit
 
Hive & HBase for Transaction Processing Hadoop Summit EU Apr 2015
Hive & HBase for Transaction Processing Hadoop Summit EU Apr 2015Hive & HBase for Transaction Processing Hadoop Summit EU Apr 2015
Hive & HBase for Transaction Processing Hadoop Summit EU Apr 2015alanfgates
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHortonworks
 

What's hot (20)

Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
 
Major advancements in Apache Hive towards full support of SQL compliance
Major advancements in Apache Hive towards full support of SQL complianceMajor advancements in Apache Hive towards full support of SQL compliance
Major advancements in Apache Hive towards full support of SQL compliance
 
Tune up Yarn and Hive
Tune up Yarn and HiveTune up Yarn and Hive
Tune up Yarn and Hive
 
Hive acid and_2.x new_features
Hive acid and_2.x new_featuresHive acid and_2.x new_features
Hive acid and_2.x new_features
 
LLAP Nov Meetup
LLAP Nov MeetupLLAP Nov Meetup
LLAP Nov Meetup
 
Apache Hive ACID Project
Apache Hive ACID ProjectApache Hive ACID Project
Apache Hive ACID Project
 
LLAP: long-lived execution in Hive
LLAP: long-lived execution in HiveLLAP: long-lived execution in Hive
LLAP: long-lived execution in Hive
 
Hive Does ACID
Hive Does ACIDHive Does ACID
Hive Does ACID
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
Hive Data Modeling and Query Optimization
Hive Data Modeling and Query OptimizationHive Data Modeling and Query Optimization
Hive Data Modeling and Query Optimization
 
Hive acid-updates-summit-sjc-2014
Hive acid-updates-summit-sjc-2014Hive acid-updates-summit-sjc-2014
Hive acid-updates-summit-sjc-2014
 
ORC 2015
ORC 2015ORC 2015
ORC 2015
 
HiveACIDPublic
HiveACIDPublicHiveACIDPublic
HiveACIDPublic
 
Sub-second-sql-on-hadoop-at-scale
Sub-second-sql-on-hadoop-at-scaleSub-second-sql-on-hadoop-at-scale
Sub-second-sql-on-hadoop-at-scale
 
Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for ...
Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for ...Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for ...
Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for ...
 
ORC 2015: Faster, Better, Smaller
ORC 2015: Faster, Better, SmallerORC 2015: Faster, Better, Smaller
ORC 2015: Faster, Better, Smaller
 
Hive & HBase for Transaction Processing Hadoop Summit EU Apr 2015
Hive & HBase for Transaction Processing Hadoop Summit EU Apr 2015Hive & HBase for Transaction Processing Hadoop Summit EU Apr 2015
Hive & HBase for Transaction Processing Hadoop Summit EU Apr 2015
 
Apache Hive on ACID
Apache Hive on ACIDApache Hive on ACID
Apache Hive on ACID
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it final
 
Achieving 100k Queries per Hour on Hive on Tez
Achieving 100k Queries per Hour on Hive on TezAchieving 100k Queries per Hour on Hive on Tez
Achieving 100k Queries per Hour on Hive on Tez
 

Viewers also liked

Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep divet3rmin4t0r
 
Using Apache Hive with High Performance
Using Apache Hive with High PerformanceUsing Apache Hive with High Performance
Using Apache Hive with High PerformanceInderaj (Raj) Bains
 
Hortonworks Technical Workshop: Interactive Query with Apache Hive
Hortonworks Technical Workshop: Interactive Query with Apache Hive Hortonworks Technical Workshop: Interactive Query with Apache Hive
Hortonworks Technical Workshop: Interactive Query with Apache Hive Hortonworks
 
Сергей Ковалёв (Altoros): Practical Steps to Improve Apache Hive Performance
Сергей Ковалёв (Altoros): Practical Steps to Improve Apache Hive PerformanceСергей Ковалёв (Altoros): Practical Steps to Improve Apache Hive Performance
Сергей Ковалёв (Altoros): Practical Steps to Improve Apache Hive PerformanceOlga Lavrentieva
 
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for productionFaster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for productionCloudera, Inc.
 
Hive Demo Paper at VLDB 2009
Hive Demo Paper at VLDB 2009Hive Demo Paper at VLDB 2009
Hive Demo Paper at VLDB 2009Namit Jain
 
ORC File & Vectorization - Improving Hive Data Storage and Query Performance
ORC File & Vectorization - Improving Hive Data Storage and Query PerformanceORC File & Vectorization - Improving Hive Data Storage and Query Performance
ORC File & Vectorization - Improving Hive Data Storage and Query PerformanceDataWorks Summit
 
Advanced Analytics using Apache Hive
Advanced Analytics using Apache HiveAdvanced Analytics using Apache Hive
Advanced Analytics using Apache HiveMurtaza Doctor
 
Analytical Queries with Hive: SQL Windowing and Table Functions
Analytical Queries with Hive: SQL Windowing and Table FunctionsAnalytical Queries with Hive: SQL Windowing and Table Functions
Analytical Queries with Hive: SQL Windowing and Table FunctionsDataWorks Summit
 
Methods of organizing data
Methods of organizing dataMethods of organizing data
Methods of organizing dataRoxane La'O
 
frequency distribution table
frequency distribution tablefrequency distribution table
frequency distribution tableMonie Ali
 
Presentation of Data and Frequency Distribution
Presentation of Data and Frequency DistributionPresentation of Data and Frequency Distribution
Presentation of Data and Frequency DistributionElain Cruz
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...DataWorks Summit/Hadoop Summit
 
Frequency Distributions and Graphs
Frequency Distributions and GraphsFrequency Distributions and Graphs
Frequency Distributions and Graphsmonritche
 

Viewers also liked (20)

Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
 
Using Apache Hive with High Performance
Using Apache Hive with High PerformanceUsing Apache Hive with High Performance
Using Apache Hive with High Performance
 
Hive tuning
Hive tuningHive tuning
Hive tuning
 
Apache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, ScaleApache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, Scale
 
Hortonworks Technical Workshop: Interactive Query with Apache Hive
Hortonworks Technical Workshop: Interactive Query with Apache Hive Hortonworks Technical Workshop: Interactive Query with Apache Hive
Hortonworks Technical Workshop: Interactive Query with Apache Hive
 
Сергей Ковалёв (Altoros): Practical Steps to Improve Apache Hive Performance
Сергей Ковалёв (Altoros): Practical Steps to Improve Apache Hive PerformanceСергей Ковалёв (Altoros): Practical Steps to Improve Apache Hive Performance
Сергей Ковалёв (Altoros): Practical Steps to Improve Apache Hive Performance
 
What's new in Apache Hive
What's new in Apache HiveWhat's new in Apache Hive
What's new in Apache Hive
 
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for productionFaster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production
 
Hive Demo Paper at VLDB 2009
Hive Demo Paper at VLDB 2009Hive Demo Paper at VLDB 2009
Hive Demo Paper at VLDB 2009
 
ORC File & Vectorization - Improving Hive Data Storage and Query Performance
ORC File & Vectorization - Improving Hive Data Storage and Query PerformanceORC File & Vectorization - Improving Hive Data Storage and Query Performance
ORC File & Vectorization - Improving Hive Data Storage and Query Performance
 
Advanced Analytics using Apache Hive
Advanced Analytics using Apache HiveAdvanced Analytics using Apache Hive
Advanced Analytics using Apache Hive
 
Analytical Queries with Hive: SQL Windowing and Table Functions
Analytical Queries with Hive: SQL Windowing and Table FunctionsAnalytical Queries with Hive: SQL Windowing and Table Functions
Analytical Queries with Hive: SQL Windowing and Table Functions
 
Methods Of Organization
Methods Of OrganizationMethods Of Organization
Methods Of Organization
 
Methods of organizing data
Methods of organizing dataMethods of organizing data
Methods of organizing data
 
frequency distribution table
frequency distribution tablefrequency distribution table
frequency distribution table
 
Frequency distribution
Frequency distributionFrequency distribution
Frequency distribution
 
Presentation of Data and Frequency Distribution
Presentation of Data and Frequency DistributionPresentation of Data and Frequency Distribution
Presentation of Data and Frequency Distribution
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...
 
Frequency Distributions and Graphs
Frequency Distributions and GraphsFrequency Distributions and Graphs
Frequency Distributions and Graphs
 

Similar to Data organization: hive meetup

Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYCScalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYCCal Henderson
 
Hadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitHadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitDataWorks Summit
 
Making MySQL Great For Business Intelligence
Making MySQL Great For Business IntelligenceMaking MySQL Great For Business Intelligence
Making MySQL Great For Business IntelligenceCalpont
 
Building modern data lakes
Building modern data lakes Building modern data lakes
Building modern data lakes Minio
 
SQL Server In-Memory OLTP introduction (Hekaton)
SQL Server In-Memory OLTP introduction (Hekaton)SQL Server In-Memory OLTP introduction (Hekaton)
SQL Server In-Memory OLTP introduction (Hekaton)Shy Engelberg
 
Web20expo Scalable Web Arch
Web20expo Scalable Web ArchWeb20expo Scalable Web Arch
Web20expo Scalable Web Archguest18a0f1
 
Web20expo Scalable Web Arch
Web20expo Scalable Web ArchWeb20expo Scalable Web Arch
Web20expo Scalable Web Archmclee
 
Web20expo Scalable Web Arch
Web20expo Scalable Web ArchWeb20expo Scalable Web Arch
Web20expo Scalable Web Archroyans
 
ActiveWarehouse/ETL - BI & DW for Ruby/Rails
ActiveWarehouse/ETL - BI & DW for Ruby/RailsActiveWarehouse/ETL - BI & DW for Ruby/Rails
ActiveWarehouse/ETL - BI & DW for Ruby/RailsPaul Gallagher
 
Front Range PHP NoSQL Databases
Front Range PHP NoSQL DatabasesFront Range PHP NoSQL Databases
Front Range PHP NoSQL DatabasesJon Meredith
 
Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse
Apache Phoenix and Apache HBase: An Enterprise Grade Data WarehouseApache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse
Apache Phoenix and Apache HBase: An Enterprise Grade Data WarehouseJosh Elser
 
Deep learning on yarn running distributed tensorflow etc on hadoop cluster v3
Deep learning on yarn  running distributed tensorflow etc on hadoop cluster v3Deep learning on yarn  running distributed tensorflow etc on hadoop cluster v3
Deep learning on yarn running distributed tensorflow etc on hadoop cluster v3DataWorks Summit
 
The Computer Science Behind a modern Distributed Database
The Computer Science Behind a modern Distributed DatabaseThe Computer Science Behind a modern Distributed Database
The Computer Science Behind a modern Distributed DatabaseArangoDB Database
 
Hive ACID Apache BigData 2016
Hive ACID Apache BigData 2016Hive ACID Apache BigData 2016
Hive ACID Apache BigData 2016alanfgates
 
Apache Hive on ACID
Apache Hive on ACIDApache Hive on ACID
Apache Hive on ACIDHortonworks
 
Bhupeshbansal bigdata
Bhupeshbansal bigdata Bhupeshbansal bigdata
Bhupeshbansal bigdata Bhupesh Bansal
 
SAS Programming.ppt
SAS Programming.pptSAS Programming.ppt
SAS Programming.pptssuser660bb1
 
In-Place analytics with Unified Data Access
In-Place analytics with Unified Data AccessIn-Place analytics with Unified Data Access
In-Place analytics with Unified Data AccessDataWorks Summit
 
World-class Data Engineering with Amazon Redshift
World-class Data Engineering with Amazon RedshiftWorld-class Data Engineering with Amazon Redshift
World-class Data Engineering with Amazon RedshiftLars Kamp
 

Similar to Data organization: hive meetup (20)

Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYCScalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC
 
Hadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitHadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop Summit
 
Making MySQL Great For Business Intelligence
Making MySQL Great For Business IntelligenceMaking MySQL Great For Business Intelligence
Making MySQL Great For Business Intelligence
 
Building modern data lakes
Building modern data lakes Building modern data lakes
Building modern data lakes
 
SQL Server In-Memory OLTP introduction (Hekaton)
SQL Server In-Memory OLTP introduction (Hekaton)SQL Server In-Memory OLTP introduction (Hekaton)
SQL Server In-Memory OLTP introduction (Hekaton)
 
Web20expo Scalable Web Arch
Web20expo Scalable Web ArchWeb20expo Scalable Web Arch
Web20expo Scalable Web Arch
 
Web20expo Scalable Web Arch
Web20expo Scalable Web ArchWeb20expo Scalable Web Arch
Web20expo Scalable Web Arch
 
Web20expo Scalable Web Arch
Web20expo Scalable Web ArchWeb20expo Scalable Web Arch
Web20expo Scalable Web Arch
 
ActiveWarehouse/ETL - BI & DW for Ruby/Rails
ActiveWarehouse/ETL - BI & DW for Ruby/RailsActiveWarehouse/ETL - BI & DW for Ruby/Rails
ActiveWarehouse/ETL - BI & DW for Ruby/Rails
 
Front Range PHP NoSQL Databases
Front Range PHP NoSQL DatabasesFront Range PHP NoSQL Databases
Front Range PHP NoSQL Databases
 
Apache Phoenix + Apache HBase
Apache Phoenix + Apache HBaseApache Phoenix + Apache HBase
Apache Phoenix + Apache HBase
 
Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse
Apache Phoenix and Apache HBase: An Enterprise Grade Data WarehouseApache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse
Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse
 
Deep learning on yarn running distributed tensorflow etc on hadoop cluster v3
Deep learning on yarn  running distributed tensorflow etc on hadoop cluster v3Deep learning on yarn  running distributed tensorflow etc on hadoop cluster v3
Deep learning on yarn running distributed tensorflow etc on hadoop cluster v3
 
The Computer Science Behind a modern Distributed Database
The Computer Science Behind a modern Distributed DatabaseThe Computer Science Behind a modern Distributed Database
The Computer Science Behind a modern Distributed Database
 
Hive ACID Apache BigData 2016
Hive ACID Apache BigData 2016Hive ACID Apache BigData 2016
Hive ACID Apache BigData 2016
 
Apache Hive on ACID
Apache Hive on ACIDApache Hive on ACID
Apache Hive on ACID
 
Bhupeshbansal bigdata
Bhupeshbansal bigdata Bhupeshbansal bigdata
Bhupeshbansal bigdata
 
SAS Programming.ppt
SAS Programming.pptSAS Programming.ppt
SAS Programming.ppt
 
In-Place analytics with Unified Data Access
In-Place analytics with Unified Data AccessIn-Place analytics with Unified Data Access
In-Place analytics with Unified Data Access
 
World-class Data Engineering with Amazon Redshift
World-class Data Engineering with Amazon RedshiftWorld-class Data Engineering with Amazon Redshift
World-class Data Engineering with Amazon Redshift
 

Recently uploaded

Test Automation Design Patterns_ A Comprehensive Guide.pdf
Test Automation Design Patterns_ A Comprehensive Guide.pdfTest Automation Design Patterns_ A Comprehensive Guide.pdf
Test Automation Design Patterns_ A Comprehensive Guide.pdfkalichargn70th171
 
Workshop: Enabling GenAI Breakthroughs with Knowledge Graphs - GraphSummit Milan
Workshop: Enabling GenAI Breakthroughs with Knowledge Graphs - GraphSummit MilanWorkshop: Enabling GenAI Breakthroughs with Knowledge Graphs - GraphSummit Milan
Workshop: Enabling GenAI Breakthroughs with Knowledge Graphs - GraphSummit MilanNeo4j
 
Optimizing Operations by Aligning Resources with Strategic Objectives Using O...
Optimizing Operations by Aligning Resources with Strategic Objectives Using O...Optimizing Operations by Aligning Resources with Strategic Objectives Using O...
Optimizing Operations by Aligning Resources with Strategic Objectives Using O...OnePlan Solutions
 
[GeeCON2024] How I learned to stop worrying and love the dark silicon apocalypse
[GeeCON2024] How I learned to stop worrying and love the dark silicon apocalypse[GeeCON2024] How I learned to stop worrying and love the dark silicon apocalypse
[GeeCON2024] How I learned to stop worrying and love the dark silicon apocalypseTomasz Kowalczewski
 
The Strategic Impact of Buying vs Building in Test Automation
The Strategic Impact of Buying vs Building in Test AutomationThe Strategic Impact of Buying vs Building in Test Automation
The Strategic Impact of Buying vs Building in Test AutomationElement34
 
architecting-ai-in-the-enterprise-apis-and-applications.pdf
architecting-ai-in-the-enterprise-apis-and-applications.pdfarchitecting-ai-in-the-enterprise-apis-and-applications.pdf
architecting-ai-in-the-enterprise-apis-and-applications.pdfWSO2
 
Microsoft365_Dev_Security_2024_05_16.pdf
Microsoft365_Dev_Security_2024_05_16.pdfMicrosoft365_Dev_Security_2024_05_16.pdf
Microsoft365_Dev_Security_2024_05_16.pdfMarkus Moeller
 
A Deep Dive into Secure Product Development Frameworks.pdf
A Deep Dive into Secure Product Development Frameworks.pdfA Deep Dive into Secure Product Development Frameworks.pdf
A Deep Dive into Secure Product Development Frameworks.pdfICS
 
Effective Strategies for Wix's Scaling challenges - GeeCon
Effective Strategies for Wix's Scaling challenges - GeeConEffective Strategies for Wix's Scaling challenges - GeeCon
Effective Strategies for Wix's Scaling challenges - GeeConNatan Silnitsky
 
From Theory to Practice: Utilizing SpiraPlan's REST API
From Theory to Practice: Utilizing SpiraPlan's REST APIFrom Theory to Practice: Utilizing SpiraPlan's REST API
From Theory to Practice: Utilizing SpiraPlan's REST APIInflectra
 
OpenChain Webinar: AboutCode and Beyond - End-to-End SCA
OpenChain Webinar: AboutCode and Beyond - End-to-End SCAOpenChain Webinar: AboutCode and Beyond - End-to-End SCA
OpenChain Webinar: AboutCode and Beyond - End-to-End SCAShane Coughlan
 
Modern binary build systems - PyCon 2024
Modern binary build systems - PyCon 2024Modern binary build systems - PyCon 2024
Modern binary build systems - PyCon 2024Henry Schreiner
 
Abortion Pills For Sale WhatsApp[[+27737758557]] In Birch Acres, Abortion Pil...
Abortion Pills For Sale WhatsApp[[+27737758557]] In Birch Acres, Abortion Pil...Abortion Pills For Sale WhatsApp[[+27737758557]] In Birch Acres, Abortion Pil...
Abortion Pills For Sale WhatsApp[[+27737758557]] In Birch Acres, Abortion Pil...drm1699
 
^Clinic ^%[+27788225528*Abortion Pills For Sale In witbank
^Clinic ^%[+27788225528*Abortion Pills For Sale In witbank^Clinic ^%[+27788225528*Abortion Pills For Sale In witbank
^Clinic ^%[+27788225528*Abortion Pills For Sale In witbankkasambamuno
 
BusinessGPT - Security and Governance for Generative AI
BusinessGPT  - Security and Governance for Generative AIBusinessGPT  - Security and Governance for Generative AI
BusinessGPT - Security and Governance for Generative AIAGATSoftware
 
Lessons Learned from Building a Serverless Notifications System.pdf
Lessons Learned from Building a Serverless Notifications System.pdfLessons Learned from Building a Serverless Notifications System.pdf
Lessons Learned from Building a Serverless Notifications System.pdfSrushith Repakula
 
^Clinic ^%[+27788225528*Abortion Pills For Sale In harare
^Clinic ^%[+27788225528*Abortion Pills For Sale In harare^Clinic ^%[+27788225528*Abortion Pills For Sale In harare
^Clinic ^%[+27788225528*Abortion Pills For Sale In hararekasambamuno
 
Food Delivery Business App Development Guide 2024
Food Delivery Business App Development Guide 2024Food Delivery Business App Development Guide 2024
Food Delivery Business App Development Guide 2024Chirag Panchal
 

Recently uploaded (20)

Test Automation Design Patterns_ A Comprehensive Guide.pdf
Test Automation Design Patterns_ A Comprehensive Guide.pdfTest Automation Design Patterns_ A Comprehensive Guide.pdf
Test Automation Design Patterns_ A Comprehensive Guide.pdf
 
Workshop: Enabling GenAI Breakthroughs with Knowledge Graphs - GraphSummit Milan
Workshop: Enabling GenAI Breakthroughs with Knowledge Graphs - GraphSummit MilanWorkshop: Enabling GenAI Breakthroughs with Knowledge Graphs - GraphSummit Milan
Workshop: Enabling GenAI Breakthroughs with Knowledge Graphs - GraphSummit Milan
 
Abortion Clinic In Polokwane ](+27832195400*)[ 🏥 Safe Abortion Pills in Polok...
Abortion Clinic In Polokwane ](+27832195400*)[ 🏥 Safe Abortion Pills in Polok...Abortion Clinic In Polokwane ](+27832195400*)[ 🏥 Safe Abortion Pills in Polok...
Abortion Clinic In Polokwane ](+27832195400*)[ 🏥 Safe Abortion Pills in Polok...
 
Optimizing Operations by Aligning Resources with Strategic Objectives Using O...
Optimizing Operations by Aligning Resources with Strategic Objectives Using O...Optimizing Operations by Aligning Resources with Strategic Objectives Using O...
Optimizing Operations by Aligning Resources with Strategic Objectives Using O...
 
[GeeCON2024] How I learned to stop worrying and love the dark silicon apocalypse
[GeeCON2024] How I learned to stop worrying and love the dark silicon apocalypse[GeeCON2024] How I learned to stop worrying and love the dark silicon apocalypse
[GeeCON2024] How I learned to stop worrying and love the dark silicon apocalypse
 
The Strategic Impact of Buying vs Building in Test Automation
The Strategic Impact of Buying vs Building in Test AutomationThe Strategic Impact of Buying vs Building in Test Automation
The Strategic Impact of Buying vs Building in Test Automation
 
architecting-ai-in-the-enterprise-apis-and-applications.pdf
architecting-ai-in-the-enterprise-apis-and-applications.pdfarchitecting-ai-in-the-enterprise-apis-and-applications.pdf
architecting-ai-in-the-enterprise-apis-and-applications.pdf
 
Microsoft365_Dev_Security_2024_05_16.pdf
Microsoft365_Dev_Security_2024_05_16.pdfMicrosoft365_Dev_Security_2024_05_16.pdf
Microsoft365_Dev_Security_2024_05_16.pdf
 
A Deep Dive into Secure Product Development Frameworks.pdf
A Deep Dive into Secure Product Development Frameworks.pdfA Deep Dive into Secure Product Development Frameworks.pdf
A Deep Dive into Secure Product Development Frameworks.pdf
 
Effective Strategies for Wix's Scaling challenges - GeeCon
Effective Strategies for Wix's Scaling challenges - GeeConEffective Strategies for Wix's Scaling challenges - GeeCon
Effective Strategies for Wix's Scaling challenges - GeeCon
 
From Theory to Practice: Utilizing SpiraPlan's REST API
From Theory to Practice: Utilizing SpiraPlan's REST APIFrom Theory to Practice: Utilizing SpiraPlan's REST API
From Theory to Practice: Utilizing SpiraPlan's REST API
 
OpenChain Webinar: AboutCode and Beyond - End-to-End SCA
OpenChain Webinar: AboutCode and Beyond - End-to-End SCAOpenChain Webinar: AboutCode and Beyond - End-to-End SCA
OpenChain Webinar: AboutCode and Beyond - End-to-End SCA
 
Abortion Pill Prices Germiston ](+27832195400*)[ 🏥 Women's Abortion Clinic in...
Abortion Pill Prices Germiston ](+27832195400*)[ 🏥 Women's Abortion Clinic in...Abortion Pill Prices Germiston ](+27832195400*)[ 🏥 Women's Abortion Clinic in...
Abortion Pill Prices Germiston ](+27832195400*)[ 🏥 Women's Abortion Clinic in...
 
Modern binary build systems - PyCon 2024
Modern binary build systems - PyCon 2024Modern binary build systems - PyCon 2024
Modern binary build systems - PyCon 2024
 
Abortion Pills For Sale WhatsApp[[+27737758557]] In Birch Acres, Abortion Pil...
Abortion Pills For Sale WhatsApp[[+27737758557]] In Birch Acres, Abortion Pil...Abortion Pills For Sale WhatsApp[[+27737758557]] In Birch Acres, Abortion Pil...
Abortion Pills For Sale WhatsApp[[+27737758557]] In Birch Acres, Abortion Pil...
 
^Clinic ^%[+27788225528*Abortion Pills For Sale In witbank
^Clinic ^%[+27788225528*Abortion Pills For Sale In witbank^Clinic ^%[+27788225528*Abortion Pills For Sale In witbank
^Clinic ^%[+27788225528*Abortion Pills For Sale In witbank
 
BusinessGPT - Security and Governance for Generative AI
BusinessGPT  - Security and Governance for Generative AIBusinessGPT  - Security and Governance for Generative AI
BusinessGPT - Security and Governance for Generative AI
 
Lessons Learned from Building a Serverless Notifications System.pdf
Lessons Learned from Building a Serverless Notifications System.pdfLessons Learned from Building a Serverless Notifications System.pdf
Lessons Learned from Building a Serverless Notifications System.pdf
 
^Clinic ^%[+27788225528*Abortion Pills For Sale In harare
^Clinic ^%[+27788225528*Abortion Pills For Sale In harare^Clinic ^%[+27788225528*Abortion Pills For Sale In harare
^Clinic ^%[+27788225528*Abortion Pills For Sale In harare
 
Food Delivery Business App Development Guide 2024
Food Delivery Business App Development Guide 2024Food Delivery Business App Development Guide 2024
Food Delivery Business App Development Guide 2024
 

Data organization: hive meetup

  • 1. Page1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Hive: Data Organization for Performance Gopal Vijayaraghavan
  • 2. Page2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved In this episode All BigData problems are primarily lookup problems All Lookup problems are really Storage problems All Storage problems turn into ETL problems ETL problems are all about the Data Data navigation? Data organization? Data ingestion? It’s Big?
  • 3. Page3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Good idea: Do things that scale! There are many problems like this, but this one is mine
  • 4. Page4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Partitions If you have a database on cars and you partition on VIN# If you have a database on sales and you partition on customer_id Rule of thumb: Average partition is >=1Gb and total # of partitions per query <1000
  • 5. Page5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Buckets If you have more files than rows, you’ve definitely got bucketing wrong “clustered by” != “cluster by” Bucketing on a skewed column slows down ETL a *lot* (for no win) If you have partitions, sort-merge bucket-mapjoin can be slower than a shuffle!!
  • 6. Page6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Buckets - II Histograms! select explode(histogram_numeric( hash(<col>)% <n-bucket>, <n-bucket> )) as h from table; The Curse of 31 & the last byte If you have buckets & partitions, always remember to ETL with set hive.optimize.sort.dynamic.partition=true;
  • 7. Page7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Denormalization Denormalization can turn a compute problem into an IO/lookup problem. But if you then optimize that with compression, you get a compute problem again. If you think JOINs are bad, you probably haven’t moved out of MapReduce. Broadcast joins are good & dynamically partitioned broadcast joins can scale that ~1000x
  • 8. Page8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Indexes? Indexes in hive barely help in a columnar world – incremental rebuild isn’t really there ORC maintains internal bloom filter indexes (PARQUET-41 too) You can store your indexes as ORC files, if you want, so that you can have an index in your index, to speedup indexes
  • 9. Page9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Schema & Predicate Push Down Never store a Number as a string, because guess what “11” < “9” and “11.0” != “11” – transform, then load Predicate push-down cannot fight the type system (♫ … breaking blocks in the hot sun ♫) UDFs applied on the data column is always a bad idea for fast filtering. If you need case-insensitive lookups, always store as UPPER/lower. If you need LIKE “%.twimg.com”, store like DNS does “.com.twimg…”
  • 10. Page10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Temporary tables In-memory temp-tables set hive.exec.temporary.table.storage=memory; Easiest way to reorganize data temporarily or to produce a “distinct slice” “create temporary table if not exists stored as orc as select …” Can be used for pagination queries to good effect, for display tools
  • 11. Page11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Complex types & Nesting There’s pretty much no advantage to using structs – they’re nearly columns, without any of the good stuff Maps – not so bad, but handle with care Maps are way better than 4000 columns, most of them null Arrays – ignore mostly (JDBC!!)
  • 12. Page12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Schema Evolution Add columns, never remove them Schemas are per-partition Remember, the partitions don’t change their schema after they’re created All new inserts have new schema After schema update, inserting data into old partitions is a recipe for disaster Type changes for a column also complicate things (except for simple stuff like Int -> BigInt or Float -> Double)
  • 13. Page13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Questions? ?