Hadoop Summit 2014 : Benchmarking Apache Hive at Yahoo Scale

Benchmarking Hive at Yahoo Scale
P R E S E N T E D B Y M i t h u n R a d h a k r i s h n a n ⎪ J u n e 4 , 2 0 1 4
2 0 1 4 H a d o o p S u m m i t , S a n J o s e , C a l i f o r n i a

About myself
2
 HCatalog Committer, Hive
contributor
› Metastore, Notifications, HCatalog APIs
› Integration with Oozie, Data Ingestion
 Other odds and ends
› DistCp
 mithun@apache.org
2014 Hadoop Summit, San Jose, California

About this talk
3
 Introduction to “Yahoo Scale”
 The use-case in Yahoo
 The Benchmark
 The Setup
 The Observations (and, possibly, lessons)
 Fisticuffs

The Y!Grid
4
 16 Hadoop Clusters in YGrid
› 32500 Nodes
› 750K jobs a day
 Hadoop 0.23.10.x, 2.4.x
 Large Datasets
› Daily, hourly, minute-level frequencies
› Terabytes of data, 1000s of files, per dataset instance
 Pig 0.11
 Hive 0.10 / HCatalog 0.5
› => Hive 0.12

Data Processing Use cases
5 2014 Hadoop Summit, San Jose, California
 Pig for Data Pipelines
› Imperative paradigm
› ~45% Hadoop Jobs on Production Clusters
• M/R + Oozie = 41%
 Hive for Ad hoc queries
› SQL
› Relatively smaller number of jobs
• *Major* Uptick
 Use HCatalog for Inter-op

6 Yahoo Confidential & Proprietary
Hive is Currently the Fastest Growing Product on the Grid
0.0%
1.0%
2.0%
3.0%
4.0%
5.0%
6.0%
7.0%
8.0%
9.0%
10.0%
0
5
10
15
20
25
30
Mar-13 Apr-13 May-13 Jun-13 Jul-13 Aug-13 Sep-13 Oct-13 Nov-13 Dec-13 Jan-14 Feb-14 Mar-14 Apr-14 May-14
HiveJobs(%ofAllJobs)
AllGridJobs(inMillions)
All Jobs Hive (% of all jobs)
2.4 million
Hive jobs

Business Intelligence Tools
7
 {Tableau, MicroStrategy, Excel, … }
 Challenges:
› Security
• ACLs, Authentication, Encryption over the wire, Full-disk Encryption
› Bandwidth
• Transporting results over ODBC
› Query Latency
• Query execution time
• Cost of query “optimizations”
• “Bad” queries

The Benchmark
8
 TPC-h
› Industry standard (tpc.org/tpch)
› 22 queries
› dbgen –s 1000 –S 3
• Parallelizable
 Reynold Xin’s excellent work:
› https://github.com/rxin
› Transliterated queries to suit Hive 0.9

Relational Diagram
PARTKEY
NAME
MFGR
BRAND
TYPE
SIZE
CONTAINER
COMMENT
RETAILPRICE
PARTKEY
SUPPKEY
AVAILQTY
SUPPLYCOST
COMMENT
SUPPKEY
NAME
ADDRESS
NATIONKEY
PHONE
ACCTBAL
COMMENT
ORDERKEY
PARTKEY
SUPPKEY
LINENUMBER
RETURNFLAG
LINESTATUS
SHIPDATE
COMMITDATE
RECEIPTDATE
SHIPINSTRUCT
SHIPMODE
COMMENT
CUSTKEY
ORDERSTATUS
TOTALPRICE
ORDERDATE
ORDER-
PRIORITY
SHIP-
PRIORITY
CLERK
COMMENT
CUSTKEY
NAME
ADDRESS
PHONE
ACCTBAL
MKTSEGMENT
COMMENT
PART (P_)
SF*200,000
PARTSUPP (PS_)
SF*800,000
LINEITEM (L_)
SF*6,000,000
ORDERS (O_)
SF*1,500,000
CUSTOMER (C_)
SF*150,000
SUPPLIER (S_)
SF*10,000
ORDERKEY
NATIONKEY
EXTENDEDPRICE
DISCOUNT
TAX
QUANTITY
NATIONKEY
NAME
REGIONKEY
NATION (N_)
25
COMMENT
REGIONKEY
NAME
COMMENT
REGION (R_)
5

The Setup
10
› 350 Node cluster
• Xeon boxen: 2 Slots with E5530s => 16 CPUs
• 24GB memory
– NUMA enabled
• 6 SATA drives, 2TB, 7200 RPM Seagates
• RHEL 6.4
• JRE 1.7 (-d64)
• Hadoop 0.23.7+/2.3+, Security turned off
• Tez 0.3.x
• 128MB HDFS block-size
› Downscale tests: 100 Node cluster
• hdfs-balancer.sh

The Prep
11
 Data generation:
› Text data: dbgen on MapReduce
› Transcode to RCFile and ORC: Hive on MR
• insert overwrite table orc_table partition( … ) select * from text_table;
› Partitioning:
• Only for 1TB, 10TB cases
• Perils of dynamic partitioning
› ORC File:
• 64MB stripes, ZLIB Compression

0
500
1000
1500
2000
2500
q1_pricing_summary_report.hive
q2_minimum_cost_supplier.hiveq3_shipping_priority.hive
q4_order_priority
q5_local_supplier_volume.hive
q6_forecast_revenue_change.hiveq7_volume_shipping.hive
q8_na
onal_market_share.hive
q9_product_type_profit.hiveq10_returned_item.hiveq11_important_stock.hive
q12_shipping.hive
q13_customer_distribu
on.hiveq14_promo
on_effect.hive
q15_top_supplier.hive
q16_parts_supplier_rela
onship.hive
q17_small_quan
ty_order_revenue.hive
q18_large_volume_customer.hive
q19_discounted_revenue.hive
q20_poten
al_part_promo
on.hive
q21_suppliers_who_kept_orders_waing.hive
q22_global_sales_opportunity.hive
Time(inseconds)
TPC-h 100GB
Hive 0.10 (Text)
Hive 0.10 RCFile
Hive 0.11 ORC
Hive 0.13 ORC MR
Hive 0.13 ORC Tez

100 GB
14
› 18x speedup over Hive 0.10 (Textfile)
• 6-50x
› 11.8x speedup over Hive 0.10 (RCFile)
• 5-30x
› Average query time: 28 seconds
• Down from 530 (Hive 0.10 Text)
› 85% queries completed in under a minute

0
500
1000
1500
2000
2500
q4_order_priority
q8_na
q9_product_type_profit.hiveq10_returned_item.hive
q11_important_stock.hive
q12_shipping.hive
on.hive
q14_promo
on_effect.hiveq15_top_supplier.hive
onship.hive
q17_small_quan
q20_poten
al_part_promo
on.hive
Time(inseconds)
TPC-h 1TB
Hive 0.10 RC File
Hive 0.11 ORC
Hive 0.12 ORC
Hive 0.13 ORC MR
Hive 0.13 ORC Tez

1 TB
16
• Between 2.5-17x
› Average query time: 172 seconds
• Between 5-947 seconds
• Down from 729 seconds (Hive 0.10 RCFile)
› 61% queries completed in under 2 minutes

0
2000
4000
6000
8000
10000
12000
q1_pricing_summary_report.hiveq2_minim
um_cost_supplier.hive
q3_shipping_priority.hive
q4_order_priorityq5_local_supplier_volume.hiveq6_forecast_revenue_change.hive
q7_volume_shipping.hiveq8_na
q9_product_type_profit.hive
q10_returned_item.hive
q11_im
portant_stock.hive
q12_shipping.hiveq13_customer_distribu
on.hive
q14_promo
on_effect.hive
onship.hive
q17_small_quan
ty_order_revenue.hiveq18_large_volume_customer.hive
q20_poten
al_part_promo
on.hive
q21_suppliers_who_kept_orders_waing.hiveq22_global_sales_opportunity.hive
Time(inseconds)
TPC-h10TB
Hive0.10RCFile
Hive0.11ORC
Hive0.12ORC
Hive0.13ORCMR
Hive0.13ORCTez

10 TB
18
• Between 1.6-10x
› Average query time: 908 seconds (426 seconds excluding outliers)
• Down from 2129 seconds with Hive 0.10 RCFile
– (1712 seconds excluding outliers)
› Q6 still completes in 12 seconds!

Explaining the speed-ups
19
 Hadoop 2.x, et al.
 Tez
› (Arbitrary DAG)-based Execution Engine
› “Playing the gaps” between M&R
• Temporary data and the HDFS
› Feedback loop
› Smart scheduling
› Container re-use
› Pipelined job start-up
 Hive
› Statistics
› “Vector-ized” Execution
 ORC
› PPD

0
100
200
300
400
500
600
700
800
900
1000
q1_pricing_sum
mary_report.hive
q2_m
inim
um
_cost_supplier.hive
q3_shipping_priority.hiveq4_order_priority
q6_forecast_revenue_change.hive
q7_volume_shipping.hive
q8_na
onal_m
arket_share.hive
q10_returned_item
.hive
q11_im
portant_stock.hiveq12_shipping.hive
on.hive
q14_promo
on_effect.hive
onship.hive
q17_small_quan
q18_large_volum
e_customer.hive
q20_poten
al_part_prom
o
on.hive
Time(inseconds)
Vectoriza on
Hive 0.13 Tez ORC
Hive 0.13 Tez ORC Vec

ORC File Layout
 Data is composed of multiple streams per
column
 Index allows for skipping rows (default to
every 10,000 rows), keeping position in
each stream, and min-max for each
column
 Footer contains directory of stream
locations, and the encoding for each
column
 Integer columns are serialized using run-
length encoding
 String columns are serialized using
dictionary for column values, and the
same run length encoding
 Stripe footer is used to find the requested
column’s data streams and adjacent
stream reads are merged File Footer
Postscript
Index Data
Row Data
Stripe Footer
256MBStripe
Index Data
Row Data
Stripe Footer
256MBStripe
Index Data
Row Data
Stripe Footer
256MBStripe
Column 1
Column 2
Column 7
Column 8
Column 3
Column 6
Column 4
Column 5
Column 1
Column 2
Column 7
Column 8
Column 3
Column 6
Column 4
Column 5
Stream 2.1
Stream 2.2
Stream 2.3
Stream 2.4

ORC Usage
CREATE TABLE addresses (
name string,
street string,
city string,
state string,
zip int
)
STORED AS orc TBLPROPERTIES ("orc.compress"= "ZLIB");
LOCATION ‘/path/to/addresses’;
ALTER TABLE ... [PARTITION partition_spec] SET FILEFORMAT orc
SET hive.default.fileformat = orc
SET hive.exec.orc.memory.pool = 0.50 (ORC writer is allowed 50% of JVM heap size by default)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde’
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat’
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';
Key Default Comments
orc.compress ZLIB high-level compression (one of NONE, ZLIB, Snappy)
orc.compress.size 262,144 (256 KB) number of bytes in each compression chunk
orc.stripe.size 67,108,864 (64 MB) number of bytes in each stripe. Each ORC stripe is processed in one map task (try 32
MB to cut down on disk I/O)
orc.row.index.stride 10,000 number of rows between index entries (must be >= 1,000). A larger stride-size
increases the probability of not being able to skip the stride, for a predicate.
orc.create.index true whether to create row indexes. This is for predicate push-down (bloom-filters). If data is
frequently accessed/filtered on a certain column, then sorting on the column and using
index-filters makes column filters work faster

0
100
200
300
400
500
600
700
800
900
1000
q4_order_priority
q8_na
q9_product_type_profit.hiveq10_returned_item.hive
q11_important_stock.hive
q12_shipping.hive
on.hive
q14_promo
onship.hive
q17_small_quan
q20_poten
al_part_promo
on.hive
Time(inseconds)
Effects of Compression (1TB)
Hive 0.13 Uncompressed ORC
Hive 0.13 ZLIB Compressed

0
500
1000
1500
2000
2500
3000
q4_order_priority
q8_na
q9_product_type_profit.hiveq10_returned_item.hiveq11_important_stock.hive
q12_shipping.hive
on.hive
q14_promo
onship.hive
q17_small_quan
q20_poten
al_part_promo
on.hive
Time(inseconds)
Effects of Compression (10TB)
Hive 0.13 Uncompressed
Hive 0.13 Compressed

Configuring ORC
25
 set hive.merge.mapredfiles=true
 set hive.merge.mapfiles=true
 set orc.stripe.size=67,108,864
› Half the HDFS block-size
• Prevent cross-block stripe-read
• Tangent: DistCp
 set orc.compress=???
› Depends on size and distribution
› Snappy compression hasn’t been explored
 YMMV
› Experiment

0
100
200
300
400
500
600
700
800
900
1000
q1_pricing_sum
m
ary_report.hive
q2_m
inim
um
_cost_supplier.hive
q3_shipping_priority.hive
q4_order_priority
q5_local_supplier_volum
e.hive
q6_forecast_revenue_change.hive
q7_volum
e_shipping.hive
q8_na
onal_m
arket_share.hive
q10_returned_item
.hive
q11_im
portant_stock.hive
q12_shipping.hive
q13_custom
er_distribu
on.hive
q14_prom
o
on_effect.hive
onship.hive
q17_sm
all_quan
q18_large_volum
e_custom
er.hive
q20_poten
al_part_prom
o
on.hive
q21_suppliers_w
ho_kept_orders_w
ai
ng.hive
Time(inseconds)
100 vs 350 Nodes
Hive 0.13 100 Nodes
Hive 0.13 350 Nodes

Y!Grid sticking with Hive
28
 Familiarity
› Existing ecosystem
 Community
 Scale
 Multitenant
 Coming down the pike
› CBO
› In-memory caching solutions atop HDFS
• RAMfs a la Tachyon?

We’re not done yet
29
 SQL compliance
 Scaling up the metastore
performance
 Better BI Tool integration
 Faster transport
› HiveServer2 result-sets

References
30
 The YDN blog post:
› http://yahoodevelopers.tumblr.com/post/85930551108/yahoo-betting-on-apache-hive-
tez-and-yarn
 Code:
› https://github.com/mythrocks/hivebench (TPC-h scripts, datagen, transcode utils)
› https://github.com/t3rmin4t0r/tpch-gen (Parallel TPC-h gen)
› https://github.com/rxin/TPC-H-Hive (TPC-h scripts for Hive)
› https://issues.apache.org/jira/browse/HIVE-600 (Yuntao’s initial TPC-h JIRA)

Thank You
@mithunrk
mithun@apache.org
We are hiring!
Stop by Kiosk P9
or reach out to us at
bigdata@yahoo-inc.com.

Sharky comments
33
 Testing with Shark 0.7.x and Shark 0.8
› Compatible with Hive Metastore 0.9
› 100GB datasets : Admirable performance
› 1TB/10TB: Tests did not run completely
• Failures, especially in 10TB cases
• Hangs while shuffling data
• Scaled back to 100 nodes -> More tests ran through, but not completely
› nReducers: Not inferred
 Miscellany
› Security
› Multi-tenancy
› Compatibility

Hadoop Summit 2014 : Benchmarking Apache Hive at Yahoo Scale

Recommended

Recommended

More Related Content

Viewers also liked

Viewers also liked (20)

Similar to Hadoop Summit 2014 : Benchmarking Apache Hive at Yahoo Scale

Similar to Hadoop Summit 2014 : Benchmarking Apache Hive at Yahoo Scale (20)

Recently uploaded

Recently uploaded (20)

Hadoop Summit 2014 : Benchmarking Apache Hive at Yahoo Scale

Editor's Notes