1. Benchmarking Hive at Yahoo Scale
Presented by Mithun Radhakrishnan | June 4, 2014
2014 Hadoop Summit, San Jose, California
2. About myself
HCatalog committer, Hive contributor
› Metastore, Notifications, HCatalog APIs
› Integration with Oozie, Data Ingestion
Other odds and ends
› DistCp
mithun@apache.org
3. About this talk
Introduction to “Yahoo Scale”
The use-case in Yahoo
The Benchmark
The Setup
The Observations (and, possibly, lessons)
Fisticuffs
4. The Y!Grid
16 Hadoop clusters in the Y!Grid
› 32500 Nodes
› 750K jobs a day
Hadoop 0.23.10.x, 2.4.x
Large Datasets
› Daily, hourly, minute-level frequencies
› Terabytes of data, 1000s of files, per dataset instance
Pig 0.11
Hive 0.10 / HCatalog 0.5
› Migrating to Hive 0.12
5. Data Processing Use cases
Pig for Data Pipelines
› Imperative paradigm
› ~45% Hadoop Jobs on Production Clusters
• M/R + Oozie = 41%
Hive for Ad hoc queries
› SQL
› Relatively small number of jobs
• *Major* Uptick
Use HCatalog for interoperability
6. Hive is Currently the Fastest Growing Product on the Grid
[Chart: Mar-13 through May-14, all grid jobs (in millions, 0-30, left axis) plotted against Hive jobs as a percentage of all jobs (0%-10%, right axis); Hive grows to 2.4 million jobs.]
7. Business Intelligence Tools
{Tableau, MicroStrategy, Excel, … }
Challenges:
› Security
• ACLs, Authentication, Encryption over the wire, Full-disk Encryption
› Bandwidth
• Transporting results over ODBC
› Query Latency
• Query execution time
• Cost of query “optimizations”
• “Bad” queries
8. The Benchmark
TPC-H
› Industry standard (tpc.org/tpch)
› 22 queries
› dbgen -s 1000 -S 3
• Parallelizable
Reynold Xin’s excellent work:
› https://github.com/rxin
› Transliterated queries to suit Hive 0.9
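For a flavor of what the transliterated queries look like in HiveQL, here is a minimal sketch of TPC-H Q6 (the forecast-revenue-change query). The table and column names follow the standard TPC-H schema; the exact text in the benchmark scripts may differ slightly.

-- Hedged sketch: TPC-H Q6 in HiveQL, with the standard substitution parameters.
SELECT SUM(l_extendedprice * l_discount) AS revenue
FROM lineitem
WHERE l_shipdate >= '1994-01-01'
  AND l_shipdate < '1995-01-01'
  AND l_discount BETWEEN 0.05 AND 0.07
  AND l_quantity < 24;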
9. Relational Diagram
[TPC-H schema diagram. Tables, with column prefix and cardinality at scale factor SF:]
PART (P_): SF * 200,000
PARTSUPP (PS_): SF * 800,000
LINEITEM (L_): SF * 6,000,000
ORDERS (O_): SF * 1,500,000
CUSTOMER (C_): SF * 150,000
SUPPLIER (S_): SF * 10,000
NATION (N_): 25 rows
REGION (R_): 5 rows
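For reference, a hedged sketch of how the largest of these tables might be declared in Hive before transcoding: an external delimited-text table over the pipe-separated dbgen output. The path and the exact column types are assumptions, not the benchmark's actual DDL.

-- Hedged sketch: the LINEITEM fact table over raw dbgen output (path is illustrative).
CREATE EXTERNAL TABLE lineitem (
  l_orderkey      BIGINT,
  l_partkey       BIGINT,
  l_suppkey       BIGINT,
  l_linenumber    INT,
  l_quantity      DOUBLE,
  l_extendedprice DOUBLE,
  l_discount      DOUBLE,
  l_tax           DOUBLE,
  l_returnflag    STRING,
  l_linestatus    STRING,
  l_shipdate      STRING,
  l_commitdate    STRING,
  l_receiptdate   STRING,
  l_shipinstruct  STRING,
  l_shipmode      STRING,
  l_comment       STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/path/to/tpch/lineitem';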
10. The Setup
› 350 Node cluster
• Xeon boxes: 2 sockets with E5530s => 16 logical CPUs
• 24GB memory
– NUMA enabled
• 6 SATA drives, 2TB, 7200 RPM Seagates
• RHEL 6.4
• JRE 1.7 (-d64)
• Hadoop 0.23.7+/2.3+, Security turned off
• Tez 0.3.x
• 128MB HDFS block-size
› Downscale tests: 100 Node cluster
• hdfs-balancer.sh
11. The Prep
Data generation:
› Text data: dbgen on MapReduce
› Transcode to RCFile and ORC: Hive on MR
• insert overwrite table orc_table partition( … ) select * from text_table;
› Partitioning:
• Only for 1TB, 10TB cases
• Perils of dynamic partitioning (see the sketch after this slide)
› ORC File:
• 64MB stripes, ZLIB Compression
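To make the dynamic-partitioning step concrete, here is a minimal sketch of the transcode with dynamic partitions enabled. The table names (lineitem_text, lineitem_orc) and the l_shipdate partition column are illustrative, not the exact production scripts.

-- Hedged sketch: text-to-ORC transcode into a dynamically partitioned table.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions=10000;        -- guard rails against runaway partition counts
SET hive.exec.max.dynamic.partitions.pernode=1000;

INSERT OVERWRITE TABLE lineitem_orc PARTITION (l_shipdate)
SELECT l_orderkey, l_partkey, l_suppkey, l_linenumber, l_quantity,
       l_extendedprice, l_discount, l_tax, l_returnflag, l_linestatus,
       l_commitdate, l_receiptdate, l_shipinstruct, l_shipmode, l_comment,
       l_shipdate                                   -- the partition column goes last
FROM lineitem_text;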
14. 100 GB
› 18x speedup over Hive 0.10 (Textfile)
• Per-query range: 6-50x
› 11.8x speedup over Hive 0.10 (RCFile)
• Per-query range: 5-30x
› Average query time: 28 seconds
• Down from 530 seconds (Hive 0.10, Textfile)
› 85% of queries completed in under a minute
16. 1 TB
› 6.2x speedup over Hive 0.10 (RCFile)
• Per-query range: 2.5-17x
› Average query time: 172 seconds
• Per-query range: 5-947 seconds
• Down from 729 seconds (Hive 0.10, RCFile)
› 61% of queries completed in under 2 minutes
› 81% of queries completed in under 4 minutes
18. 10 TB
› 6.2x speedup over Hive 0.10 (RCFile)
• Per-query range: 1.6-10x
› Average query time: 908 seconds (426 seconds excluding outliers)
• Down from 2,129 seconds with Hive 0.10 RCFile
– (1,712 seconds excluding outliers)
› 61% of queries completed in under 5 minutes
› 71% of queries completed in under 10 minutes
› Q6 still completes in 12 seconds!
19. Explaining the speed-ups
Hadoop 2.x, et al.
Tez
› (Arbitrary DAG)-based Execution Engine
› “Playing the gaps” between M&R
• Temporary data and the HDFS
› Feedback loop
› Smart scheduling
› Container re-use
› Pipelined job start-up
Hive
› Statistics
› Vectorized execution
ORC
› Predicate push-down (PPD)
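Most of the features above are switched on with a handful of session settings; a minimal sketch follows. The property names are standard Hive 0.13 settings, but the values used in these benchmark runs are not reproduced here, and the table name is illustrative.

-- Hedged sketch: enabling Tez, vectorization, stats, and ORC predicate push-down.
SET hive.execution.engine=tez;               -- run queries as Tez DAGs instead of MapReduce
SET hive.vectorized.execution.enabled=true;  -- process rows in batches rather than one at a time
SET hive.compute.query.using.stats=true;     -- answer simple aggregates from metastore statistics
SET hive.optimize.index.filter=true;         -- push predicates down into ORC row-group indexes

-- Statistics have to be gathered before the optimizer can use them:
ANALYZE TABLE lineitem_orc PARTITION (l_shipdate) COMPUTE STATISTICS;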
20. Vectorization
[Chart: per-query execution time in seconds (0-1000) for the 22 TPC-H queries (q1_pricing_summary_report.hive through q22_global_sales_opportunity.hive), comparing "Hive 0.13 Tez ORC" with "Hive 0.13 Tez ORC Vec" (vectorization enabled).]
21. ORC File Layout
Data is composed of multiple streams per column.
The index allows skipping rows (by default every 10,000 rows), keeps the position in each stream, and records min/max values for each column.
The footer contains a directory of stream locations and the encoding for each column.
Integer columns are serialized using run-length encoding.
String columns are serialized using a dictionary for column values, plus the same run-length encoding.
The stripe footer is used to find the requested column's data streams, and adjacent stream reads are merged.
[Diagram: an ORC file laid out as a sequence of 256 MB stripes, each containing Index Data, Row Data (per-column streams, e.g. Streams 2.1-2.4 for Column 2), and a Stripe Footer, followed by the File Footer and Postscript.]
22. ORC Usage
CREATE TABLE addresses (
  name   string,
  street string,
  city   string,
  state  string,
  zip    int
)
STORED AS orc
LOCATION '/path/to/addresses'
TBLPROPERTIES ("orc.compress" = "ZLIB");

ALTER TABLE ... [PARTITION partition_spec] SET FILEFORMAT orc;

SET hive.default.fileformat = orc;
SET hive.exec.orc.memory.pool = 0.50;  -- the ORC writer is allowed 50% of the JVM heap by default

-- The long form of STORED AS orc:
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';
ORC table properties (key, default, comments):
• orc.compress (default: ZLIB): high-level compression; one of NONE, ZLIB, SNAPPY
• orc.compress.size (default: 262,144 bytes, i.e. 256 KB): number of bytes in each compression chunk
• orc.stripe.size (default: 67,108,864 bytes, i.e. 64 MB): number of bytes in each stripe; each ORC stripe is processed in one map task (try 32 MB to cut down on disk I/O)
• orc.row.index.stride (default: 10,000): number of rows between index entries (must be >= 1,000); a larger stride increases the chance that a stride cannot be skipped for a given predicate
• orc.create.index (default: true): whether to create row indexes, used for predicate push-down; if data is frequently filtered on a certain column, sorting on that column and using index filters makes those filters faster
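The last property above suggests sorting on a frequently filtered column so that the row-group indexes can actually skip strides. A hedged sketch of what that might look like; lineitem_orc_sorted and the l_shipdate predicate are illustrative, not part of the benchmark scripts.

-- Hedged sketch: write the data sorted on the filter column, then let PPD skip strides.
CREATE TABLE lineitem_orc_sorted
STORED AS orc
AS SELECT * FROM lineitem_orc SORT BY l_shipdate;

SET hive.optimize.index.filter=true;   -- use the ORC row-group min/max indexes

-- Strides whose l_shipdate min/max range doesn't cover the predicate are skipped:
SELECT COUNT(*) FROM lineitem_orc_sorted
WHERE l_shipdate = '1995-06-15';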
25. Configuring ORC
set hive.merge.mapredfiles=true
set hive.merge.mapfiles=true
set orc.stripe.size=67108864
› Half the HDFS block-size
• Prevent cross-block stripe-read
• Tangent: DistCp
set orc.compress=???
› Depends on size and distribution
› Snappy compression hasn’t been explored
YMMV
› Experiment
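Putting these settings together, a minimal sketch of how an ORC table for this benchmark might be declared; the table name, columns, and partition column (the ORDERS table, partitioned by order date) are illustrative, not the production DDL.

-- Hedged sketch: session settings plus ORC table properties.
SET hive.merge.mapfiles=true;        -- coalesce small files from map-only jobs
SET hive.merge.mapredfiles=true;     -- coalesce small files from map-reduce jobs

CREATE TABLE orders_orc (
  o_orderkey    BIGINT,
  o_custkey     BIGINT,
  o_orderstatus STRING,
  o_totalprice  DOUBLE
  -- remaining ORDERS columns elided for brevity
)
PARTITIONED BY (o_orderdate STRING)
STORED AS orc
TBLPROPERTIES (
  "orc.stripe.size" = "67108864",    -- 64 MB stripes: half the 128 MB HDFS block size
  "orc.compress"    = "ZLIB"         -- worth comparing against NONE / SNAPPY on your own data
);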
26. 100 vs 350 Nodes
[Chart: per-query execution time in seconds (0-1000) for the 22 TPC-H queries, comparing "Hive 0.13 100 Nodes" with "Hive 0.13 350 Nodes".]
28. Y!Grid sticking with Hive
Familiarity
› Existing ecosystem
Community
Scale
Multitenant
Coming down the pike
› CBO
› In-memory caching solutions atop HDFS
• RAMfs a la Tachyon?
29. We’re not done yet
SQL compliance
Scaling up metastore performance
Better BI Tool integration
Faster transport
› HiveServer2 result-sets
30. References
The YDN blog post:
› http://yahoodevelopers.tumblr.com/post/85930551108/yahoo-betting-on-apache-hive-tez-and-yarn
Code:
› https://github.com/mythrocks/hivebench (TPC-h scripts, datagen, transcode utils)
› https://github.com/t3rmin4t0r/tpch-gen (Parallel TPC-h gen)
› https://github.com/rxin/TPC-H-Hive (TPC-h scripts for Hive)
› https://issues.apache.org/jira/browse/HIVE-600 (Yuntao’s initial TPC-h JIRA)
33. Sharky comments
Testing with Shark 0.7.x and Shark 0.8
› Compatible with Hive Metastore 0.9
› 100GB datasets : Admirable performance
› 1TB/10TB: Tests did not run completely
• Failures, especially in 10TB cases
• Hangs while shuffling data
• Scaled back to 100 nodes -> More tests ran through, but not completely
› Number of reducers: not inferred automatically
Miscellany
› Security
› Multi-tenancy
› Compatibility
Editor's Notes
Gopal was supposed to be presenting this with me, to talk about Tez. Point to Gopal/Jitendra’s talk on Hive/Tez for details on things I’ll have to skim over.
Also, acknowledge Thomas Graves, who’s talking today about the excellent work he’s doing on driving Spark on Yarn.
There are several sides to query latency:
Query execution time : Addressed in the physical query-execution layer.
Query optimizations: The first step while optimizing the query plan seems to be to query for all partition instances. Very expensive for “Project Benzene”.
Bad queries : Tableau, I’m looking at you.
The Transaction Processing Performance Council (inexplicably abbreviated to TPC) suggests a set of benchmarks for query processing. Many have adopted TPC-DS to showcase performance. We chose TPC-H to complement that. (Also, 22 is a much smaller number to deal with than… 90?)
Transliteration: Evita and Kylie Minogue
Lineitem and Orders are extremely large Fact tables. Nation and Region are the smallest dimension tables.
Tangent: Funny story:
1. About hard-drives: Can set up MR intermediate directories and HDFS data-node directories to be on different disks. Traffic from one doesn’t affect the other. But on the other hand, total read bandwidth might be reduced.
Line-item: Partitioned on Ship-date.
Orders: Order-date
Customers: By market-segment
Suppliers: On their region-key.
Q5 and q21 are anomalous.
Q21: Hit a trailing reducer across all versions of Hive tested. Perhaps this can be improved with a better plan.
Q5: Slow reducer that hit only Hive 13. Could be a bad plan. Could be a difference in data distribution when data was regenerated for Hadoop 2 cluster.
Tez : Scheduling. Playing the gaps, like Beethoven’s Fifth.
Vectorization: On average: 1.2x.
Except for a few outliers, ZLIB compression actually reduced performance for a 1TB dataset. Uncompressed was 1.3x faster than Compressed.
The situation reverses at the 10 TB level. The gains from decompression are actually offset by the disk-read time.
The long-tail in 10TB/q21 threw the scale of the graph off, so I’ve excluded it in the results.
Talk about file-coalesce, small-file generation, Namenode pressure and parallelism.
You don’t want to read an ORC stripe from a different node.
Talk about distcp -pgrub, for ORC files.
Mention that SNAPPY’s license is not Apache.
Also, Yoda.
At 100 nodes, it performs at 0.9x the 350 node performance.
We’ve seen Hive and Tez scale down for latency, scale up for data-size, and scale out across larger clusters.
Familiarity : We have an existing ecosystem with Hive, HCatalog, Pig and Oozie that delivers revenue to Yahoo today. It’s hard to rock the boat.
Community: The Apache Hive community is large, active and thriving. They’ve been solving issues with query latency for ages now. The switch to using the Tez execution engine was a solution within the Apache Hive project. This wasn’t a fork of Hive. This is Hive, proper.
Scale: We’ve seen Hive and Tez perform at scale. Heck, we’ve seen Pig perform on Tez.
Multitenant: Yahoo’s use-case is unique, and not just because of data-scale. There’s hundreds of active users and genuine multitenancy and security concerns.
Design: We think the Hive community has tackled the right problems first, rather than throw RAM at the problem.
Bucky Lasek at the X-Games in 2001. Notice where he’s looking… Not at the camera, but setting up his next trick.
Security: Kerberos support was patched in, after the benchmarks were run.
Multi-tenancy:
Data needs to be explicitly pinned into memory as RDDs.
In a multi-tenant system, how would pinning work? Eviction policy for data.
Compatibility:
Needs to work with Metastore versions 12 and 13. Shark’s gone to 0.11 just recently.
Integration with the rest of the stack: Oozie and Pig.
Overall, we wanted a solution that works with high-dynamic range. i.e. works well with small datasets (100s of GBs), as well as scale to multi-terabyte datasets. We have a familiar system that seems to fit that bill. It doesn’t quite rock the boat. It’s not perfect yet. There are bugs that we’re working on. And we still haven’t solved the problem of data-volume/BI.
By the way, I really like the idea of BlinkDB. I saw the JIRA.