Presented by Xuefu Zhang during the August 2017 Hive User Group Meeting. You can view the live stream of the meetup here: https://www.youtube.com/watch?v=L0nGKKjqdDs
Hive on Spark, production experience @Uber
1. Hive on Spark - Production Experience
@Uber
Xuefu Zhang, Staff Engineer, Data Infra
2. Outline
● Hive at Uber
● Current Status
● Issues
● Future Work
● Conclusions
● Q&A
3. Hive at Uber
● Hundreds of active users daily
● Over 20K queries per day
● P50-P90 execution time: 2 min to 20 min
● Used for ETL and data analytics
● MR + Tez + Spark
4. Hive at Uber (cont’d)
● Efficiency is top priority
● Cluster operates at capacity
● Faster data, faster ETL
● Consolidation of technology, operations, and expertise
5. Why Hive on Spark
● Significantly less disk IO on HDFS
● Utilize memory for better performance
● Higher success rate with Uber’s workload
● Better supportability, observability, and UI
● Spark is widely adopted in our infrastructure
6. Why Hive on Spark (cont’d)
● On average 2X performance improvement
● On average 1.5X efficiency improvement
● Significantly reduce RPC calls to HDFS namenode (5X)
● Significantly reduce temp disk space on HDFS (10X)
7. Current Status
● By H1 2017,
○ All ad-hoc queries are on Hive on Spark
○ 15% of ETL pipelines are migrated
○ Current Hive traffic breakdown: 50% MR, 40% Spark, 10% Tez
● By H2 2017
○ All workloads will be on Hive on Spark
○ MR usage will be the exception
8. Issues
● Infrastructural issues
○ IPv4 & IPv6 (avoid mixing them)
○ Network timeout (spark.network.timeout=800s)
○ Try to keep homogeneous nodes in the cluster
● Spark dynamic allocation issues
○ Backported many patches to Spark 1.6
○ spark.dynamicAllocation.maxExecutors=2000
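The network-timeout and dynamic-allocation settings above would typically live in spark-defaults.conf. A minimal sketch, assuming dynamic allocation is enabled cluster-wide (the slides only state the timeout and max-executors values; the other two lines are the standard prerequisites for dynamic allocation in Spark 1.6):

```properties
# spark-defaults.conf (sketch; only the first two values are from the slides)
spark.network.timeout=800s
spark.dynamicAllocation.maxExecutors=2000
# Assumed prerequisites for dynamic allocation:
spark.dynamicAllocation.enabled=true
spark.shuffle.service.enabled=true
```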
9. Issues
● Hive issues
○ Unbounded memory usage for orderBy
○ Concurrency issues related to static variables
○ Spark executor and driver memory settings
○ Hive RPC server and client connection problems
10. Issues (cont’d)
○ Stats-related issues
■ Missing/inaccurate stats
■ No stats for nested columns
○ Performance issues
■ MapJoin small table size
■ Operator stats used for mapjoin
11. Issues (cont’d)
● Other Spark issues
○ Spark driver performance
○ Spark event queue size
○ Unbounded memory usage for groupby
○ Spark history server
12. Configurations
● Some of our configurations
spark.scheduler.listenerbus.eventqueue.size=50000
hive.spark.client.connect.timeout=5s
hive.spark.client.server.connect.timeout=1h
spark.locality.wait=0s
hive.spark.use.op.stats=false
hive.spark.use.file.size.for.mapjoin=true
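These properties can also be set per session rather than cluster-wide. A hedged sketch of trying them out from Beeline or the Hive CLI before committing them to hive-site.xml (the property names and values are from the slide; the session-level SET mechanism is standard Hive):

```sql
-- Per-session overrides in Hive (values taken from the slide above)
SET hive.spark.client.connect.timeout=5s;
SET hive.spark.client.server.connect.timeout=1h;
SET hive.spark.use.op.stats=false;
SET hive.spark.use.file.size.for.mapjoin=true;
```

Disabling operator stats and using file size for mapjoin sizing addresses the "operator stats used for mapjoin" issue listed earlier: file sizes are a more reliable signal than possibly missing or inaccurate operator statistics.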
16. Future Work (cont’d)
● Improve Hive
○ Stats support for nested columns
○ Predicate pushdown for nested columns
○ Dynamic partition pruning
○ Full vectorization
○ Optimizations that currently only work for Tez
17. Conclusions
● HoS helps us on query performance and resource efficiency
● HoS significantly reduces load on HDFS
● HoS helps us consolidate technologies
● Migration to HoS is fairly straightforward and transparent for most users
● However, there are catches in deployment and production
● More effort is on the way