Presented by Xuefu Zhang during the August 2017 Hive User Group Meeting. You can view the live stream of the meetup here: https://www.youtube.com/watch?v=L0nGKKjqdDs
Hive on Spark, production experience @Uber
1. Hive on Spark - Production Experience
@Uber
Xuefu Zhang, Staff Engineer, Data Infra
2. Outline
● Hive at Uber
● Current Status
● Issues
● Future Work
● Conclusions
● Q&A
3. Hive at Uber
● Hundreds of active users daily
● Over 20K queries per day
● P50-P90 execution time: 2 min to 20 min
● Used for ETL and data analytics
● MR + Tez + Spark
4. Hive at Uber (cont’d)
● Efficiency is top priority
● Cluster operates at capacity
● Faster data, faster ETL
● Consolidation of technology, operations, and expertise
5. Why Hive on Spark
● Significantly less disk IO on HDFS
● Utilize memory for better performance
● Higher success rate with Uber’s workload
● Better supportability, observability, and UI
● Spark is widely adopted in our infrastructure
6. Why Hive on Spark (cont’d)
● On average 2X performance improvement
● On average 1.5X efficiency improvement
● Significantly reduce RPC calls to HDFS namenode (5X)
● Significantly reduce temp disk space on HDFS (10X)
7. Current Status
● By H1 2017,
○ All ad-hoc queries are on Hive on Spark
○ 15% of ETL pipelines are migrated
○ Current Hive traffic breakdown: 50% MR, 40% Spark, 10% Tez
● By H2 2017
○ All workloads will be on Hive on Spark
○ MR usage will be the exception
8. Issues
● Infrastructural issues
○ IPv4 & IPv6 (avoid mixing them)
○ Network timeout (spark.network.timeout=800s)
○ Try to keep homogeneous nodes in the cluster
● Spark dynamic allocation issues
○ Backported many patches to Spark 1.6
○ spark.dynamicAllocation.maxExecutors=2000
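The network-timeout and dynamic-allocation settings above would typically live in spark-defaults.conf. A minimal sketch, assuming dynamic allocation is enabled cluster-wide (the slides only state the timeout and max-executors values; the other two lines are the standard prerequisites for dynamic allocation in Spark 1.6):

```properties
# spark-defaults.conf (sketch; only the first two values are from the slides)
spark.network.timeout=800s
spark.dynamicAllocation.maxExecutors=2000
# Assumed prerequisites for dynamic allocation:
spark.dynamicAllocation.enabled=true
spark.shuffle.service.enabled=true
```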
9. Issues
● Hive issues
○ Unbounded memory usage for orderBy
○ Concurrency issues related to static variables
○ Spark executor and driver memory settings
○ Hive RPC server and client connection problems
10. Issues (cont’d)
○ Stats-related issues
■ Missing/inaccurate stats
■ No stats for nested columns
○ Performance issues
■ MapJoin small table size
■ Operator stats used for mapjoin
11. Issues (cont’d)
● Other Spark issues
○ Spark driver performance
○ Spark event queue size
○ Unbounded memory usage for groupby
○ Spark history server
12. Configurations
● Some of our configurations
spark.scheduler.listenerbus.eventqueue.size=50000
hive.spark.client.connect.timeout=5s
hive.spark.client.server.connect.timeout=1h
spark.locality.wait=0s
hive.spark.use.op.stats=false
hive.spark.use.file.size.for.mapjoin=true
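These properties can also be set per session rather than cluster-wide. A hedged sketch of trying them out from Beeline or the Hive CLI before committing them to hive-site.xml (the property names and values are from the slide; the session-level SET mechanism is standard Hive):

```sql
-- Per-session overrides in Hive (values taken from the slide above)
SET hive.spark.client.connect.timeout=5s;
SET hive.spark.client.server.connect.timeout=1h;
SET hive.spark.use.op.stats=false;
SET hive.spark.use.file.size.for.mapjoin=true;
```

Disabling operator stats and using file size for mapjoin sizing addresses the "operator stats used for mapjoin" issue listed earlier: file sizes are a more reliable signal than possibly missing or inaccurate operator statistics.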
16. Future Work (cont’d)
● Improve Hive
○ Stats support for nested columns
○ Predicate pushdown for nested columns
○ Dynamic partition pruning
○ Full vectorization
○ Optimizations that currently only work for Tez
17. Conclusions
● HoS helps us on query performance and resource efficiency
● HoS significantly reduces load on HDFS
● HoS helps us consolidate technologies
● Migration to HoS is fairly straightforward and transparent for most users
● However, there are catches in deployment and production
● More effort is on the way