Scalable HiveServer2 as a Service

HiveServer2 provides a multi-tenant service endpoint for executing Hive queries concurrently. It supports authentication and authorization, serves as a JDBC endpoint for users to connect and run queries via various tools, maintains sessions and warm containers for faster query processing, provides caching at multiple levels, and much more. In other words, it is an integral component of any Hive deployment. However, HiveServer2 deployments often face performance and reliability issues, at times leading to catastrophic failures. At Qubole, we have augmented HiveServer2 to utilize the capabilities of the cloud to offer an enterprise-ready, scalable and stable HiveServer2 (or HS2) service.

At Qubole, the cloud is our primary deployment platform, and the HS2 experience there has been enhanced to scale automatically with the customer's workload: our solution adds and gracefully removes HS2 instances as required, making the HS2 service not only self-sufficient at scale but also fault-tolerant. We have implemented load balancing of queries based on the resource utilization of HS2 instances to provide a reliable, efficient and cost-effective solution. A health monitoring service, built on top of this scalable HS2 service and informed by past learnings from running HS2 in customer deployments, acts as the foundation for a battle-tested, enterprise-ready HS2 solution. In this talk, we will share the details of this implementation and the challenges faced in providing an auto-scalable, highly performant and reliable HS2 experience in the cloud.

Topics include:

* Workload-aware autoscaling for HS2 clusters.

* Agent-based adaptive load balancing of Hive queries on multi-tenant HS2 clusters.

* Durability monitoring using failure semantics and automated measures to provide reliability.

* Enterprise-level security for HS2 in the cloud.

* Metrics, monitoring and alerting around the HS2 service.

Scalable HiveServer2 as a Service

  1. Scalable HiveServer2 as a service
     Shreya Bhatia, Nitin Khandelwal
  2. Agenda
     ● Hive as a service
     ● HiveServer2 as a service
     ● Issues faced in HiveServer2 deployments
     ● Proposed solution
     ● Status and future work
  3. Built for Anyone who Uses Data
     Analysts | Data Scientists | Data Engineers | Data Admins
     Optimize performance, cost, and scale through automation, control and orchestration of big data workloads.
     A Single Platform for Any Use Case: ETL & Reporting | Ad Hoc Queries | Machine Learning | Streaming | Vertical Apps
     Open Source Engines, Optimized for the Cloud
     Native integration with multiple cloud providers
  4. Qubole Data Service (QDS)
     ● QDS is a multi-tenant platform.
     ● Manages the lifetime of multiple clusters.
     ● The QDS control plane runs on highly scalable tiers of nodes.
     ● Clusters auto-scale based on workload.
  5. Hive as a service (diagram: cluster scale up / scale down)
  6. Hive as a service (contd.)
     ● A Hive JVM's resource requirement is proportional to the query/metadata.
     ● The Hadoop cluster auto-scales based on the workload.
     ● Hive client JVMs run on a highly scalable, multi-tenant tier in the control plane.
       ○ A fork of the Eureka project (by Netflix) is used for service discovery, autoscaling and load balancing.
  7. HiveServer2
     A Thrift-based service that enables clients to execute queries against Hive.
     ● Supports multi-client concurrency, authentication and authorization.
     ● Better support for open-API clients such as JDBC and ODBC.
     ● Improves query performance (container reuse, Tez session reuse, etc.).
     ● Provides caching at multiple levels.
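
To make the JDBC endpoint described on this slide concrete, below is a minimal sketch of connecting to a single HS2 instance from Java using the standard Hive JDBC driver; the hostname, port and credentials are placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class Hs2JdbcExample {
        public static void main(String[] args) throws Exception {
            // The driver class ships with the hive-jdbc artifact.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            // Placeholder host/port/database; HS2 listens on port 10000 by default.
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hs2-host.example.com:10000/default", "user", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }
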
  8. HiveServer2 as a service (diagram: cluster scale up / scale down)
  9. HiveServer2 as a service: Shortcomings
     ● Single point of failure.
     ● Becomes a bottleneck as the Hadoop cluster auto-scales.
     ● Admins struggle to determine optimal memory settings.
     ● Lacks isolation.
     ● Potential memory leaks.
  10. Proposed Solution: HiveServer2 Cluster (diagram)
  11. HiveServer2 Cluster (contd.)
      ● ZooKeeper: HS2 instances running on worker nodes register under a ZooKeeper namespace to make themselves discoverable.
      ● Megamind: a stateless service responsible for load balancing, autoscaling and monitoring the HS2 cluster.
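
For context on the ZooKeeper registration above: stock Apache Hive supports dynamic service discovery, where HS2 instances register under a ZooKeeper namespace and JDBC clients resolve a live instance from the quorum. The sketch below shows that standard mechanism with placeholder hosts and namespace; the Megamind service described in this deck layers load balancing, autoscaling and monitoring on top of this kind of registration.

    import java.sql.Connection;
    import java.sql.DriverManager;

    public class Hs2ZkDiscoveryExample {
        public static void main(String[] args) throws Exception {
            // Server side (hive-site.xml), enabling registration under a ZooKeeper namespace:
            //   hive.server2.support.dynamic.service.discovery = true
            //   hive.zookeeper.quorum                          = zk1:2181,zk2:2181,zk3:2181
            //   hive.server2.zookeeper.namespace               = hiveserver2
            //
            // Client side: point the JDBC URL at the ZooKeeper quorum instead of a single
            // HS2 host; the driver resolves a live HS2 instance from the namespace.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            String url = "jdbc:hive2://zk1:2181,zk2:2181,zk3:2181/default;"
                       + "serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2";
            try (Connection conn = DriverManager.getConnection(url, "user", "")) {
                System.out.println("Connected via ZooKeeper discovery: " + !conn.isClosed());
            }
        }
    }
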
  12. HiveServer2 Cluster: Query flow (diagram)
  13. HiveServer2 Cluster: Advantages
      ● Horizontal scaling
        ○ Reduces GC issues.
      ● High availability.
      ● Easy to configure.
      ● Achieves limited isolation.
  14. HiveServer2: Issues due to a long-running JVM
      ● HS2 performance degrades over time due to native/heap memory leaks, GC issues, etc.
      ● Fix the leaks?
        ○ New leaks sneak in over releases.
        ○ Native memory leaks are difficult to catch and fix.
        ○ Fixes need a new release/deploy to take effect.
      ● Workaround: reduce the HS2 JVM's life span?
  15. HiveServer2 Cluster: Periodic rotation of worker nodes
      ● Restart the HS2 JVM periodically?
        ○ Need to take care of running queries.
        ○ Reduced cluster capacity.
      ● Recycle old nodes with fresh ones after a fixed interval.
        ○ Upscale first.
        ○ Bring the node out of rotation.
          ■ Decommissioning state, similar to a DataNode.
  16-19. HiveServer2 Cluster: Periodic rotation (contd.): diagram slides walking through the rotation sequence (figures not captured in this transcript).
  20. HiveServer2 Cluster: Periodic rotation (contd.)
      Downsides:
      - The decommissioning state can last a very long time.
      - Increased cost.
      - State-management overhead.
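
The rotation flow on slides 15-20 can be summarized as: upscale first, mark the old node as decommissioning so no new queries are routed to it, wait for in-flight queries to drain, then terminate it. The sketch below is only an illustration of that sequence under assumed interfaces (Node, Hs2Cluster and their methods are hypothetical), not Qubole's implementation.

    import java.time.Duration;
    import java.util.List;

    // Illustrative sketch of periodic rotation: upscale first, then drain the old node.
    // Node, Hs2Cluster and their methods are hypothetical placeholders.
    public class PeriodicRotationSketch {

        interface Node {
            void markDecommissioning();   // e.g. flip its state in ZooKeeper so the LB skips it
            int runningQueries();         // queries still executing on this node
            void terminate();
        }

        interface Hs2Cluster {
            Node launchFreshNode();                 // upscale so capacity never drops
            List<Node> nodesOlderThan(Duration maxAge);
        }

        static void rotate(Hs2Cluster cluster, Duration maxAge, Duration pollInterval)
                throws InterruptedException {
            for (Node old : cluster.nodesOlderThan(maxAge)) {
                cluster.launchFreshNode();       // step 1: add replacement capacity
                old.markDecommissioning();       // step 2: stop routing new queries to the old node
                while (old.runningQueries() > 0) {
                    Thread.sleep(pollInterval.toMillis());  // step 3: wait for in-flight queries to drain
                }
                old.terminate();                 // step 4: remove the old node
            }
        }
    }
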
  21. HiveServer2 Cluster: Autoscaling
      Megamind decides when to upscale/downscale the HS2 cluster, based on:
      ● Healthy Nodes: total healthy nodes in the cluster.
      ● Load on HiveServer2 cluster: number of running/waiting queries, CPU, memory, etc.
      ● Buffer Capacity: number of free worker threads available to take up new queries.
      ● Minimum Worker Nodes: minimum worker nodes configured.
      ● Maximum Worker Nodes: maximum worker nodes configured.
  22. HiveServer2 Cluster: Autoscaling - Determining the health of a node
      ● CPU Usage: CPU load on the HS2 node.
      ● Memory Usage: heap/native memory consumed by HS2.
      ● Free Worker Threads: number of free worker threads available for new queries.
      ● GC Pauses: time spent in full GC pauses.
      ● Query Failure Rate: number of query failures.
      ● State of HS2 JVM: Running, Decommissioning, Booting, etc.
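
A node-health predicate combining the metrics in this table might look like the sketch below; the NodeMetrics fields, state names and thresholds are hypothetical placeholders, not Qubole's actual health criteria.

    // Illustrative node-health check over the per-node metrics listed on slide 22.
    public class NodeHealthSketch {

        enum Hs2State { RUNNING, BOOTING, DECOMMISSIONING }

        // Hypothetical snapshot of one HS2 node's metrics.
        record NodeMetrics(Hs2State state, double cpuUsage, double heapUsage,
                           int freeWorkerThreads, long gcPauseMillisPerMin,
                           double queryFailureRate) {}

        static boolean isHealthy(NodeMetrics m) {
            return m.state() == Hs2State.RUNNING
                    && m.cpuUsage() < 0.90               // not CPU saturated
                    && m.heapUsage() < 0.85              // heap head-room left
                    && m.freeWorkerThreads() > 0         // can still accept queries
                    && m.gcPauseMillisPerMin() < 5_000   // not stuck in full GC pauses
                    && m.queryFailureRate() < 0.20;      // failures below an alert threshold
        }
    }
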
  23. HiveServer2 Cluster: Autoscaling
      ● Maintain buffer capacity in the cluster.
      ● Unhealthy nodes are brought out of rotation (temporarily).
        ○ The cluster upscales to maintain capacity.
      ● Additional states such as Booting and Decommissioning are maintained in ZooKeeper.
      ● Easy to upscale aggressively.
      ● Aggressive downscaling is relatively hard: queries can run for a long time.
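
A rough sketch of an upscale/downscale decision combining the metrics from slides 21-23 follows; the Metrics fields and thresholds are hypothetical placeholders and do not represent Qubole's actual policy.

    // Illustrative autoscaling decision: upscale aggressively, downscale conservatively.
    public class AutoscalingSketch {

        // Hypothetical cluster-level metrics snapshot.
        record Metrics(int healthyNodes, int freeWorkerThreads, int runningQueries,
                       int minWorkerNodes, int maxWorkerNodes, int bufferThreads) {}

        enum Decision { UPSCALE, DOWNSCALE, NO_OP }

        static Decision decide(Metrics m) {
            // Upscale whenever the free-thread buffer drops below target or unhealthy
            // nodes have pushed the cluster under its minimum size, up to the maximum.
            if (m.healthyNodes() < m.minWorkerNodes()
                    || m.freeWorkerThreads() < m.bufferThreads()) {
                return m.healthyNodes() < m.maxWorkerNodes() ? Decision.UPSCALE : Decision.NO_OP;
            }
            // Downscale conservatively: only when well above the minimum and the buffer
            // is comfortably oversized, since long-running queries make draining slow.
            if (m.healthyNodes() > m.minWorkerNodes()
                    && m.freeWorkerThreads() > 2 * m.bufferThreads()
                    && m.runningQueries() == 0) {
                return Decision.DOWNSCALE;
            }
            return Decision.NO_OP;
        }
    }
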
  24. HiveServer2 Cluster: Load Balancing
      ● Random distribution.
      ● Relay queries based on the state and runtime metrics of HS2 nodes.
      ● Pick the node with the least load?
        ○ Leads to higher decommissioning time and cost.
      ● The load balancer maintains the following order while forwarding requests:
        ○ Nodes with average load
        ○ Nodes with low load
        ○ Nodes with high load (raise an alarm)
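
The forwarding order on this slide (average-load nodes first, then low-load nodes, and high-load nodes only while raising an alarm) could look roughly like the sketch below; NodeStats, the load thresholds and alarm() are hypothetical.

    import java.util.Comparator;
    import java.util.List;
    import java.util.Optional;

    // Illustrative node-selection order from slide 24.
    public class LoadBalancingSketch {

        record NodeStats(String host, double load) {}   // load normalized to [0, 1]

        static final double LOW = 0.3, HIGH = 0.8;       // hypothetical thresholds

        static Optional<NodeStats> pick(List<NodeStats> healthyNodes) {
            // 1. Prefer nodes under average load (keeps low-load nodes easy to drain).
            Optional<NodeStats> average = healthyNodes.stream()
                    .filter(n -> n.load() >= LOW && n.load() < HIGH)
                    .min(Comparator.comparingDouble(NodeStats::load));
            if (average.isPresent()) return average;

            // 2. Fall back to low-load nodes.
            Optional<NodeStats> low = healthyNodes.stream()
                    .filter(n -> n.load() < LOW)
                    .min(Comparator.comparingDouble(NodeStats::load));
            if (low.isPresent()) return low;

            // 3. Last resort: a high-load node, while raising an alarm.
            Optional<NodeStats> high = healthyNodes.stream()
                    .min(Comparator.comparingDouble(NodeStats::load));
            high.ifPresent(n -> alarm("All HS2 nodes heavily loaded, routing to " + n.host()));
            return high;
        }

        static void alarm(String message) {
            System.err.println("[ALERT] " + message);
        }
    }
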
  25. HiveServer2 Cluster: Failure Recovery
      ● HiveServer2 failure
        ○ Load is distributed among the other nodes; the cluster upscales.
        ○ Beeline retries on a different node.
      ● ZooKeeper failure
        ○ Megamind caches the state of the HS2 daemons until ZooKeeper recovers.
      ● Megamind failure
        ○ Upcoming queries are randomly distributed among healthy nodes (discovered via ZooKeeper).
        ○ No autoscaling until Megamind restarts.
  26. HiveServer2 Cluster: Monitoring
      ● HiveServer2 uses a Codahale-based metrics system for capturing JVM- and application-level metrics.
      ● Monitoring is based on metrics collected from the HS2 JVM.
      ● Automated health checks / corrective actions via Megamind.
      ● Dashboard for distributed HS2 monitoring.
      ● Alerts for critical failures.
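
As an illustration of the Codahale-style (Dropwizard Metrics) instrumentation mentioned on this slide, here is a minimal, self-contained example that registers a JVM gauge and a query timer and reports them; the metric names are placeholders and this is not HiveServer2's actual metric wiring.

    import com.codahale.metrics.ConsoleReporter;
    import com.codahale.metrics.Gauge;
    import com.codahale.metrics.MetricRegistry;
    import com.codahale.metrics.Timer;
    import java.util.concurrent.TimeUnit;

    public class MetricsSketch {
        public static void main(String[] args) throws Exception {
            MetricRegistry registry = new MetricRegistry();

            // Gauge tracking current heap usage, similar in spirit to HS2's JVM metrics.
            registry.register("jvm.heap.used.bytes",
                    (Gauge<Long>) () -> Runtime.getRuntime().totalMemory()
                                      - Runtime.getRuntime().freeMemory());

            // Timer around a unit of work (e.g. a query), giving rates and latency percentiles.
            Timer queryTimer = registry.timer("queries.execution.time");
            try (Timer.Context ignored = queryTimer.time()) {
                Thread.sleep(50);  // stand-in for query execution
            }

            // Dump metrics to the console; a real deployment would report to JMX or a dashboard.
            ConsoleReporter reporter = ConsoleReporter.forRegistry(registry)
                    .convertDurationsTo(TimeUnit.MILLISECONDS)
                    .build();
            reporter.report();
        }
    }
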
  27. Summary
      ● Issues faced by HiveServer2 deployments
      ● HiveServer2 Cluster
      ● Periodic rotation of HS2 nodes
      ● Autoscaling in the HS2 cluster
      ● Durability, failure recovery and monitoring
  28. Current Status and Future Work
      ● Current status
        ○ The HiveServer2 cluster is in beta.
      ● Future work
        ○ Predictive autoscaling
        ○ Better load balancing
        ○ Recommissioning
  29. References
      ● https://go.qubole.com/CA---WP---Workload-Aware-Auto-Scaling_Landing-Page.html
      ● https://github.com/Netflix/eureka
      ● https://github.com/dropwizard/dropwizard
      ● HIVE-19821
      ● TEZ-3991
      ● HIVE-7935
  30. Questions?
