October 2014 HUG : Hive On Spark

Published in: Technology

  1. Hive on Spark • Szehon Ho, Software Engineer at Cloudera, Apache Hive Committer • October 2014
  2. Background (Hive) • Apache Hive: a data query and management tool for distributed datasets, exposed via a SQL-like query language called HiveQL
  3. Background (Hive) • 2007-2013: MapReduce was the only distributed processing engine • Map() and Reduce() primitives were not designed for long data pipelines • Complex SQL-like queries are expressed inefficiently as many MR stages • Disk IO between MR jobs • Shuffle-sort between map and reduce • Diagram: Hive Query → Map()/Red() → Map()/Red() → Map()/Red(), with HDFS between stages
  4. Background (Hive) • 2013: the Hive community started work on Hive on Tez • Tez runs a query as a DAG execution graph • Diagram: Hive Query feeding one graph of Map() and Red() vertices (Map, Red, Map, Red, Red) that writes to HDFS
  5. Background (Spark) • Generalized distributed processing framework created in ~2011 by UC Berkeley AMPLab • Many advantages (community, ease of use); on track to succeed MapReduce
  6. Background (Spark) • Community momentum: • Already the most active project in the Hadoop ecosystem • June 2014: 255 contributors from 50 companies • First half of 2014: ~1,200 commits, 250,000 LOC changed • Integration with many Hadoop components, e.g. Pig, Flume, Mahout, Crunch, Solr, and now Hive
  7. Background (Spark) • Clean programming abstraction: Resilient Distributed Dataset (RDD) • A fault-tolerant dataset that can serve as a stage in a data pipeline • Created from an existing data set such as an HDFS file, or by transforming another RDD (RDDs can be chained) • Expressive APIs, much richer than MapReduce • Transformations: map, filter, groupBy • Actions: save • Persistence: cache • => More efficient representation of Hive queries
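To make the transformation/action split on this slide concrete, here is a minimal toy model in plain Python. It is not Spark's API: `ToyRDD`, `from_list`, `group_by` and `collect` are hypothetical names chosen for illustration, and fault tolerance and distribution are left out. The only point it shows is that transformations build a lazy chain and nothing runs until an action is called.

```python
from collections import defaultdict


class ToyRDD:
    """Hypothetical stand-in for an RDD: a lazily evaluated chain of steps."""

    def __init__(self, compute):
        # compute is a zero-argument function returning a fresh iterator
        self._compute = compute

    @classmethod
    def from_list(cls, data):
        return cls(lambda: iter(data))

    # Transformations: each returns a new ToyRDD and runs nothing yet.
    def map(self, f):
        return ToyRDD(lambda: (f(x) for x in self._compute()))

    def filter(self, pred):
        return ToyRDD(lambda: (x for x in self._compute() if pred(x)))

    def group_by(self, key):
        def compute():
            groups = defaultdict(list)
            for x in self._compute():
                groups[key(x)].append(x)
            return iter(groups.items())
        return ToyRDD(compute)

    # Action: forces the whole chain to execute.
    def collect(self):
        return list(self._compute())


if __name__ == "__main__":
    evens_doubled = (
        ToyRDD.from_list([1, 2, 3, 4])
        .filter(lambda x: x % 2 == 0)
        .map(lambda x: x * 2)
    )
    print(evens_doubled.collect())
```

Because each transformation only wraps the previous one, a multi-operator query can be expressed as one chain instead of a sequence of separately materialized stages.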
  8. Hive on Spark • Shark project: • AMPLab GitHub project, a fork of Hive • Not maintained by the Hive community, sunset in 2014 • Hive on Spark: • Done in the Hive community • Architecturally compatible: keeps the same physical abstraction for Hive on Spark as for Hive on Tez/MR • Code maintenance • Maximizes reuse of common functionality across execution engines
  9. Hive on Spark • User benefits: • Another seamless execution option (MR, Tez, Spark) • Leverages Spark clusters already coming into use for ML, graph processing, streaming, etc. • Continued efficiency and performance improvements via the strong Spark community
  10. High-Level Design • Common across engines: • HQL syntax • Tool integrations (auditing plugins, authorization, drivers, Thrift clients, UDF, StorageHandler) • Logical optimizations • Diagram: Hive Query → Logical Op Tree → TaskCompiler (MapRedCompiler, TezCompiler, SparkCompiler) → Task (MapRedTask, TezTask, SparkTask) → Work (MapRedWork, TezWork, SparkWork)
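The per-engine split on this slide can be sketched as a simple dispatch: one shared logical operator tree, and an engine-specific compiler that wraps it in that engine's Task and Work. The class names below mirror the labels on the slide, but the Python classes, the `plan` helper, and the dictionary shape are illustrative assumptions, not Hive's actual Java classes. The keys `mr`, `tez` and `spark` follow the values accepted by Hive's `hive.execution.engine` setting.

```python
class TaskCompiler:
    """Hypothetical base: turns a shared logical op tree into engine work."""
    task_name = None
    work_name = None

    def compile(self, op_tree):
        # The operators are untouched; only the physical wrapper differs.
        return {
            "task": self.task_name,
            "work": self.work_name,
            "operators": list(op_tree),
        }


class MapRedCompiler(TaskCompiler):
    task_name, work_name = "MapRedTask", "MapRedWork"


class TezCompiler(TaskCompiler):
    task_name, work_name = "TezTask", "TezWork"


class SparkCompiler(TaskCompiler):
    task_name, work_name = "SparkTask", "SparkWork"


# Assumed mapping from engine setting to compiler, for illustration only.
COMPILERS = {"mr": MapRedCompiler, "tez": TezCompiler, "spark": SparkCompiler}


def plan(op_tree, engine):
    """Pick the compiler for the chosen engine and compile the same tree."""
    return COMPILERS[engine]().compile(op_tree)


if __name__ == "__main__":
    tree = ["TableScan", "Filter", "Select", "GroupBy", "Select"]
    for engine in ("mr", "tez", "spark"):
        print(plan(tree, engine))
```

Everything upstream of the compiler (parsing, logical optimization, tool integrations) is shared, which is what keeps the three engines interchangeable from the user's side.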
  11. Simple Example • Hive Query: SELECT COUNT(*) FROM status_updates WHERE ds = '2014-10-01' GROUP BY region; • Operator Tree: TableScan (status_updates) → Filter (ds = '2014-10-01') → Select (region) → Group-By (count) → Select • GBY triggers a reduce boundary
  12. Simple Example • MapRed Work Tree: • Mapper: TableScan → Filter → Select → Group-By → ReduceSink • Map->Reduce ShuffleSort • Reducer: GroupBy → Select → FileOutput
  13. Simple Example • Spark Work Tree: RDD chain • mapPartition(): TableScan → Filter → Select → Group-By → ReduceSink • groupBy() (no sorting) • mapPartition(): GroupBy → Select → FileOutput
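The Spark work tree for the simple example can be imitated in plain Python to show where the grouping happens and why no sort is needed. This is a sketch under assumptions: rows are dictionaries with `ds` and `region` fields, the function names are hypothetical, and the map-side Group-By is modeled as a partial count per partition. It does not use Spark or Hive code.

```python
from collections import Counter, defaultdict


def map_side(partition, ds):
    """TableScan -> Filter -> Select -> Group-By (partial) -> ReduceSink."""
    partial = Counter(row["region"] for row in partition if row["ds"] == ds)
    return list(partial.items())  # (region, partial_count) pairs


def shuffle_group(pairs_per_partition):
    """groupBy(): bring equal keys together; no ordering is required."""
    grouped = defaultdict(list)
    for pairs in pairs_per_partition:
        for region, n in pairs:
            grouped[region].append(n)
    return grouped


def reduce_side(grouped):
    """GroupBy (final) -> Select: sum the partial counts per region."""
    return {region: sum(counts) for region, counts in grouped.items()}


def count_by_region(partitions, ds):
    return reduce_side(shuffle_group(map_side(p, ds) for p in partitions))


if __name__ == "__main__":
    partitions = [
        [{"ds": "2014-10-01", "region": "us"}, {"ds": "2014-09-30", "region": "us"}],
        [{"ds": "2014-10-01", "region": "eu"}, {"ds": "2014-10-01", "region": "us"}],
    ]
    print(count_by_region(partitions, "2014-10-01"))
```

The result here is keyed by region for readability; the query itself selects only the counts. A count per group only needs equal keys to meet in one place, which is why a hash-based grouping is enough where MapReduce would always shuffle-sort.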
  14. Join Example • Hive Query: SELECT * FROM (SELECT * FROM src WHERE src.key < 10) src1 JOIN (SELECT * FROM src WHERE src.key < 10) src2 ORDER BY src1.key; • Operator Tree: two branches of TableScan → Filter → Select feed Join → Select → Sort → Select • Join and Sort each trigger a reduce boundary
  15. Join Example • MapRed Work Tree: 2 MapReduce works • Work 1: two Maps (TableScan → Filter → Select → ReduceSink) → ShuffleSort → Reduce (Join → Select → FileOutput) • Disk IO via HDFS between the two works • Work 2: Map (TableScan → ReduceSink (Sort)) → ShuffleSort → Reduce (Select → FileOutput)
  16. Join Example • Spark Work Tree: RDD transform chain • Two mapPartition() stages (each TableScan → Filter → Select → ReduceSink) → union() → Partition/Sort() → mapPartition() (Join → Select → ReduceSink) → sortBy() → mapPartition() (Select → FileOutput) • No spill to disk
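The join example's chain can be walked through in plain Python as a rough sketch. Assumptions: `src` rows are `(key, value)` tuples, the function names are hypothetical, and because the query has no ON clause the join is treated as a cross product of the two filtered branches. Partitioning is omitted, so this shows only the order of the steps, not their distributed execution.

```python
from itertools import product


def scan_filter(src, tag):
    """TableScan -> Filter (key < 10) -> Select -> ReduceSink.

    Each surviving row is tagged with the branch it came from.
    """
    return [(tag, row) for row in src if row[0] < 10]


def join_example(src):
    # union(): both tagged branches flow into a single dataset
    unioned = scan_filter(src, 0) + scan_filter(src, 1)

    # Join: split the rows back out by tag; with no ON clause every
    # src1 row pairs with every src2 row
    left = [row for tag, row in unioned if tag == 0]
    right = [row for tag, row in unioned if tag == 1]
    joined = [l + r for l, r in product(left, right)]

    # sortBy(): ORDER BY src1.key, the first column of each joined row
    return sorted(joined, key=lambda r: r[0])


if __name__ == "__main__":
    src = [(12, "x"), (3, "c"), (1, "a")]
    for row in join_example(src):
        print(row)
```

In the MapReduce plan the joined rows are written to HDFS and read back by a second job before sorting; in the chained form the sort consumes the join output directly.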
  17. Improvements to Spark • Reduce-side join: SPARK-2978 • Spark had group() and sort(), but no partition+sort equivalent of the MR-style shuffle-sort • Can help other applications migrate from MapReduce to Spark • Remote SparkContext (pushed down to the AM) • Concurrent SparkContexts are not allowed in the client application process • SparkContext is heavyweight • Spark monitoring APIs • Elastic scaling of Spark applications: SPARK-3174
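The partition+sort primitive mentioned for SPARK-2978 can be illustrated without Spark. The sketch below is an assumption-laden toy: `partition_and_sort` and `reduce_side_join` are hypothetical names, keys are integers, and partitions are plain lists. It shows the idea of an MR-style shuffle-sort: hash-partition records by key, sort inside each partition, then join by scanning runs of equal keys.

```python
from itertools import groupby


def partition_and_sort(pairs, num_partitions):
    """Hash-partition (key, value) pairs, then sort each partition by key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions].append((key, value))
    for part in partitions:
        part.sort(key=lambda kv: kv[0])
    return partitions


def reduce_side_join(left, right, num_partitions=2):
    """Inner join of two (key, value) lists using partition+sort."""
    # Tag each value with its source so both inputs share one shuffle.
    tagged = [(k, (0, v)) for k, v in left] + [(k, (1, v)) for k, v in right]
    out = []
    for part in partition_and_sort(tagged, num_partitions):
        # Sorted input means equal keys are adjacent, so one pass suffices.
        for key, run in groupby(part, key=lambda kv: kv[0]):
            rows = [tv for _, tv in run]
            lefts = [v for t, v in rows if t == 0]
            rights = [v for t, v in rows if t == 1]
            out.extend((key, l, r) for l in lefts for r in rights)
    return out


if __name__ == "__main__":
    left = [(1, "a"), (2, "b"), (3, "c")]
    right = [(2, "x"), (3, "y"), (3, "z"), (4, "w")]
    print(sorted(reduce_side_join(left, right)))
```

Sorting within each partition rather than globally is what distinguishes this from a plain sort: each reducer sees its keys in order without any cross-partition ordering cost.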
  18. Community • Thanks to contributors from many organizations • Follow our progress on HIVE-7292 • Thank you!
