Hive @ Bucharest Java User Group


Introduction to Hive

Published in: Software, Technology

  1. 1. HIVE Bucharest Java User Group July 3, 2014
  2. 2. whoami
     • Developer with SQL Server team since 2001
     • Apache contributor
         • Hive
         • Hadoop core (security)
     • stackoverflow user 105929s
     • @rusanu
  3. 3. What is HIVE
     • Data warehouse for querying and managing large datasets
     • A query engine that uses Hadoop MapReduce for execution
     • A SQL abstraction for creating MapReduce algorithms
     • SQL interface to HDFS data
     • Developed at Facebook
         • VLDB 2009: Hive - A Warehousing Solution Over a Map-Reduce Framework
     • ASF top-level project since September 2010
  4. 4. What is Hadoop
     Hadoop Core
     • Distributed execution engine
         • MapReduce
         • YARN
         • Tez
     • Distributed file system: HDFS
     • Tools for administering the execution engine and HDFS
     • Libraries for writing MapReduce jobs
     Hadoop Ecosystem
     • HBase (BigTable)
     • Pig (scripting query language)
     • Hive (SQL)
     • Storm (stream processing)
     • Flume (data aggregator)
     • Sqoop (RDBMS bulk data transfer)
     • Oozie (workflow scheduling)
     • Mahout (machine learning)
     • Falcon (data lifecycle)
     • Spark, Cassandra etc. (not based on Hadoop)
  5. 5. How does Hadoop work
     • JOB: binary code (Java JAR), configuration XML, any additional file(s)
         • The job gets uploaded into the cluster file system (usually HDFS)
     • SPLIT: a fragment of data (file) to be processed
         • The input data is broken into several splits
     • TASK: execution of the job JAR to process a split
         • The scheduler attempts to execute the task near the data split
     • MAP: takes unsorted, unclustered data and outputs clustered data
     • SHUFFLE: takes clustered data and produces sorted data
     • REDUCE: takes sorted data and produces the desired output
     • Synergies
         • Processing locality: execute the code near the data storage, avoid data transfer
         • Algorithm scalability:
             • The map phase can scale out because it assumes no sorting and no clustering
             • Reduce-phase algorithms are easy to write when the data is guaranteed to be sorted and clustered
         • Execution reliability (monitoring, retry, preemptive execution etc.)
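The MAP, SHUFFLE and REDUCE phases described on this slide can be illustrated with a small in-memory word count. This is only a sketch in plain Java (9 or later): it does not use the Hadoop API, and the class and method names (`MiniMapReduce`, `map`, `shuffle`, `reduce`, `wordCount`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of MAP -> SHUFFLE -> REDUCE; not the Hadoop API.
final class MiniMapReduce {

    // MAP: emit one (word, 1) pair per word found in a split.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // SHUFFLE: group the map output by key; the TreeMap keeps keys sorted.
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // REDUCE: sum the values collected for each key.
    static TreeMap<String, Integer> reduce(TreeMap<String, List<Integer>> grouped) {
        TreeMap<String, Integer> result = new TreeMap<>();
        grouped.forEach((key, values) ->
                result.put(key, values.stream().mapToInt(Integer::intValue).sum()));
        return result;
    }

    static TreeMap<String, Integer> wordCount(List<String> splits) {
        List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
        for (String split : splits) {
            // In a real cluster each split would be handled by its own map task.
            mapOutput.addAll(map(split));
        }
        return reduce(shuffle(mapOutput));
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("hive on hadoop", "hadoop runs hive jobs")));
    }
}
```

The map step never needs to see sorted or grouped input, which is why it can be spread over many tasks; the reduce step only works because the shuffle has already brought equal keys together.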
  6. 6. MapReduce
  7. 7. How does Hive work
     • SQL submitted via CLI or HiveServer(2)
     • Metadata describing tables is stored in an RDBMS
     • Driver compiles/optimizes the execution plan
     • Plan and execution engine are submitted to Hadoop as a job
     • MR invokes the Hive execution engine, which executes the plan
     [Diagram labels: Hive; Hadoop; Metastore (RDBMS); HCatalog; HDFS; Driver (compiles, optimizes); MapReduce; Tasks; Splits; CLI; HiveServer2; ODBC; JDBC; Shell; Job Tracker; Beeswax]
  8. 8. Hive Query execution
     • Compilation/optimization results in an AST containing operators, e.g.:
         • FetchOperator: scans source data (the input split)
         • SelectOperator: projects and computes column values
         • GroupByOperator: aggregate functions (SUM, COUNT etc.)
         • JoinOperator: joins
     • The plan forms a DAG of MR jobs
     • The plan tree is serialized (Kryo)
     • The Hive driver dispatches jobs
         • Multiple stages can result in multiple jobs
     • Task execution picks up the plan and starts iterating over it
         • MR emits values (rows) into the topmost operator (Fetch)
         • Rows propagate down the tree
         • ReduceSinkOperator emits map output for the shuffle
         • Each operator implements both a map-side and a reduce-side algorithm
             • It executes the one appropriate for the current task
     • MR does the shuffle, and many operators rely on it as part of their algorithm
         • E.g. SortOperator, GroupByOperator
     • Multi-stage queries create intermediate output and the driver submits a new job to continue with the next stage
     • Tez execution: map-reduce-reduce, usually eliminates multiple stages (more later)
     • Vectorized execution mode emits batches of rows (1024 rows)
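The idea that rows are pushed into the topmost operator and then propagate down the tree can be sketched with a tiny push-based pipeline. This is an illustration under stated assumptions only: the `Operator` interface and the `filter`, `select` and `sink` helpers below are hypothetical and are not Hive's actual operator classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical push-based operator chain; not Hive's real Operator hierarchy.
final class OperatorPipeline {

    // Every operator receives a row and may forward rows to its child.
    interface Operator {
        void process(Object[] row);
    }

    // Forwards only the rows that satisfy the predicate (a WHERE clause).
    static Operator filter(Predicate<Object[]> predicate, Operator child) {
        return row -> {
            if (predicate.test(row)) {
                child.process(row);
            }
        };
    }

    // Rewrites each row with a projection (the SELECT list).
    static Operator select(Function<Object[], Object[]> projection, Operator child) {
        return row -> child.process(projection.apply(row));
    }

    // Terminal operator that collects whatever reaches the bottom of the tree.
    static Operator sink(List<Object[]> out) {
        return out::add;
    }

    // Plan for: SELECT name FROM input WHERE amount > 10
    static List<Object[]> run(List<Object[]> input) {
        List<Object[]> out = new ArrayList<>();
        Operator plan = filter(row -> (Integer) row[1] > 10,
                select(row -> new Object[] { row[0] },
                        sink(out)));
        for (Object[] row : input) {
            plan.process(row); // the framework pushes each row into the top operator
        }
        return out;
    }

    public static void main(String[] args) {
        List<Object[]> input = new ArrayList<>();
        input.add(new Object[] { "a", 5 });
        input.add(new Object[] { "b", 20 });
        for (Object[] row : run(input)) {
            System.out.println(row[0]);
        }
    }
}
```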
  9. 9. Interacting with Hive
     • hive from the shell prompt launches the CLI
         • Run SQL commands interactively
         • Can execute a batch of commands from a file
         • Results displayed in the console
     • hiveserver2 is a daemon
         • JDBC and ODBC drivers for applications to connect to it
         • Queries submitted via JDBC/ODBC
         • Query results returned as JDBC/ODBC result sets
     • Other applications embed the Hive driver, e.g. Beeswax
  10. 10. Hive QL
     • The dialect of SQL supported by Hive
         • More similar to the MySQL dialect than to ANSI SQL
         • Drive toward ANSI-92 compliance (syntax, data types)
     • Query language: SELECT
     • DDL: CREATE/ALTER/DROP DATABASE/TABLE/PARTITION
     • DML: only bulk insert operations
         • LOAD
         • INSERT
         • HIVE-5317: Implement insert, update, and delete in Hive with full ACID support
  11. 11. Supported data types
     • Numeric
         • tinyint, smallint, int, bigint
         • float, double
         • decimal(precision, scale)
     • Date/Time
         • timestamp
         • date
     • Character types
         • string
         • char(size)
         • varchar(size)
     • Misc. types
         • boolean
         • binary
     • Complex types
         • ARRAY<type>
         • MAP<type, type>
         • STRUCT<name:type, name:type>
         • UNIONTYPE<type, type, type>
  12. 12. Storage Formats
     • Text
         • ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';
         • Gzip or Bzip2 compression is detected automatically
     • SEQUENCEFILE (default map-reduce output)
     • ORC files
         • Columnar, compressed
         • Certain features are only enabled on ORC
     • Parquet
         • Columnar, compressed
     • Arbitrary SerDe (Serializer/Deserializer)
  13. 13. DDL/Databases/Tables
     • CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
         [COMMENT database_comment]
         [LOCATION hdfs_path]
         [WITH DBPROPERTIES (property_name=property_value, ...)];
     • CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
         [(col_name data_type [COMMENT col_comment], ...)]
         [COMMENT table_comment]
         [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
         [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
         [SKEWED BY (col_name, col_name, ...) ON ([(col_value, col_value, ...), ...|col_value, col_value, ...]) [STORED AS DIRECTORIES]]
         [ [ROW FORMAT row_format] [STORED AS file_format] | STORED BY '' [WITH SERDEPROPERTIES (...)] ]
         [LOCATION hdfs_path]
         [TBLPROPERTIES (property_name=property_value, ...)]
         [AS select_statement]
     • EXTERNAL tables are not owned by Hive (DROP TABLE leaves the files in place)
     • Partitioning, bucketing and skew control allow precise control of file size (important for achieving balanced MR splits)
     • ALTER TABLE … EXCHANGE PARTITION allows a fast (metadata-only) move of data
     • ALTER TABLE … ADD PARTITION adds to the Hive metadata a partition that already exists on disk
     • MSCK REPAIR TABLE … scans on-disk files to discover partitions and synchronizes the Hive metadata
  14. 14. Data Load
     • LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
         • File format must match table format (no transformations)
     • INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
       INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;
         • OVERWRITE replaces the data in the table (TRUNCATE + INSERT)
         • INTO appends the data (leaves existing data intact)
         • Dynamic partitioning
             • Creates new partitions based on data
     • INSERT OVERWRITE [LOCAL] DIRECTORY directory1 [ROW FORMAT row_format] [STORED AS file_format] SELECT ... FROM ...
         • Writes a file without creating a Hive table
  15. 15. Hive SELECT syntax
         [WITH CommonTableExpression (, CommonTableExpression)*]
         SELECT [ALL | DISTINCT] select_expr, select_expr, ...
         FROM table_reference
         [WHERE where_condition]
         [GROUP BY col_list]
         [HAVING having_condition]
         [CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list] ]
         [LIMIT number]
  16. 16. SELECT features
     • REGEX column specifications
         • SELECT `(ds|hr)?+.+` FROM sales
     • Virtual columns
         • INPUT__FILE__NAME
         • BLOCK__OFFSET__INSIDE__FILE
     • Sampling
         • SELECT … FROM source TABLESAMPLE (BUCKET 3 OUT OF 32 ON rand());
         • SELECT … FROM source TABLESAMPLE (1 PERCENT);
         • SELECT … FROM source TABLESAMPLE (10M);
         • SELECT … FROM source TABLESAMPLE (100 ROWS);
  17. 17. Clustering and Distribution
     • ORDER BY
         • In strict mode it must be followed by LIMIT, because a single final reducer is required to sort all output
     • SORT BY
         • Only guarantees the order of rows within each final reducer
         • With multiple final reducers the result is only partially ordered
     • DISTRIBUTE BY
         • Specifies how to distribute the rows to reducers, but does not require order
     • CLUSTER BY
         • Syntactic sugar for SORT BY and DISTRIBUTE BY
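The difference between a total order (ORDER BY) and a per-reducer order (DISTRIBUTE BY plus SORT BY) can be shown with a small simulation. This is a sketch in plain Java, not Hive code: the class `DistributeSortSketch` is hypothetical, and routing rows with a simple hash modulo the reducer count is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Simulates DISTRIBUTE BY key SORT BY key over a number of reducers.
final class DistributeSortSketch {

    static List<List<Integer>> distributeAndSort(List<Integer> keys, int reducers) {
        List<List<Integer>> perReducer = new ArrayList<>();
        for (int i = 0; i < reducers; i++) {
            perReducer.add(new ArrayList<>());
        }
        // DISTRIBUTE BY: equal keys always land on the same reducer.
        for (int key : keys) {
            perReducer.get(Math.floorMod(Integer.hashCode(key), reducers)).add(key);
        }
        // SORT BY: every reducer sorts only the rows it received.
        for (List<Integer> rows : perReducer) {
            Collections.sort(rows);
        }
        return perReducer;
    }

    // ORDER BY: a single final reducer sees everything, so the order is total.
    static List<Integer> orderBy(List<Integer> keys) {
        return distributeAndSort(keys, 1).get(0);
    }

    public static void main(String[] args) {
        List<Integer> keys = List.of(7, 2, 9, 4, 1, 8);
        System.out.println(distributeAndSort(keys, 2)); // each inner list is sorted
        System.out.println(orderBy(keys));              // one globally sorted list
    }
}
```

With two reducers each output list is sorted, but concatenating them does not give a globally sorted result, which is the partial ordering the slide warns about.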
  18. 18. Subqueries
     • In the FROM clause
         • SELECT … FROM (SELECT … FROM …) AS alias …
     • In the WHERE clause
         • SELECT … FROM … WHERE EXISTS (SELECT …)
         • SELECT … FROM … WHERE col IN (SELECT …)
         • Must appear on the right-hand side in expressions
         • IN/NOT IN must project exactly one column
         • EXISTS/NOT EXISTS must contain correlated predicates
             • Otherwise they’re JOINs
     • Reference to the parent query is only supported in WHERE clause subqueries
         • References are of course required for correlated subqueries
  19. 19. Common Table Expressions (CTE)
     • Supported for SELECT and INSERT
     • Recursive syntax is not supported
     • with q1 as (select key, value from src where key = '5')
       from q1
       insert overwrite table s1
       select *;
  20. 20. Lateral Views
     • Aka CROSS APPLY
     • Apply a table function to every row
         • SELECT … FROM table LATERAL VIEW explode(column) exTable AS exCol;
     • OUTER clause to include rows for which the function generates nothing
         • Similar to ANSI-SQL OUTER APPLY
     • Built-in table functions (UDTF):
         • explode(ARRAY)
         • explode(MAP)
         • inline(STRUCT)
         • json_tuple(json, k1, k2, …)
             • Returns k1, k2 from json as rows
         • parse_url_tuple(url, part, part, …)
             • Returns URL host, path, query:key
         • posexplode(ARRAY)
             • explode + index
         • stack(n, v1, v2, …, vk)
             • n rows, each with k/n columns
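What LATERAL VIEW explode(...) does to each input row, with and without OUTER, can be sketched as a flat-map over an in-memory table. This is a plain-Java illustration, not Hive's UDTF interface; `LateralViewSketch` and its method are hypothetical names, and rows are rendered as comma-separated strings only to keep the example short.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simulates SELECT id, item FROM t LATERAL VIEW [OUTER] explode(items) x AS item.
final class LateralViewSketch {

    static List<String> lateralViewExplode(Map<String, List<String>> table, boolean outer) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> row : table.entrySet()) {
            if (row.getValue().isEmpty()) {
                // Without OUTER the row disappears; with OUTER it is kept with a NULL.
                if (outer) {
                    out.add(row.getKey() + ",NULL");
                }
                continue;
            }
            // One output row per array element, joined back to the source row.
            for (String element : row.getValue()) {
                out.add(row.getKey() + "," + element);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> table = new LinkedHashMap<>();
        table.put("u1", List.of("a", "b"));
        table.put("u2", List.of());
        System.out.println(lateralViewExplode(table, false));
        System.out.println(lateralViewExplode(table, true));
    }
}
```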
  22. 22. GROUPING SETS, CUBE, ROLLUP
     • GROUPING SETS
         • Logical equivalent of running the same query with different GROUP BY clauses and then taking the UNION of the results
         • SELECT SUM(a) … GROUP BY a,b GROUPING SETS (a, (a,b))
           is equivalent to
           SELECT SUM(a) … GROUP BY a UNION SELECT SUM(a) … GROUP BY a,b;
     • GROUP BY … WITH CUBE
         • Equivalent of adding all possible GROUPING SETS
         • GROUP BY a,b,c WITH CUBE
           is equivalent to
           GROUP BY a,b,c GROUPING SETS ((a,b,c), (a,b), (a,c), (b,c), (a), (b), (c), ())
     • GROUP BY … WITH ROLLUP
         • Equivalent of adding all the GROUPING SETS that lead with the GROUP BY columns
         • GROUP BY a,b,c WITH ROLLUP
           is equivalent to
           GROUP BY a,b,c GROUPING SETS ((a,b,c), (a,b), (a), ())
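The expansion of WITH CUBE and WITH ROLLUP into explicit grouping sets is mechanical, so it can be sketched in a few lines. The code below is plain Java with hypothetical names (`GroupingSetsSketch`, `rollup`, `cube`); it only enumerates the sets and assumes the usual SQL semantics in which both expansions include the empty (grand total) set.

```java
import java.util.ArrayList;
import java.util.List;

// Enumerates the grouping sets implied by WITH ROLLUP and WITH CUBE.
final class GroupingSetsSketch {

    // ROLLUP: every leading prefix of the column list, down to the empty set.
    static List<List<String>> rollup(List<String> cols) {
        List<List<String>> sets = new ArrayList<>();
        for (int n = cols.size(); n >= 0; n--) {
            sets.add(new ArrayList<>(cols.subList(0, n)));
        }
        return sets;
    }

    // CUBE: every subset of the column list, 2^n sets in total.
    static List<List<String>> cube(List<String> cols) {
        List<List<String>> sets = new ArrayList<>();
        int n = cols.size();
        for (int mask = (1 << n) - 1; mask >= 0; mask--) {
            List<String> set = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                // Bit (n - 1 - i) of the mask decides whether column i is in this set.
                if ((mask & (1 << (n - 1 - i))) != 0) {
                    set.add(cols.get(i));
                }
            }
            sets.add(set);
        }
        return sets;
    }

    public static void main(String[] args) {
        System.out.println(rollup(List.of("a", "b", "c")));
        System.out.println(cube(List.of("a", "b", "c")));
    }
}
```

For three columns `rollup` yields 4 sets and `cube` yields 8, which shows why CUBE on many columns gets expensive quickly.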
  23. 23. XPath functions
     • xpath_...(xml_string, xpath_expression_string)
         • xpath_long returns a long
         • xpath_short returns a short
         • xpath_string returns a string
         • …
     • xpath(xml, xpath) returns an array of strings
         • SELECT xpath(col, '//configuration/property[name="foo"]/value')
  24. 24. User Defined Functions
         package com.example.hive.udf;
         import org.apache.hadoop.hive.ql.exec.UDF;
         import org.apache.hadoop.io.Text;
         // Simple UDF that lower-cases a string; NULL input yields NULL.
         public final class Lower extends UDF {
             public Text evaluate(final Text s) {
                 if (s == null) { return null; }
                 return new Text(s.toString().toLowerCase());
             }
         }
         CREATE FUNCTION myLower AS 'com.example.hive.udf.Lower' USING JAR 'hdfs:///path/to/jar';
     • Aggregate functions are also possible, but more complicated
         • Must track map side vs. reduce side and ‘merge’ the intermediate results
  25. 25. TRANSFORM
     • Plug custom scripts into query execution
     • SELECT TRANSFORM(stuff) USING 'script' AS (thing1 INT, thing2 INT)
     • FROM (
           FROM pv_users
           MAP pv_users.userid
           USING 'map_script'
           CLUSTER BY key) map_output
       INSERT OVERWRITE TABLE pv_users_reduced
           REDUCE map_output.key, map_output.value
           USING 'reduce_script'
           AS date, count;
  26. 26. Hive Indexes
     • Indexes are aimed at reducing the data read by range scans
         • Fewer splits, fewer map tasks, less IO
         • Relies on Predicate Push Down
     • Order guarantee can simplify certain algorithms
         • GROUP BY aggregations can use streaming aggregates vs. hash aggregates
     • Hive does not need/use indexes for ‘seek’ like OLTP RDBMSs
     • Indexes are in almost every respect just another table with the same data
     • The query optimizer uses rewrite rules to leverage indexes
     • Indexes are not automatically maintained on LOAD/INSERT
  27. 27. JOIN optimizations
     • Difficult problem in MR
     • Naïve join relies on the MR shuffle to partition the data
         • Reducers can implement the JOIN simply by merging the input, as it is sorted
         • Costs a size-of-data copy through the MR shuffle
     • MapJoin
         • For one big table (facts) and several small tables (dimensions)
         • Read all the small tables, hash them
         • Serialize the hash into the HDFS distributed cache
             • Done by the driver as stage-0, before launching the actual query
         • The MapJoinOperator loads the small tables in memory
         • The JOIN can be performed on-the-fly, on the map side, avoiding the big shuffle
         • Requires live RAM; task JVM memory settings must allow for enough memory
     • Sort Merge Bucket (SMB) join
         • Between big tables that are bucketed by the same key
         • And the bucketing key is also the join key
         • The map task scans buckets from multiple tables in parallel
             • MR only knows about one of them
             • For the rest, the SMBJoinOperator simulates an MR environment to scan them
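The MapJoin idea (hash the small table once, then probe it while streaming the big table, with no shuffle) reduces to a classic in-memory hash join. The sketch below is plain Java and does not use Hive's MapJoinOperator; `MapJoinSketch`, `buildHashTable` and `probe` are hypothetical names, and the two-column row layouts are assumptions for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory hash join: small dimension table hashed, big fact table streamed.
final class MapJoinSketch {

    // Build phase: hash the small table on its join key (id -> name).
    static Map<Integer, String> buildHashTable(List<Map.Entry<Integer, String>> smallTable) {
        Map<Integer, String> hash = new HashMap<>();
        for (Map.Entry<Integer, String> row : smallTable) {
            hash.put(row.getKey(), row.getValue());
        }
        return hash;
    }

    // Probe phase: each fact row {dimensionId, amount} is joined as it streams by.
    static List<String> probe(List<int[]> bigTable, Map<Integer, String> hash) {
        List<String> joined = new ArrayList<>();
        for (int[] fact : bigTable) {
            String name = hash.get(fact[0]);
            if (name != null) { // inner join: facts without a match are dropped
                joined.add(name + ":" + fact[1]);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<Integer, String> hash = buildHashTable(
                List.of(Map.entry(1, "books"), Map.entry(2, "music")));
        List<int[]> facts = List.of(new int[] { 1, 10 }, new int[] { 3, 5 }, new int[] { 2, 7 });
        System.out.println(probe(facts, hash));
    }
}
```

The whole hash table has to fit in the memory of every map task, which is the reason for the JVM memory caveat on the slide.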
  28. 28. Partitioning in Hive
     • CREATE TABLE … PARTITIONED BY (…)
     • A separate data directory is created for each distinct combination of partitioning column values
     • Can result in many small files if abused
     • Also use bucketing
         • CREATE TABLE … PARTITIONED BY (…) CLUSTERED BY (…) SORTED BY (…) INTO … BUCKETS
         • Bucketing helps many queries
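How partition values turn into directories, and how a row might be assigned to a bucket, can be sketched as follows. This is plain Java with hypothetical names (`PartitionPathSketch`, `partitionPath`, `bucketFor`). The `column=value` directory layout is Hive's convention; the bucket formula (non-negative hash modulo the bucket count) is an assumption for illustration and may not match Hive's hashing for every data type.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Shows the directory-per-partition layout and a simple bucket assignment.
final class PartitionPathSketch {

    // One sub-directory level per partition column, in declaration order.
    static String partitionPath(String tableLocation, LinkedHashMap<String, String> spec) {
        StringBuilder path = new StringBuilder(tableLocation);
        for (Map.Entry<String, String> column : spec.entrySet()) {
            path.append('/').append(column.getKey()).append('=').append(column.getValue());
        }
        return path.toString();
    }

    // Assumed bucket rule: clear the sign bit of the hash, then take the modulo.
    static int bucketFor(int key, int numBuckets) {
        return (Integer.hashCode(key) & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        LinkedHashMap<String, String> spec = new LinkedHashMap<>();
        spec.put("ds", "2014-07-03");
        spec.put("hr", "12");
        System.out.println(partitionPath("/warehouse/sales", spec));
        System.out.println(bucketFor(42, 8));
    }
}
```

Every distinct combination of partition values produces its own directory, so a high-cardinality partition column multiplies the number of directories and files; bucketing caps the file count per partition at the declared number of buckets.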
  29. 29. How to get started with Hive
     • HDInsight 3.1 comes with Hive 0.13
     • Hortonworks Sandbox (VM) has Hive 0.13
     • Cloudera CDH 5 VM comes with Hive 0.12
     • Build it yourself
     • Mailing list: