Hive @ Bucharest Java User Group

HIVE
Bucharest Java User Group
July 3, 2014

whoami
• Developer with SQL Server team since 2001
• Apache contributor
• Hive
• Hadoop core (security)
• stackoverflow user 105929s
• @rusanu

What is HIVE
• Data warehouse for querying and managing large datasets
• A query engine that uses Hadoop MapReduce for execution
• A SQL abstraction for creating MapReduce algorithms
• SQL interface to HDFS data
• Developed at Facebook
VLDB 2009: Hive - A Warehousing Solution Over a Map-Reduce Framework
• ASF top-level project since September 2010

What is Hadoop
Hadoop Core
• Distributed execution engine
• MapReduce
• YARN
• TEZ
• Distributed File System HDFS
• Tools for administering the execution engine and HDFS
• Libraries for writing MapReduce jobs
Hadoop Ecosystem
• HBase (BigTable)
• Pig (scripting query language)
• Hive (SQL)
• Storm (Stream Processing)
• Flume (Data Aggregator)
• Sqoop (RDBMS bulk data transfer)
• Oozie (workflow scheduling)
• Mahout (machine learning)
• Falcon (data lifecycle)
• Spark, Cassandra etc (not based on Hadoop)
hadoopecosystemtable.github.io

How does Hadoop work
• JOB: binary code (Java JAR), configuration XML, any additional file(s)
• The job gets uploaded into the cluster file system (usually HDFS)
• SPLIT: a fragment of data (file) to be processed
• The input data is broken into several splits
• TASK: execution of the job JAR to process a split
• Scheduler attempts to execute the task near the data split
• MAP: takes unsorted, unclustered data and outputs clustered data
• SHUFFLE: takes clustered data and produces sorted data
• REDUCE: takes sorted data and produces desired output
• Synergies
• Processing locality: execute the code near the data storage, avoid data transfer
• Algorithm scalability:
• The map phase can scale out because it assumes no sorting and no clustering
• Reduce-phase algorithms are easy to write when the data is guaranteed to be sorted and clustered
• Execution reliability (monitoring, retry, preemptive execution, etc.)
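
The MAP, SHUFFLE and REDUCE steps above can be sketched in plain Java, without any Hadoop dependency. This is only an in-memory illustration of the data flow (a word count); the class and method names are made up for the example, and a real job would run each phase as distributed tasks over splits.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

final class MiniMapReduce {
    // MAP: turn one line of a split into (word, 1) pairs; no input ordering is assumed.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(w.toLowerCase(), 1));
            }
        }
        return out;
    }

    // SHUFFLE: group the pairs by key and present the keys in sorted order.
    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // REDUCE: all values of one key arrive together, so summing them is trivial.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs the three phases in sequence over a list of input lines.
    static SortedMap<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(map(line));
        }
        SortedMap<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(mapped).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {hadoop=2, hive=1, jobs=1, on=1, runs=2}
        System.out.println(run(Arrays.asList("hive runs on hadoop", "hadoop runs jobs")));
    }
}
```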

How does Hive work
• SQL submitted via CLI or Hiveserver(2)
• Metadata describing tables stored in RDBMS
• Driver compiles/optimizes execution plan
• Plan and execution engine submitted to Hadoop as job
• MR invokes Hive execution engine which executes plan
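[Diagram: clients (shell CLI, JDBC, ODBC, Beeswax) connect through the CLI or Hiveserver2 to Hive (Driver: compiles, optimizes; Metastore backed by an RDBMS; HCatalog). Hive submits jobs to Hadoop (MapReduce Job Tracker), where tasks process splits stored in HDFS.]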

Hive Query execution
• Compilation/Optimization results in a plan tree of operators, e.g.:
• FetchOperator: scans source data (the input split)
• SelectOperator: projects column values, computes expressions
• GroupByOperator: aggregate functions (SUM, COUNT, etc.)
• JoinOperator: joins
• The plan forms a DAG of MR jobs
• The plan tree is serialized (Kryo)
• Hive Driver dispatches jobs
• Multiple stages can result in multiple jobs
• Task execution picks up the plan and starts iterating the plan
• MR emits values (rows) into the topmost operator (Fetch)
• Rows propagate down the tree
• ReduceSinkOperator emits map output for shuffle
• Each operator implements both a map side and a reduce side algorithm
• Executes the one appropriate for the current task
• MR does the shuffle; many operators rely on it as part of their algorithm
• E.g. SortOperator, GroupByOperator
• Multi-stage queries create intermediate output, and the driver submits a new job to continue with the next stage
• TEZ execution: map-reduce-reduce, usually eliminates multiple stages (more later)
• Vectorized execution mode emits batches of rows (1024 rows)

Interacting with Hive
• hive from shell prompt launches CLI
• Run SQL command interactively
• Can execute a batch of commands from a file
• Results displayed in console
• hiveserver2 is a daemon
• JDBC and ODBC drivers for applications to connect to it
• Queries submitted via JDBC/ODBC
• Query results as JDBC/ODBC resultsets
• Other applications embed the Hive driver, e.g. Beeswax
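
A minimal sketch of the JDBC path, assuming the hive-jdbc driver jar is on the classpath and a hiveserver2 instance is listening on localhost:10000 (the usual default port). The host, database, user name and query are placeholders chosen for the example, not values from the talk. Without a driver or server the program only reports the connection error.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

final class HiveJdbcSketch {
    // Builds a hiveserver2 JDBC URL of the form jdbc:hive2://host:port/database.
    static String url(String host, int port, String database) {
        return "jdbc:hive2://" + host + ":" + port + "/" + database;
    }

    public static void main(String[] args) {
        String jdbcUrl = url("localhost", 10000, "default"); // placeholder endpoint
        // try-with-resources closes the result set, statement and connection.
        try (Connection con = DriverManager.getConnection(jdbcUrl, "hive", "");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SHOW TABLES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1)); // one table name per row
            }
        } catch (SQLException e) {
            // Reached when no driver is on the classpath or no server is running.
            System.err.println("Could not query " + jdbcUrl + ": " + e.getMessage());
        }
    }
}
```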

Hive QL
• The dialect of SQL supported by Hive
• More similar to MySQL dialect than ANSI-SQL
• Drive toward ANSI-92 compliance (syntax, data types)
• Query language: SELECT
• DDL: CREATE/ALTER/DROP DATABASE/TABLE/PARTITION
• DML: Only bulk insert operations
• LOAD
• INSERT
• HIVE-5317 Implement insert, update, and delete in Hive with full ACID support

Supported data types
• Numeric
• tinyint, smallint, int, bigint
• float, double
• decimal(precision, scale)
• Date/Time
• timestamp
• date
• Character types
• string
• char(size)
• varchar(size)
• Misc. types
• boolean
• binary
• Complex types
• ARRAY<type>
• MAP<type, type>
• STRUCT<name:type, name:type>
• UNIONTYPE<type, type, type>

Storage Formats
• Text
• ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n';
• Gzip or Bzip2 is automatically detected
• SEQUENCEFILE (default map-reduce output)
• ORC Files
• Columnar, Compressed
• Certain features only enabled on ORC
• Parquet
• Columnar, Compressed
• Arbitrary SerDe (Serializer Deserializer)

DDL/Databases/Tables
• CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
[COMMENT database_comment]
[LOCATION hdfs_path]
[WITH DBPROPERTIES (property_name=property_value, ...)];
• CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[SKEWED BY (col_name, col_name, ...) ON ([(col_value, col_value, ...), ...|col_value, col_value, ...])
[STORED AS DIRECTORIES]
[
[ROW FORMAT row_format] [STORED AS file_format]
| STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]
]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)]
[AS select_statement]
• EXTERNAL tables are not owned by Hive (DROP TABLE leaves the files in place)
• Partitioning, Bucketing, Skew control allow precise control of file size (important for processing to achieve balanced MR splits)
• ALTER TABLE … EXCHANGE PARTITION allows for fast (metadata only) move of data.
• ALTER TABLE … ADD PARTITION adds to Hive metadata a partition already existing on disk
• MSCK REPAIR TABLE … scans on-disk files to discover partitions and synchronizes Hive metadata
• https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL

Data Load
• LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename
[PARTITION (partcol1=val1, partcol2=val2 ...)]
• File format must match table format (no transformations)
• INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2
...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)]
select_statement1 FROM from_statement;
• OVERWRITE replaces the data in the table (TRUNCATE + INSERT)
• INTO appends the data (leaves existing data intact)
• Dynamic Partitioning
• Creates new partitions based on data
• https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-Dynamic-PartitionInsert
• INSERT OVERWRITE [LOCAL] DIRECTORY directory1
[ROW FORMAT row_format] [STORED AS file_format]
SELECT ... FROM ...
• Writes a file without creating Hive table

Hive SELECT syntax
[WITH CommonTableExpression (, CommonTableExpression)*]
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]
[CLUSTER BY col_list
| [DISTRIBUTE BY col_list] [SORT BY col_list]
]
[LIMIT number]

SELECT features
• REGEX column specifications
• SELECT `(ds|hr)?+.+` FROM sales
• Virtual columns
• INPUT__FILE__NAME
• BLOCK__OFFSET__INSIDE__FILE
• Sampling
• SELECT … FROM source TABLESAMPLE (BUCKET 3 OUT OF 32 ON rand());
• SELECT … FROM source TABLESAMPLE (1 PERCENT);
• SELECT … FROM source TABLESAMPLE (10M);
• SELECT … FROM source TABLESAMPLE (100 ROWS);

Clustering and Distribution
• ORDER BY
• Total order: a single final reducer must sort all output, so strict mode requires a LIMIT clause
• SORT BY
• Only guarantees the order of rows within each reducer
• With multiple final reducers the overall result is only partially ordered
• DISTRIBUTE BY
• Specifies how rows are distributed to reducers, but does not impose any order
• CLUSTER BY
• Syntactic sugar for DISTRIBUTE BY and SORT BY on the same columns
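
The difference between a total order and a per-reducer order can be sketched in plain Java. The snippet below imitates DISTRIBUTE BY (hash the key to pick a reducer) followed by SORT BY (sort inside each reducer); the modulo-of-hash routing is an assumption made for illustration, not a statement about Hive's exact partitioner.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class DistributeSortSketch {
    // Routes each key to one of `reducers` buckets, then sorts every bucket on its own.
    static List<List<Integer>> distributeAndSort(List<Integer> keys, int reducers) {
        List<List<Integer>> buckets = new ArrayList<>();
        for (int i = 0; i < reducers; i++) {
            buckets.add(new ArrayList<>());
        }
        for (int k : keys) {
            // DISTRIBUTE BY: equal keys always land on the same reducer.
            buckets.get(Math.floorMod(Integer.hashCode(k), reducers)).add(k);
        }
        for (List<Integer> bucket : buckets) {
            // SORT BY: order is only established inside one reducer.
            Collections.sort(bucket);
        }
        return buckets;
    }

    public static void main(String[] args) {
        // prints [[2, 4, 8], [1, 5, 7]]: each reducer is sorted, the concatenation is not
        System.out.println(distributeAndSort(Arrays.asList(5, 2, 8, 1, 4, 7), 2));
    }
}
```

With a single reducer the same code yields one fully sorted list, which is the situation ORDER BY forces.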

Subqueries
• In FROM clause
• SELECT … FROM (SELECT … FROM …) AS alias …
• In WHERE clause
• SELECT … FROM … WHERE EXISTS (SELECT …)
• SELECT … FROM … WHERE col IN (SELECT …)
• Must appear on the right-hand side in expressions
• IN/NOT IN must project exactly one column
• EXISTS/NOT EXISTS must contain correlated predicates
• Otherwise they’re JOINs
• Reference to parent query is only supported in WHERE clause subqueries
• References of course required for correlated sub-queries

Common Table Expressions (CTE)
• Supported for SELECT and INSERT
• Do not support recursive syntax
• with q1 as (
select key, value from src where key = '5')
from q1
insert overwrite table s1
select *;

Lateral Views
• Aka CROSS APPLY
• Apply a table function to every row
• SELECT … FROM table
LATERAL VIEW explode(column) exTable AS exCol;
• OUTER clause to include rows for which the function generates nothing
• Similar to ANSI-SQL OUTER APPLY
• Built-in table functions (UDTF):
• explode(ARRAY)
• explode(MAP)
• inline(ARRAY<STRUCT>)
• json_tuple(json, k1, k2,…)
• Returns the values of k1, k2 from json as columns of one row
• parse_url_tuple(url, part, part, …)
• Returns URL host, path, query:key
• posexplode(ARRAY)
• explode + index
• stack(n, v1, v2, …, vk)
• n rows, each with k/n columns
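
What LATERAL VIEW explode does to rows can be imitated in a few lines of Java. This is a sketch of the semantics only: each input row is modelled as a key plus a list, the output rows are plain strings, and all names are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class LateralViewSketch {
    // Emits one "key,element" row per array element.
    // With outer=true a row whose array is empty still produces one row with null.
    static List<String> explode(Map<String, List<String>> rows, boolean outer) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> row : rows.entrySet()) {
            if (row.getValue().isEmpty()) {
                if (outer) {
                    out.add(row.getKey() + ",null");
                }
                continue;
            }
            for (String element : row.getValue()) {
                out.add(row.getKey() + "," + element);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> rows = new LinkedHashMap<>();
        rows.put("u1", Arrays.asList("a", "b"));
        rows.put("u2", Collections.<String>emptyList());
        System.out.println(explode(rows, false)); // prints [u1,a, u1,b]
        System.out.println(explode(rows, true));  // prints [u1,a, u1,b, u2,null]
    }
}
```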

Windowing and analytical functions
• LEAD, LAG, FIRST_VALUE, LAST_VALUE
• RANK, ROW_NUMBER, DENSE_RANK, PERCENT_RANK, NTILE
• OVER clause for aggregates
• PARTITION BY
• SELECT SUM(a) OVER (PARTITION BY b)
• ORDER BY
• SELECT SUM(a) OVER (PARTITION BY b ORDER BY c)
• window specification
• SELECT SUM(a) OVER (PARTITION BY b ORDER BY c ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING)
• WINDOW clause
• SELECT SUM(b) OVER w
FROM t
WINDOW w AS (PARTITION BY b ORDER BY c ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING)
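
To make the ROWS frame concrete, here is a small Java sketch of SUM(a) OVER (... ROWS BETWEEN n PRECEDING AND m FOLLOWING) for a single partition whose rows are already in ORDER BY order. It is a naive illustration of the semantics, not how Hive evaluates windows.

```java
import java.util.Arrays;

final class WindowSumSketch {
    // For every row i, sums the values from row i-preceding to row i+following,
    // clipping the frame at the partition boundaries.
    static long[] windowSum(long[] a, int preceding, int following) {
        long[] out = new long[a.length];
        for (int i = 0; i < a.length; i++) {
            int from = Math.max(0, i - preceding);
            int to = Math.min(a.length - 1, i + following);
            long sum = 0;
            for (int j = from; j <= to; j++) {
                sum += a[j];
            }
            out[i] = sum;
        }
        return out;
    }

    public static void main(String[] args) {
        // ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING over 1..5
        // prints [3, 6, 9, 12, 9]
        System.out.println(Arrays.toString(windowSum(new long[] {1, 2, 3, 4, 5}, 1, 1)));
    }
}
```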

GROUPING SETS, CUBE, ROLLUP
• GROUPING SETS
• Logical equivalent of running the same query with different GROUP BY clauses and then UNION-ing the results
• SELECT SUM(a) … GROUP BY a,b GROUPING SETS (a, (a,b))
is equivalent to
SELECT SUM(a) … GROUP BY a
UNION
SELECT SUM(a) … GROUP BY a,b;
• GROUP BY … WITH CUBE
• Equivalent of adding all possible GROUPING SETS
• GROUP BY a,b,c WITH CUBE
is equivalent to
GROUP BY a,b,c GROUPING SETS ((a,b,c), (a,b), (a,c), (b,c), (a), (b), (c), ())
• GROUP BY … WITH ROLLUP
• Equivalent of adding all the GROUPING SETS that are a leading prefix of the GROUP BY columns, down to the empty set (grand total)
• GROUP BY a,b,c WITH ROLLUP
is equivalent to
GROUP BY a,b,c GROUPING SETS ((a,b,c), (a,b), (a), ())
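
The expansion of CUBE and ROLLUP into grouping sets is mechanical, which a short Java sketch can show: ROLLUP yields every leading prefix of the column list down to the empty set (the grand total, as in the Hive language manual), and CUBE yields every subset. The class and method names are illustrative only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class GroupingSetsSketch {
    // ROLLUP: every leading prefix of the columns, from the full list down to ().
    static List<List<String>> rollup(List<String> cols) {
        List<List<String>> sets = new ArrayList<>();
        for (int n = cols.size(); n >= 0; n--) {
            sets.add(new ArrayList<>(cols.subList(0, n)));
        }
        return sets;
    }

    // CUBE: every subset of the columns (2^n sets), enumerated with a bitmask.
    static List<List<String>> cube(List<String> cols) {
        List<List<String>> sets = new ArrayList<>();
        for (int mask = (1 << cols.size()) - 1; mask >= 0; mask--) {
            List<String> set = new ArrayList<>();
            for (int i = 0; i < cols.size(); i++) {
                if ((mask & (1 << i)) != 0) {
                    set.add(cols.get(i));
                }
            }
            sets.add(set);
        }
        return sets;
    }

    public static void main(String[] args) {
        List<String> cols = Arrays.asList("a", "b", "c");
        System.out.println(rollup(cols)); // prints [[a, b, c], [a, b], [a], []]
        System.out.println(cube(cols).size()); // prints 8
    }
}
```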

XPath functions
• xpath_...(xml_string, xpath_expression_string)
• xpath_long returns a long
• xpath_short returns a short
• xpath_string returns a string
• …
• xpath(xml, xpath) returns an array of strings
• SELECT xpath(col, '//configuration/property[name="foo"]/value')

User Defined Functions
package com.example.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Simple UDF: Hive calls evaluate() once per row.
public final class Lower extends UDF {
    public Text evaluate(final Text s) {
        if (s == null) { return null; }
        return new Text(s.toString().toLowerCase());
    }
}

CREATE FUNCTION myLower AS 'com.example.hive.udf.Lower' USING JAR 'hdfs:///path/to/jar';
• Aggregate functions also possible, but more complicated
• Must track map side vs. reduce side and 'merge' the intermediate results
• https://cwiki.apache.org/confluence/display/Hive/GenericUDAFCaseStudy

TRANSFORM
• Plug custom scripts into query execution
• SELECT TRANSFORM(stuff)
USING 'script'
AS (thing1 INT, thing2 INT)
• FROM (
FROM pv_users
MAP pv_users.userid, pv_users.date
USING 'map_script'
CLUSTER BY key) map_output
INSERT OVERWRITE TABLE pv_users_reduced
REDUCE map_output.key, map_output.value
USING 'reduce_script'
AS date, count;
• https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform

Hive Indexes
• Indexes aimed at reducing data for range scans
• Fewer splits, fewer map tasks, less IO
• Relies on Predicate Push Down
• Order guarantee can simplify certain algorithms
• GROUP BY aggregations can use streaming aggregates vs. hash aggregates
• Hive does not need/use indexes for ‘seek’ like OLTP RDBMSs
• Indexes are in almost every respect just another table with same data
• Query Optimizer uses rewrite rules to leverage indexes
• Indexes are not automatically maintained on LOAD/INSERT
• https://cwiki.apache.org/confluence/display/Hive/IndexDev

JOIN optimizations
• Difficult problem in MR
• Naïve join relies on MR shuffle to partition the data
• Reducers can implement JOIN easily by merging their input, as it is already sorted
• But it is a size-of-data copy through the MR shuffle
• MapJoin
• If there is one big table (facts) and several small tables (dimensions)
• Read all the small tables, hash them
• serialize the hash into HDFS distributed cache
• Done by driver as stage-0, before launching the actual query
• The MapJoinOperator loads the small tables in memory
• JOIN can be performed on-the-fly, on the map side, avoiding big shuffle
• Requires live RAM, task JVM memory settings must allow for enough memory
• Sort Merge Bucket (SMB) join
• Between big tables that are bucketed by the same key
• And the bucketing key is also the join key
• Map task scans buckets from multiple tables in parallel
• MR only knows about one of them
• For the rest the SMBJoinOperator simulates a MR environment to scan them
• https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization
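
The idea behind MapJoin can be sketched in plain Java: build an in-memory hash table from the small table, then stream the big table past it, so no shuffle is needed. The table contents and names below are invented for the example, and the sketch leaves out the distributed cache and memory limits that matter in a real cluster.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class MapJoinSketch {
    // smallTable: dimension key -> dimension name (fits in memory).
    // bigTable: fact rows as {dimensionKey, amount}, streamed one by one.
    static List<String> mapJoin(Map<Integer, String> smallTable, List<int[]> bigTable) {
        List<String> joined = new ArrayList<>();
        for (int[] fact : bigTable) {
            String dimension = smallTable.get(fact[0]); // hash probe, no sorting needed
            if (dimension != null) {                    // inner join: drop unmatched facts
                joined.add(dimension + "," + fact[1]);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<Integer, String> countries = new HashMap<>();
        countries.put(1, "RO");
        countries.put(2, "US");
        List<int[]> sales = Arrays.asList(
                new int[] {1, 10}, new int[] {2, 20}, new int[] {3, 30}, new int[] {1, 5});
        // prints [RO,10, US,20, RO,5]
        System.out.println(mapJoin(countries, sales));
    }
}
```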

Partitioning in Hive
• CREATE TABLE …. PARTITIONED BY (…)
• Separate data directory created for each distinct combination of
partitioning column values
• Can result in many small files if abused
• Use org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
• Use also Bucketing
• CREATE TABLE …
PARTITIONED BY (…)
CLUSTERED BY (…) SORTED BY (…) INTO … BUCKETS
• Bucketing helps many queries
• https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL+BucketedTables
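
The per-partition directory layout follows a column=value naming scheme, one directory level per partitioning column. A small Java sketch of how such a path is composed; the table directory and the ds/hr column names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class PartitionPathSketch {
    // Builds <tableDir>/<col1>=<val1>/<col2>=<val2>/... in partition-column order.
    static String partitionDir(String tableDir, Map<String, String> spec) {
        StringBuilder path = new StringBuilder(tableDir);
        for (Map.Entry<String, String> e : spec.entrySet()) {
            path.append('/').append(e.getKey()).append('=').append(e.getValue());
        }
        return path.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the columns in PARTITIONED BY order.
        Map<String, String> spec = new LinkedHashMap<>();
        spec.put("ds", "2014-07-03");
        spec.put("hr", "12");
        // prints /warehouse/sales/ds=2014-07-03/hr=12
        System.out.println(partitionDir("/warehouse/sales", spec));
    }
}
```

Every distinct (ds, hr) pair gets its own directory, which is why a high-cardinality partitioning column multiplies the number of directories and files.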

How to get started with Hive
• HDInsight 3.1 comes with Hive 0.13
• Hortonworks Sandbox (VM) has Hive 0.13
• Cloudera CDH 5 VM comes with Hive 0.12
• Build it yourself
• https://cwiki.apache.org/confluence/display/Hive/AdminManual+Installation
• Mailing list: user@hive.apache.org