Apache Hive is an open source data warehousing framework built on Hadoop. It allows users to query large datasets using SQL and handles parallelization behind the scenes. Hive supports various file formats like ORC, Parquet, and Avro. It uses a directed acyclic graph (DAG) execution engine like Tez or Spark to improve performance over traditional MapReduce. The metastore stores metadata about databases, tables, and partitions to allow data discovery and abstraction. Hive's cost-based optimizer and in-memory query processing features like LLAP improve performance for interactive queries on large datasets.
4. About me
Oracle ACE
Data and Linux geek
Long-time open source supporter
Works for @redgluept as Data Architect
@drune
5. Big Data Thinking Strategy
●Think small
●Think big
●Don’t think at all (hype is here)
6. What is Apache Hive?
●Open source, TB/PB-scale data warehousing framework based on Hadoop
●The first and most complete SQL-on-"Hadoop"
●SQL:2003 and SQL:2011 compatible
●Data stored in several formats
●Several execution engines available
●Interactive query support (in-memory cache)
7. Apache Hive - Before you ask
●Data warehouse/OLAP activities (data mining, data
exploration, batch processing, ETL, etc.) - "the
heavy lifting of data"
●Low-cost scaling, built with extensibility in mind
●Use for large datasets (gigabyte/terabyte scale)
●Don't use Hive for any OLTP activities
●ACID exists, but is not recommended yet
8. The reason behind Hive
I had written, as part of working with the Feed team - what became - a rather complicated MR
job to rank friends by mutual friends.
In doing so I had pretty much used every Hadoop trick in the bag (partitioners, separate
map and reduce sorting keys, comparators, in-memory hash tables and so on) and realized how
hard it was to write an optimal MR job (particularly on large data sets).
Assembling data into complex data structures was also painful.
I really wanted to see these types of operators exposed in a high level declarative form so
that the average user would never have to go through this. Fortunately - our team had
Oracle veterans well versed in the art of SQL.
Joydeep Sen Sarma (Facebook)
9. The reason behind Hive
Instead of complex MR jobs
You have declarative language...
10. Apache Hive versions & branches
master branch-1
Version 2.x
New code
New features
Version 1.x
Stable
Backwards
compatibility
Critical
bugs
Hadoop 1.x and 2.x
supported
Hadoop 2.x
supported
stable features
11. Data Model (data units & types)
●Supports primitive column types (integers,
numbers, strings, date/time and booleans)
●Supports complex types: Structs, Maps and
Arrays
●Concept of databases, tables, partitions and
buckets
●SerDe: the serializer/deserializer API used to
move data in and out of tables
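The SerDe layer can be sketched with a hypothetical table whose files are JSON documents, using the built-in JsonSerDe (table and column names are illustrative):

```sql
-- The SerDe translates between the file bytes and Hive's row objects.
CREATE TABLE events_json (
  event_id   BIGINT,
  event_type STRING,
  payload    MAP<STRING, STRING>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;
```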
12. Data Model (partitions & bucketing)
● Partitioning: used for distributing load horizontally; gives a performance benefit and
organizes data
PARTITIONED BY (flightName STRING, AircraftName STRING)
/employees/flightName=ABC/AircraftName=XYZ
● Buckets (clusters): decompose data sets into more manageable parts, help
with map-side joins, and enable correct sampling within the same bucket
"Records with the same flightID will always be stored in the same bucket.
Assuming the number of flightIDs is much greater than the number of buckets, each
bucket will have many flightIDs."
CLUSTERED BY (flightID) INTO XX BUCKETS;
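Putting the two fragments above together, a minimal sketch of a table that is both partitioned and bucketed might look like this (the table, its columns, and the bucket count are illustrative):

```sql
CREATE TABLE flights (
  flightID INT,
  origin   STRING,
  dest     STRING
)
-- Each distinct (flightName, AircraftName) pair becomes an HDFS directory.
PARTITIONED BY (flightName STRING, AircraftName STRING)
-- Within a partition, rows are hashed on flightID into a fixed set of files.
CLUSTERED BY (flightID) INTO 32 BUCKETS
STORED AS ORC;
```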
13. Data Model (complex data types)
Array: ordered collection of fields, all of the same type - array(1, 2)
Map: unordered key-value pairs; keys are primitives, values are any type - map('a', 1, 'b', 2)
Struct: a collection of named fields - struct('a', 10, 2.5)
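A short sketch of how these three types are declared and accessed (hypothetical table and column names):

```sql
CREATE TABLE crew (
  name      STRING,
  languages ARRAY<STRING>,
  scores    MAP<STRING, INT>,
  address   STRUCT<city:STRING, country:STRING>
);

-- Arrays and maps are indexed with [], struct fields with dot notation:
SELECT languages[0], scores['safety'], address.city FROM crew;
```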
15. HiveQL
●HiveQL is an SQL-like query language for Hive
●Supports DDL and DML
●Supports multi-table inserts
●Possible to write custom map-reduce scripts
●Supports UDFs, UDAFs and UDTFs
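A multi-table insert, one of the features listed above, lets a single scan of a source table feed several destinations. A minimal sketch, with hypothetical table names:

```sql
-- One pass over flights populates two tables with different filters.
FROM flights f
INSERT OVERWRITE TABLE short_haul SELECT f.* WHERE f.distance <  1000
INSERT OVERWRITE TABLE long_haul  SELECT f.* WHERE f.distance >= 1000;
```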
16. DDL (some examples)
HIVE> CREATE DATABASE/SCHEMA, TABLE, VIEW, INDEX
HIVE> DROP DATABASE/SCHEMA, TABLE, VIEW, INDEX
HIVE> TRUNCATE TABLE
HIVE> ALTER DATABASE/SCHEMA, TABLE, VIEW
HIVE> SHOW DATABASES/SCHEMAS, TABLES, TBLPROPERTIES, VIEWS,
PARTITIONS, FUNCTIONS
HIVE> DESCRIBE DATABASE/SCHEMA, table_name, view_name
17. File formats
● Parquet: compressed, efficient columnar data
representation available to any project in the Hadoop
ecosystem
● ORC: made for Hive; supports the Hive type model, columnar
storage, block compression, predicate pushdown, ACID*,
etc.
● Avro: uses JSON for defining data types and protocols, and
serializes data in a compact binary format
● Compressed file formats (LZO, GZIP)
● Plain text files
● Any other format can be read given a suitable SerDe
(CSV, JSON, XML, etc.)
18. ORC
●Stored as columns and compressed = smaller disk
reads
●ORC has a built-in index, min/max values, and
other aggregates (e.g. sum, max) = skip entire
blocks to speed up reads
●ORC implements predicate pushdown and bloom
filters
●ORC scales
●You should use it :-)
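As a sketch, an ORC table with block compression and a bloom filter on a frequently filtered column can be declared via table properties (table and column names are illustrative; the property keys are the standard ORC ones):

```sql
CREATE TABLE flightperf_orc (
  flightnum INT,
  origin    STRING,
  dest      STRING
)
STORED AS ORC
TBLPROPERTIES (
  'orc.compress' = 'ZLIB',               -- block compression
  'orc.bloom.filter.columns' = 'flightnum'  -- speeds up point lookups
);
```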
19. Indexing
● Not recommended, because of ORC
● ORC has built-in indexes which allow the format to skip
blocks of data during reads
● Hive indexes are implemented as tables
● Compact indexes and bitmap indexes are supported
● They are tables recording which data is in
which blocks, used to skip data (as ORC already
does)
● Not supported on the Tez engine - ignored
● Indexes in Hive are not like indexes in other databases
21. Hive Architecture
Clients: Hive Web Interface, Hive CLI (beeline, hive), Hive JDBC/ODBC
Thrift Server (HiveServer2)
Driver: Compiler (Parser, Semantic Analyser, Logical Plan Generator, Query Plan Generator), Optimizer, Executor, Metastore client
Metastore: RDBMS
Execution Engines: MapReduce, Tez, Spark
Resource Management: YARN
Storage: HDFS, HBase, Azure Storage, Amazon S3
22. Metastore
● Typically stored in an RDBMS (MySQL, SQL Server,
PostgreSQL, Derby*) - gives ACID and concurrency on metadata
queries
● Contains metadata for databases, tables and partitions
● Provides two features: data discovery and data abstraction
● Data abstraction: information about data formats,
extractors and loaders is provided at table creation and reused
(cf. Oracle's dictionary tables)
● Data discovery: discover relevant and specific data; allows
other tools to use the metadata to explore data (e.g. SparkSQL)
24. Execution engines
● 3 execution engines are available:
○ MapReduce (mr)
○ Tez
○ Spark
MR: the original, most stable and most reliable; batch-oriented,
disk-based parallelism (like traditional Hadoop MR jobs).
Tez: high-performance batch and interactive data processing. Stable
99% of the time. The one you should use. Default on HDP.
Spark: uses Apache Spark (an in-memory computing platform); high-performance
(like Tez), not used in production (yet), good progress.
25. MapReduce vs Tez/Spark
MapReduce:
● One pair of map and reduce does one level of aggregation over the
data. Complex computations typically require multiple such steps.
Tez/Spark:
● DAG (Directed Acyclic Graph)
● The graph has no cycles because the fault-tolerance
mechanism used by Tez is re-execution of failed tasks
● The limitations of MapReduce in Hadoop became a key reason to
introduce DAGs
● Pipelines consecutive map steps into one
● Removes the enforced serialization between consecutive MapReduce
jobs, allowing concurrency
26. Tez & DAGs
DAG definition:
● Data processing is expressed in the form of a directed acyclic graph
(DAG)
Two main components:
● Vertices - nodes in the graph representing processing of data
○ User logic that analyses and modifies the data sits in the vertices
● Edges - represent movement of data between the processing steps
○ Define routing of data between tasks (One-To-One, Broadcast,
Scatter-Gather)
○ Define when a consumer task is scheduled (Sequential,
Concurrent)
○ Define the lifetime/reliability of a task output
27. Hive Cost Based Optimizer - Why
● Distributed SQL query processing in Hadoop differs from conventional
relational query engines in its handling of intermediate
result sets
● Query processing requires sorting and reassembling of intermediate
result sets - shuffling
● Most of the existing optimizations in Hive are about minimizing
shuffling cost, plus logical optimizations like filter push-down,
projection pruning and partition pruning
● Join reordering and join-algorithm selection become possible with a cost-based optimizer
28. Hive CBO - What you get
● Based on a project called Apache Calcite (https://calcite.apache.org/)
● With a cost-based optimizer you get:
○ Join ordering (join reordering)
○ Which algorithm to use for a join
○ Whether an intermediate result should be persisted or recomputed on
failure
○ The degree of parallelism at any operator (number of mappers and
reducers)
○ Semi-join selection
○ (other optimizer tricks, like histograms)
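In practice the CBO needs to be enabled and fed statistics before it can reorder joins. A minimal sketch using the standard configuration properties (the `flights` table name is illustrative):

```sql
-- Enable the CBO and statistics-based planning.
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;

-- Gather the table- and column-level statistics the optimizer relies on.
ANALYZE TABLE flights COMPUTE STATISTICS;
ANALYZE TABLE flights COMPUTE STATISTICS FOR COLUMNS;
```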
30. Hive - The present-future
● Tez and Spark head to head on performance and stability
● LLAP (Live Long and Process) - Hive interactive queries
● ACID
31. Hive next big thing: LLAP
● Sub-second queries (interactive queries)
● In-memory caching layer with async I/O
● Fast concurrent execution
● Move from disk-oriented to memory-oriented execution (the trend)
● Disks are connected to the CPU via the network - data locality is no longer relevant
SQL:2011 - Seventh revision of the ISO (1987) and ANSI (1986) standard for the SQL database query language
2007 - 15TB
2009 - 2PB
https://cwiki.apache.org/confluence/display/Hive/HowToContribute#HowToContribute-UnderstandingHiveBranches
Release and feature branches not added to the slide, as they would make it too complex
Predicate Pushdown: running operations that filter or cut down data as close to the beginning of your map-reduce pipeline as possible
Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set.
Show create tables different formats (ORC and PLAINTEXT)
Create an index on a table:
Not supported in TEZ
set hive.execution.engine=mr;
create index idxFlightNum on table flightperfall(flightnum) AS 'COMPACT' WITH DEFERRED REBUILD;
alter index idxFlightNum ON flightperfall rebuild;
show formatted index on flightperfall;
explain select * from flightperfall where flightnum=613 limit 1;
set hive.optimize.index.filter.compact.minsize=10;
explain select * from flightperfall where flightnum=613 limit 1;
set hive.optimize.index.filter.compact.minsize=5368709120;
Execution times;
Show operator tree with index and without index
ORC vs CSV query time:
select * from flightperfall_orc where flightnum=613 limit 1;
Describe - components
HiveCLI - management tools
Ambari - Apache Ambari is a tool for provisioning, managing, and monitoring Apache Hadoop clusters.
HiveServer2 - HiveServer2 (HS2) is a server interface that enables remote clients to execute queries against Hive and retrieve the results. It is based on Apache Thrift RPC. It is an improved version of HiveServer and supports multi-client concurrency, authentication, and better support for open API clients like JDBC and ODBC.
Driver - Driver manages the life cycle of a HiveQL statement during compilation, optimization and execution
Compiler - The component that parses the query, does semantic analysis on the different query blocks and query expressions and eventually generates an execution plan with the help of the table and partition metadata looked up from the metastore:
Parser – Transform a query string to a parse tree representation
Semantic Analyser - Transform the parse tree to an internal query representation (column names are verified and expansions like * are performed), Type-checking and any implicit type conversions and partition checking.
Logical Plan Generator - Convert the internal query representation to a logical plan, which consists of a tree of operators. This step also includes the optimizer to transform the plan to improve performance;
Query Plan Generator – Convert the logical plan to a series of map-reduce tasks (or DAGs stages)
Optimizer - As of 2011, it was rule-based and performed column pruning and predicate pushdown. Now it is cost-based, like an RDBMS.
Executor engine (Processing) - The component which executes the execution plan created by the compiler. The plan is a DAG of stages. The execution engine manages the dependencies between these different stages of the plan and executes these stages
Metastore - The component that stores all the structure information of the various tables and partitions in the warehouse including column and column type information, the serializers and deserializers necessary to read and write data and the corresponding HDFS files where the data is stored.
https://cwiki.apache.org/confluence/display/Hive/Design
ssh root@127.0.0.1 -p 2222 (sandbox)
Test CLI (beeline and hive cmd)
Beeline: !connect jdbc:hive2://localhost:10000
Show ambari
- Identify metastore hive (mysql database)
- mysql -u root -p ; password: hadoop ; show databases; use hive; select * from DBS; select * from TBLS;
Identify execution engines:
SET hive.execution.engine
Identify CBO active:
set hive.cbo.enable;
set hive.compute.query.using.stats;
set hive.stats.fetch.column.stats;
set hive.stats.fetch.partition.stats;
explain select * from sample_07, sample_08 where sample_07.code = sample_08.code and sample_07.salary > 1000;
Conditions for the CBO, for example: statistics on the table, columns or others (too few joins).
Show a database, a table and a file stored in HDFS
Hdfs
Tez – Hindi for “speed”
Example: jobs A and B are independent of each other, but job C needs the results from A and B to complete, Tez will execute A and B in any order and forward the results to C
One-To-One: Data from the ith producer task routes to the ith consumer task.
Broadcast: Data from a producer task routes to all consumer tasks.
Scatter-Gather: Producer tasks scatter data into shards and consumer tasks gather the shards
Sequential: Consumer task may be scheduled after a producer task completes.
Concurrent: Consumer task must be co-scheduled with a producer task.
In Hive most of the optimizations are not based on the cost of query execution. Most of the optimizations do not rearrange the operator tree except for filter push down and operator merging.
http://web.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-14-2.pdf
Query:
SELECT year, month, origin, dest, distance FROM flights.flightperfall_orc where flightnum in (select max(flightnum) from flights.flightperfpartorc where year=2008)
MR: (41.1 seconds)
Tez: (3.761 seconds)
Show Tez View (via ambari)
analyze table customer COMPUTE STATISTICS;
analyze table customer COMPUTE STATISTICS for columns;
use foodmart;
explain select * from sales_fact_dec_1998 sf, customer c, product p, store ss
where sf.customer_id = c.customer_id
and p.product_id = sf.product_id
and ss.store_id = sf.store_id
and sf.customer_id > 100
and ss.store_id = 5