Hive 3.0 - New Features and Performance Improvements in the Latest Version of HDP

© Hortonworks Inc. 2011–2018. All rights reserved
Zhen Zeng
Solution Engineer
2018/09/21

Agenda
• About the speaker
• Hive basics
• New features in Hive 3
• Summary

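About the speaker
Zhen Zeng (曾臻)
Solution Engineer at Hortonworks.
Previously worked as an engineer at Yahoo, an IT consulting firm, and a systems integrator.
Has experience in the architecture, design, and implementation of big data, data governance, PaaS, and web application systems.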

Hive Basics

Questions
▪ Who is using Hive today?
▪ Who thinks Hive can only be used for batch?

SQL Is King
▪ Powerful
▪ Familiar
▪ Flexible
▪ Widely adopted
▪ Rich surrounding tooling - Deep Ecosystem

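RDBMS vs SQL on Hadoop
[Figure: RDBMS (one logical system): the client talks to a SQL engine that sits together with its metadata and data (tables). SQL on Hadoop (could be in three separate systems): the client talks to a SQL engine, metadata lives in a separate store (e.g. MySQL), and data lives in HDFS.]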

SQL on Hadoop – Schema on Read
RDBMS (Schema on Write)
1 CREATE TABLE …
2 INSERT INTO … (schema is checked here, on write)
3 SELECT …
SQL on Hadoop (Schema on Read): SQL engine over HDFS, with metadata in a separate store (e.g. MySQL)
1 Ingest data into HDFS (e.g. with Flume)
2 CREATE TABLE …
3 SELECT … (schema is checked here, on read)
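
The schema-on-read flow above can be sketched in HiveQL roughly as follows. This is only an illustration: the table name weblogs, its columns, and the HDFS path /data/weblogs are hypothetical, and the files are assumed to be comma-delimited text that was already ingested into HDFS.

```sql
-- Step 1 happened outside Hive: files were landed under /data/weblogs (hypothetical path).

-- Step 2: declare a schema over the existing files.
-- No data is moved or validated at this point.
CREATE EXTERNAL TABLE weblogs (
  ts      STRING,
  user_id STRING,
  url     STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/weblogs';

-- Step 3: the schema is applied only now, while the files are being read.
-- Missing fields, or values that cannot be parsed as the declared type,
-- come back as NULL rather than failing the load.
SELECT url, count(*) AS hits
FROM weblogs
GROUP BY url;
```

Because the table is EXTERNAL, dropping it removes only the metadata and leaves the files in HDFS untouched.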

SQL on Hadoop – Distributed Processing
[Figure: RDBMS: the client talks to a single machine (machine 1). SQL on Hadoop: the client talks to the Hive Server on machine 1, which distributes work to Data Nodes on machine 2, machine 3, machine 4, machine 5, ... machine n.]

What is Apache Hive?
Apache Hive : SQL gateway to Hadoop
Features:
• Extensive SQL:2011 Support
• ACID Transactions
• In-Memory Caching
• Cost-Based Optimizer
• User-Based Dynamic Security
• Replication and Disaster Recovery
• JDBC and ODBC Support
• Compatible with every major BI Tool
• Proven on data at 300+ PB scale

Hive LLAP : Hadoop native solution for Interactive Analytics
• Open Source
• Hadoop Native Integration
• Security
[Figure: HiveServer2 with LLAP serves interactive queries; Spark, MR, and Pig handle ETL; everything runs on YARN over HDFS.]

Apache Hive: Modern Architecture
• Storage: Columnar Storage (ORCFile, Parquet); Unstructured Data (JSON, CSV, Text, Avro, Weblog, Custom)
• Engine: SQL Engines (Row Engine, Vector Engine)
• SQL: SQL Support (SQL:2011, Optimizer, HCatalog, HiveServer2)
• Cache: Block Cache (Linux Cache), Vector Cache (LLAP)
• Distributed Execution: Hadoop 1 (MapReduce), Hadoop 2 (Tez, Spark), Persistent Server (LLAP)
• Access: BI Tool via JDBC / ODBC
Legend: Historical, Current

Apache Hive: Fast Facts
• Most queries per hour: 100,000 queries per hour (Yahoo Japan)
• Analytics performance: 100 million rows/second per node (with Hive LLAP)
• Largest Hive warehouse: 300+ PB raw storage (Facebook)
• Largest cluster: 4,500+ nodes (Yahoo)

Hive's Evolution: MR, Tez, Tez + LLAP
• Map – Reduce: intermediate results in HDFS
• Tez: optimized pipeline
• Tez with LLAP: resident process on nodes, with an in-memory columnar cache; map tasks read HDFS
"Hive can only do batch" is a thing of the past. Processing speed has improved dramatically since Hive 2.

Continuous evolution towards scalable performance
Speed-up markers on the chart: = 20X, = 100X
Hive 1.0 (Hive + MR): Batch SQL
• ETL
• Reporting
• Data Mining
Hive 1.3 (Hive + Tez): Analytic SQL
• Deep analytics
• Reporting
• BI Tools: Microstrategy, Cognos
Hive 2.0 (Hive + Tez + LLAP): Interactive SQL
• Ad-Hoc
• Drill-Down
• Agile BI Tools: Tableau, Power BI
Hive 3.0 (HDP 3): Interactive SQL
• Ad-Hoc
• Result Cache
• Better Workload Management
• Agile BI Tools: Tableau, Power BI

New Features in Hive 3

Hive 3: A Major Step Forward
• Even more use cases
• EDW Offload
• Interactive Query
• OLAP Query
• Real-time ingestion
• Unified SQL
• Data Federation (SQLServer, Oracle, etc)
• Spark-hive Connector
• High performance
• Low latency
• Fast response time
• Cloud Native
• S3, GCS, Azure
Real-Time Data Streams + Workload Management + ACID Transactions + Materialized Views + Scales Horizontally to Petabytes

Hive LLAP – MPP Performance at Hadoop Scale
Batch and interactive workloads can easily run on the same cluster.
[Figure: SQL queries arrive over ODBC / JDBC at HiveServer2 (query endpoint). Query coordinators dispatch work to LLAP daemons, each with query executors and an in-memory cache, inside a YARN cluster (resource management) that is divided into a BI pool, an ETL pool, and a background pool. Deep storage is HDFS and compatible stores: S3, WASB, Isilon.]

EDW analyst pipeline
Tableau / BI systems
Query Result Cache
• Results return from HDFS/cache directly
• Reduce load from repetitive queries
Workload management
• Even more queries can run concurrently
• Reduce resource starvation in large clusters
• Also: Active/Passive HA
Materialized view, Surrogate key, Constraints
• More "tools" for optimizer to use
• More "tools" for DBAs to tune/optimize
• Invisible tuning of DB from users' perspective
ACID v2 & ACID on by default
• ACID v2 is as fast as regular tables

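HIVE-18513: Query result cache
Returns query results directly from storage (e.g. HDFS) without actually executing the query.
Prerequisite:
The same query has been executed before.
Dashboards and reports often issue duplicate queries, so the cache saves resources and improves processing performance.
[Figure: query execution without cache vs. with cache]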

HIVE-18513: Query result cache details
⬢ hive.query.results.cache.enabled=true (on by default)
⬢ Works only for Hive managed tables
– If you JOIN an external table with a Hive managed table, Hive falls back to executing the full query, because Hive cannot know whether the external table data has changed
⬢ Works with ACID
– If the Hive table has been updated, the query is rerun automatically
⬢ Different from the LLAP cache
– The LLAP cache caches the data that was read, so multiple queries benefit by avoiding reads from disk. It speeds up the read path.
– The result cache effectively bypasses execution of the query
⬢ Stored at /tmp/hive/__resultcache__/, default space is 2GB, LRU eviction
– Configurable with hive.query.results.cache.max.size (bytes)
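
A minimal sketch of the result cache in use, based only on the two property names above. The table flights_managed is hypothetical and stands in for any Hive managed (ACID) table. Whether a given run is actually served from the cache depends on the cluster configuration, so no output is shown.

```sql
-- Property named on the slide; true is the stated default.
SET hive.query.results.cache.enabled=true;

-- hive.query.results.cache.max.size is given in bytes (2 GB = 2147483648).
-- It is assumed here to be configured server-side rather than per session.

-- First run: the query executes normally and its result may be cached.
SELECT dest, origin, count(*) AS flight_count
FROM flights_managed
GROUP BY dest, origin;

-- Second run of the identical query text: eligible to be answered from the
-- cache, provided flights_managed has not been modified in between.
SELECT dest, origin, count(*) AS flight_count
FROM flights_managed
GROUP BY dest, origin;
```

Comparing the elapsed time of the two runs is a simple way to check whether the second one skipped execution.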

LLAP Resource Management in 3.x
• Resource plans
Examples:
• Daytime
• Nighttime
• EndOfQuarter
• Resource pools
• Capacity based
• Fair or FIFO scheduling
• Automatic mapping
• Map query to pool
• User | Group | Application
• Triggers to
• Move queries
• Kill queries
Example trigger conditions: the output is too large; the runtime is too long

Resource plan example
CREATE RESOURCE PLAN daytime;
CREATE POOL daytime.bi WITH ALLOC_FRACTION=0.8, QUERY_PARALLELISM=5;
CREATE POOL daytime.etl WITH ALLOC_FRACTION=0.2, QUERY_PARALLELISM=20;
CREATE RULE downgrade IN daytime WHEN total_runtime > 300 THEN MOVE etl;
ADD RULE downgrade TO bi;
CREATE APPLICATION MAPPING tableau IN daytime TO bi;
ALTER PLAN daytime SET DEFAULT POOL = etl;
APPLY PLAN daytime;
[Figure: plan daytime with pools bi: 75% and etl: 25%. Queries are downgraded (moved to another queue) when total_runtime > 300.]

HIVE-17481: LLAP workload management
⬢ Share LLAP cluster resources efficiently
– Resource allocation per user policy; separate ETL and BI, etc.
⬢ Resources based guardrails
– Protect against long running queries, high memory usage
⬢ Improved, query-aware scheduling
– Scheduler is aware of query characteristics, types, etc.
– Fragments easy to pre-empt compared to containers
– Queries are guaranteed a fixed fraction of the cluster's resources, and can also use idle resources so that nothing goes to waste

Tuning for Higher Concurrency
▪ NVMe SSDs
• Metastore DB backend
• 50x to 60x improvements in query cache performance (hive_locks)
• Namenode
• 5-6x improvement in JDBC startup performance
• Keep Namenode edit logs on SSD
• Zookeeper
• RM State store
• HS2 active/passive info
• LLAP service registry
• HDFS
• /tmp folder
• Yarn logs and Yarn local
▪ Run multiple HS2 servers side by side
• Doesn't support Workload Management in HDP 3.0

EDW Features

Materialized Views & DW optimizations
• Accelerate aggregates and joins with MVs
• View navigation via CBO/Calcite
• Optionally allow rewrites against out-of-date materializations

Materialized view
How many unique city-pairs are there?
SELECT count(*)/2
FROM (
SELECT dest,origin,count(*)
FROM flights_hdfs
GROUP BY dest,origin
) as T;
Sub-query can be materialized
CREATE MATERIALIZED VIEW mv1
AS
SELECT dest,origin,count(*)
FROM flights_hdfs
GROUP BY dest,origin;

Materialized view navigation
The query planner will automatically navigate to existing views
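
One way to observe the automatic navigation is sketched below. It assumes mv1 from the previous slide already exists and that flights_hdfs satisfies whatever requirements your Hive build places on materialized view source tables. The property hive.materializedview.rewriting is assumed to be enabled by default; it is set explicitly here only for clarity.

```sql
-- Allow the optimizer to rewrite queries against materialized views.
SET hive.materializedview.rewriting=true;

-- Refresh mv1 after flights_hdfs has received new data.
ALTER MATERIALIZED VIEW mv1 REBUILD;

-- Inspect the plan of the original query. If the rewrite applies,
-- the plan scans mv1 instead of aggregating flights_hdfs again.
EXPLAIN
SELECT count(*)/2
FROM (
  SELECT dest, origin, count(*)
  FROM flights_hdfs
  GROUP BY dest, origin
) AS T;
```

The query text itself does not change; users keep querying flights_hdfs and the planner decides whether mv1 can answer it.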

Hive + Druid : One SQL Interface Across Real-Time and Historical
Unified SQL layer over OLAP cubes (streaming data) and SQL tables (historical data)
• Pre-Aggregate: easily ingest event data into OLAP cubes
• ACID MERGE: keep data up-to-date with Hive MERGE
• Build OLAP cubes from Hive
• Archive data to Hive for history
• Real-Time Query: run OLAP queries in real time
• Deep Analytics: or run deep analytics over all history

Information schema
Question:
Can you find which tables, in which databases, have a column whose name contains "ssn"?
SELECT columns.table_schema, columns.table_name
FROM information_schema.columns
WHERE column_name LIKE '%ssn%';
This is very useful for EDW offload use cases where some queries depend on the databases' metadata.

HIVE-1555: JDBC connector – Data federation
⬢ How did we build the information_schema?
– We basically mapped part of the metastore into Hive's table space!
⬢ Under the hood we used the Hive-JDBC connector
⬢ Read-only for now
⬢ Manual table mapping for now

JDBC Table mapping example
In Postgres:
CREATE TABLE HiveTable
(
id INT,
name varchar
);
In Hive:
CREATE EXTERNAL TABLE HiveTable
(
id INT,
name STRING
)
STORED BY
'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
"hive.sql.database.type" = "POSTGRES",
"hive.sql.jdbc.driver" = "org.postgresql.Driver",
"hive.sql.jdbc.url" = "jdbc:postgresql://hwx-demo-1.field.hortonworks.com:5432/jdbctest",
"hive.sql.dbcp.username" = "jdbctest",
"hive.sql.dbcp.password" = "",
"hive.sql.query" = "SELECT ID, NAME FROM hivetable",
"hive.sql.column.mapping" = "id=ID, name=NAME",
"hive.jdbc.update.on.duplicate" = "true"
);

Database key 101
How many types of keys can you name in the context of databases?
⬢ Primary key
⬢ Secondary key
⬢ Unique key
⬢ Foreign key
⬢ Composite key
⬢ Natural key
⬢ Surrogate key
⬢ Super key
⬢ Candidate key
Why do they matter?
⬢ EDW solutions usually depend on those features
⬢ Keys allow the database engine to make assumptions and run faster
⬢ Essential for building relational databases

Surrogate key generation
⬢ A surrogate key can replace wide, multi-column composite keys.
⬢ A JOIN on two integers is much faster than a JOIN on two strings
SELECT ROW_NUMBER() OVER () as row_num, * FROM airlines;
+----------+----------------+--------------------------------+
| row_num  | airlines.code  | airlines.description           |
+----------+----------------+--------------------------------+
| 1        | 02Q            | Titan Airways                  |
| 2        | 04Q            | Tradewind Aviation             |
| 3        | 05Q            | Comlux Aviation, AG            |
| 4        | 06Q            | Master Top Linhas Aereas Ltd.  |
| 5        | 07Q            | Flair Airlines Ltd.            |
| 6        | 09Q            | Swift Air, LLC                 |
| 7        | 0BQ            | DCA                            |
| 8        | 0CQ            | ACM AIR CHARTER GmbH           |
+----------+----------------+--------------------------------+
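
ROW_NUMBER() produces keys at query time. To persist generated keys in a table, one option is sketched below. It assumes a Hive build that provides the SURROGATE_KEY() function for use in a column DEFAULT (as documented for HDP 3) and an ACID managed table. The table airlines_sk is hypothetical, and the generated values should be treated as unique but not necessarily consecutive.

```sql
-- Hypothetical target table: id is filled in automatically
-- whenever the column is omitted from the INSERT column list.
CREATE TABLE airlines_sk (
  id          BIGINT DEFAULT SURROGATE_KEY(),
  code        STRING,
  description STRING
);

-- Copy the natural key and description; id is generated per row.
INSERT INTO airlines_sk (code, description)
SELECT code, description FROM airlines;

-- Fact tables can now reference the compact BIGINT id
-- instead of joining on the string code.
SELECT id, code, description
FROM airlines_sk
LIMIT 8;
```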

NOT NULL and constraints
⬢ Essential for data integrity
⬢ Only works on ACID and Append-only tables
⬢ hive.constraint.notnull.enforce = true
Example:
CREATE TABLE Persons (
ID Int NOT NULL,
Name String NOT NULL,
Age Int
);
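
A short sketch of how the NOT NULL constraint is expected to behave on the Persons table above, assuming it is an ACID table and enforcement is switched on. The sample rows are made up for illustration.

```sql
-- Enforcement flag named on the slide.
SET hive.constraint.notnull.enforce=true;

-- Accepted: both NOT NULL columns are populated.
INSERT INTO Persons VALUES (1, 'Alice', 30);

-- Accepted: Age carries no constraint, so NULL is allowed.
INSERT INTO Persons VALUES (2, 'Bob', NULL);

-- Expected to be rejected: Name is declared NOT NULL.
INSERT INTO Persons VALUES (3, NULL, 41);
```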

Default value
⬢ Ensures a value exists
⬢ Can be overwritten in INSERT/UPDATE
statements
⬢ Useful in EDW offload cases
Example:
CREATE TABLE Persons_default (
ID Int NOT NULL,
Name String NOT NULL,
Age Int,
Creator String DEFAULT CURRENT_USER(),
CreateDate Date DEFAULT CURRENT_DATE()
);
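
The defaults on Persons_default can be exercised as sketched below; the sample values, including the creator name etl_batch, are hypothetical. Columns left out of the INSERT column list fall back to their declared DEFAULT.

```sql
-- Creator and CreateDate are omitted, so they take
-- CURRENT_USER() and CURRENT_DATE() at insert time.
INSERT INTO Persons_default (ID, Name, Age)
VALUES (1, 'Alice', 30);

-- An explicit value overrides the default for Creator;
-- CreateDate still falls back to CURRENT_DATE().
INSERT INTO Persons_default (ID, Name, Age, Creator)
VALUES (2, 'Bob', 41, 'etl_batch');

SELECT ID, Name, Creator, CreateDate
FROM Persons_default;
```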

ACID v2
▪ Performance just as good as regular non-ACID tables
• Simpler solution, no more requirement for bucketing
▪ There are two parts to ACID v2
• one is insert-only ACID and the other is full CRUD ACID
▪ Insert-only ACID is supported for *ALL* formats
• Parquet
• Avro
• ORC
• Text
▪ Enables new optimizations
• Incremental updates of MV & query cache
• Query cache consistency
▪ Performance may degrade when there are many delta files
– Run compaction
▪ Cannot be downgraded to ACID v1
▪ Fully compatible with native cloud storage
ACID v2
ACID v1 (bucketing and table properties required):
CREATE TABLE hello_acid (key int, value int)
PARTITIONED BY (load_date date)
CLUSTERED BY(key) INTO 3 BUCKETS
STORED AS ORC TBLPROPERTIES ('transactional'='true');
ACID v2 (ACID on by default, no bucketing):
CREATE TABLE hello_acid_v2 (load_date date, key int, value int);
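
A sketch of row-level changes and compaction on hello_acid_v2, assuming the table was created as a managed table with ACID on by default as described above. The staging table hello_staging (same three columns) and the literal values are hypothetical.

```sql
-- Row-level changes work without bucketing on an ACID v2 table.
UPDATE hello_acid_v2 SET value = value + 1 WHERE key = 42;
DELETE FROM hello_acid_v2 WHERE load_date < DATE '2018-01-01';

-- Upsert from a hypothetical staging table with the same columns.
MERGE INTO hello_acid_v2 AS t
USING hello_staging AS s
ON t.key = s.key
WHEN MATCHED THEN UPDATE SET value = s.value
WHEN NOT MATCHED THEN INSERT VALUES (s.load_date, s.key, s.value);

-- Each write leaves delta files behind; request a major compaction
-- to merge them, then check on its progress.
ALTER TABLE hello_acid_v2 COMPACT 'major';
SHOW COMPACTIONS;
```

Compaction runs asynchronously, so SHOW COMPACTIONS may list the request as queued or in progress for a while before it completes.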

HDP3: EDW ingestion pipeline
[Figure: Kafka-Druid-Hive ingest feeds Druid, and Kafka-Hive streaming ingest feeds ACID tables; both are queried through the LLAP interface.]
Real-time analytics
• Druid answers in near real-time
Easy to use
• Query any data via LLAP
• No need to de-ACID tables
• No bucketing required
• Calcite talks SQL
• Materialization just works
• Cache just works

HDP 3 – New Unified Streaming Ingest Pipeline
[Figure: unified ingestion connectors feed real-time rollup (streaming data) and ACID tables with materialized views (historical data); Hive LLAP exposes them through unified SQL to DAS | SuperSet | JDBC.]
Real-time ingest
• Read from Kafka
• Dual write to Hive and Druid
Real-time analytics
• Druid answers in near real-time
• Hive ACID keeps data in sync
Unified API
• Calcite talks unified SQL
• Optimizer automatically uses pre-computed materializations
Easy to use tooling
• DAS: Manage and Optimize
• SuperSet: Dashboard and reports
• JDBC: Tableau, Excel et al

HDP – Security & Governance
Industry First: Dynamic Tag-based Security Policies
• Atlas: tracks metadata and lineage. Its metastore holds tags, assets, and entities in the data lake: streams, pipelines, feeds, Hive tables, HDFS files, HBase tables.
• Ranger: manages access policies and audit logs. Policies can be based on classification, prohibition, time, and location, and are evaluated by the PDP with a resource cache.
• The Atlas client subscribes to a topic and gets metadata updates.

Unique Security Features within HDP for SQL Users (Ranger + Hive)
Row Level Security in Hive
▪ Control access to rows in Hive tables based on context!
▪ Improve reliability and robustness of HDP by providing row-level security to Hive tables and reducing the surface area of the security system
▪ Restrict data row access based on user characteristics (e.g. group membership) AND runtime context
▪ Use cases:
• A hospital can create a security policy that allows doctors to view data rows only for their own patients
• A bank can create a policy to restrict access to rows of financial data based on the employee's business division, locale, or role
• A multi-tenant application can create logical separation of each tenant's data so that each tenant can see only its own data rows
Dynamic Data Masking of Hive Columns
▪ Protect sensitive data in real time with dynamic data masking/obfuscation!
▪ Mask or anonymize sensitive columns of data (e.g. PII, PCI, PHI) in Hive query output
▪ Benefits
• Sensitive information never leaves the database
• No changes are required at the application or Hive layer
• No need to produce additional protected duplicate versions of datasets
• Masking policies are simple and easy to set up

Security: Dynamic Row Filtering & Column Masking
Source table ww_customers:
Country | National ID | CC No | DOB | MRN | Name | Policy ID
US | 232323233 | 4539067047629850 | 9/12/1969 | 8233054331 | John Doe | nj23j424
US | 333287465 | 5391304868205600 | 8/13/1979 | 3736885376 | Jane Doe | cadsd984
Germany | T22000129 | 4532786256545550 | 3/5/1963 | 876452830A | Ernie Schwarz | KK-2345909

Ranger Policy Enforcement
The query is rewritten based on dynamic Ranger policies: filter rows by region and apply the relevant column masking.

User 1: Joe (Location: US, Group: Analyst)
Original Query:
SELECT country, nationalid, ccnumber, mrn, name FROM ww_customers
Result:
Country | National ID | CC No | MRN | Name
US | xxxxx3233 | 4539 xxxx xxxx xxxx | null | John Doe
US | xxxxx7465 | 5391 xxxx xxxx xxxx | null | Jane Doe
Users from the US Analyst group see data for US persons only, with CC and National ID (SSN) masked and MRN nullified.

User 2: Ivanna (Location: EU, Group: HR)
Original Query:
SELECT country, nationalid, name, mrn FROM ww_customers
Result:
Country | National ID | Name | MRN
Germany | T22000129 | Ernie Schwarz | 876452830A
EU HR policy admins can see unmasked values but are restricted by row filtering policies to data for EU persons only.
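
To make the effect of the policies concrete, the query below is a hand-written approximation of what User 1 effectively runs. It is not the text Ranger generates. It assumes Hive's built-in mask_show_last_n and mask_show_first_n functions accept optional replacement characters for upper-case, lower-case, and digit characters in that order, and that all the columns involved are strings.

```sql
-- Conceptual equivalent for User 1 (Location: US, Group: Analyst).
-- Ranger applies the filter and masks transparently; the user still
-- submits the plain SELECT shown on the slide.
SELECT country,
       -- keep the last 4 characters, replace the rest with 'x'
       mask_show_last_n(nationalid, 4, 'x', 'x', 'x') AS nationalid,
       -- keep the first 4 characters, replace the rest with 'x'
       mask_show_first_n(ccnumber, 4, 'x', 'x', 'x')  AS ccnumber,
       -- MRN is nullified by policy
       CAST(NULL AS STRING)                           AS mrn,
       name
FROM ww_customers
-- row filter for the US Analyst group
WHERE country = 'US';
```

User 2 would get the same table with a different row filter (EU countries) and no masking, so the application query never has to encode who is asking.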

Hortonworks Data Analytics Studio
• Available both in the cloud and on-premises as part of DataPlane Services
• Became generally available (GA) in September 2018
[Figure: Hortonworks DataPlane Service. Extensible services: Data Lifecycle Manager, Data Steward Studio, Data Analytics Studio, IBM DSX*, + other (partner). Core capabilities: data source integration, data services catalog, security controls. Spans multiple clusters and sources (multi, hybrid). *not yet available, coming soon]

Why is my query slow?
• Noisy neighbors → Expensive query log
• Poor schema → Storage optimizations
• Inefficient queries → Query optimizations
• Unstable demand → Demand shifting
Hortonworks Data Analytics Studio: Optimize Your Hive Workloads
Part of the Hortonworks DataPlane Services

Data Analytics Studio (DAS)

SOLUTIONS: Full-featured auto-complete, direct download of results, quick data preview, and many other quality-of-life improvements.

SOLUTIONS: Data Analytics Studio provides a database heatmap, so you can quickly see which parts of your cluster are utilized most.

SOLUTIONS: Pre-defined searches to quickly narrow down problematic queries in a large cluster.

SOLUTIONS: Heuristic recommendation engine. Fully self-service query and storage optimization.

SOLUTIONS: Built-in batch operations. No more scripting needed for day-to-day operations.

Summary

Hive 3 - Scalable Data Warehousing on Hadoop
Capabilities and applications
• Batch SQL: ETL, Reporting, Data Mining, Deep Analytics
• OLAP / Cube: Multidimensional Analytics, MDX Tools, Excel
• Interactive SQL: Reporting, BI Tools (Tableau, Microstrategy, Cognos)
• Sub-Second SQL (Hive LLAP): Ad-Hoc, Drill-Down, BI Tools (Tableau, Excel)
• ACID / MERGE: Continuous Ingestion from Operational DBMS, Slowly Changing Dimensions
Core Platform
• Scale-Out Storage, Petabyte Scale Processing
• Core SQL Engine: Apache Tez (Scalable Distributed Processing), Advanced Cost-Based Optimizer
• Connectivity: JDBC / ODBC, Advanced Security, Comprehensive SQL:2011 Coverage

Summary
• Our vision of the future EDW is a unified, open source data access layer that works across technologies and in a hybrid model
• Druid, Kafka and Hive integration enables real-time analytics on event streams
• Offloading is still the primary use case, and Hive is becoming a full-featured database
• ACID on by default enables data change at scale, which is key to supporting GDPR
• Usability and visibility improve with the release of Data Analytics Studio (DAS)

Questions?

Thank you
