1 © Hortonworks Inc. 2011–2018. All rights reserved
Hive 3.0
- New features and performance improvements delivered by the latest version of HDP
Zhen Zeng
Solution Engineer
2018/09/21
Agenda
• About me
• Hive basics
• New features in Hive 3
• Summary
About me
Zhen Zeng (曾臻)
Solution Engineer at Hortonworks.
Previously worked as an engineer at Yahoo, an IT consulting firm, and a systems integrator.
Experienced as an architect, designer, and implementer in big data, data governance, PaaS, and web applications.
Hive basics
Questions
▪ Who is currently using Hive?
▪ Who thinks Hive can only be used for batch processing?
SQL Is King
▪ Powerful
▪ Familiar
▪ Flexible
▪ Widely adopted
▪ Rich surrounding tooling - Deep Ecosystem
RDBMS vs SQL on Hadoop
[Diagram: in an RDBMS (one logical system) the client talks to a single system that contains the SQL engine, the metadata, and the data (tables). With SQL on Hadoop the client talks to a SQL engine, the data is in HDFS, and the metadata is in a separate store (e.g. MySQL); these could be three separate systems.]
SQL on Hadoop – Schema on Read
RDBMS (Schema on Write):
1 CREATE TABLE …
2 INSERT INTO … (the schema is checked here, on write)
3 SELECT …
SQL on Hadoop (Schema on Read):
1 Ingest data into HDFS (e.g. with Flume)
2 CREATE TABLE …
3 SELECT … (the schema is checked here, on read)
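The difference can be sketched in a few lines of plain Python. This is only an illustration, not Hive code: the sample rows, the `SCHEMA` list, and the helper `read_with_schema` are made up for the example. The raw lines are stored untouched, and the schema is applied only when they are read. A value that does not fit the declared type shows up as a missing value at query time, much as Hive returns NULL for a text field it cannot parse.

```python
import csv
import io

# Step 1: raw data is ingested as-is; nothing validates it at write time.
RAW_FILE = "1,alice,34\n2,bob,not_a_number\n3,carol,29\n"

# Step 2: a schema is declared later, independently of the stored bytes.
SCHEMA = [("id", int), ("name", str), ("age", int)]


def read_with_schema(raw_text, schema):
    """Apply the schema while reading; values that do not fit become None."""
    rows = []
    for record in csv.reader(io.StringIO(raw_text)):
        row = {}
        for (column, cast), value in zip(schema, record):
            try:
                row[column] = cast(value)
            except ValueError:
                row[column] = None  # a bad value only surfaces at read time
        rows.append(row)
    return rows


# Step 3: the query runs against the schema-applied view of the raw data.
rows = read_with_schema(RAW_FILE, SCHEMA)
ages = [row["age"] for row in rows]
print(ages)
```

With schema on write, the second row would have been rejected at INSERT time instead.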
SQL on Hadoop – Distributed processing
[Diagram: with an RDBMS the client connects to a single machine (machine 1). With Hive the client connects to the Hive Server on machine 1, and the data nodes run on machine 2, machine 3, machine 4, machine 5, ..., machine n.]
What is Apache Hive?
Apache Hive : SQL gateway to Hadoop
Features:
• Extensive SQL:2011 Support
• ACID Transactions
• In-Memory Caching
• Cost-Based Optimizer
• User-Based Dynamic Security
• Replication and Disaster Recovery
• JDBC and ODBC Support
• Compatible with every major BI Tool
• Proven on data at 300+ PB scale
Hive LLAP : Hadoop native solution for Interactive Analytics
• Open Source
• Hadoop Native Integration
• Security
[Diagram labels: Hive Server 2, LLAP, Spark, MR, and Pig on top of YARN and HDFS; workloads shown are ETL and interactive queries]
Apache Hive: Modern Architecture
[Diagram: architecture layers; a legend distinguishes historical from current components]
Storage: Columnar Storage (ORCFile, Parquet); Unstructured Data (JSON, CSV, Text, Avro, Weblog, Custom)
Distributed Execution: Hadoop 1 (MapReduce); Hadoop 2 (Tez, Spark)
Cache: Block Cache, Linux Cache, Vector Cache
LLAP: Persistent Server
Engine: SQL Engines (Row Engine, Vector Engine)
SQL: SQL Support (SQL:2011, Optimizer, HCatalog, HiveServer2)
BI Tool: JDBC, ODBC
Apache Hive: Fast Facts
Most Queries Per Hour: 100,000 Queries Per Hour (Yahoo Japan)
Analytics Performance: 100 Million Rows/Second Per Node (with Hive LLAP)
Largest Hive Warehouse: 300+ PB Raw Storage (Facebook)
Largest Cluster: 4,500+ Nodes (Yahoo)
Evolution of Hive: MR, Tez, Tez + LLAP
[Diagram: task graphs (M, R, and T tasks) for the three execution models; map tasks read HDFS]
Map – Reduce: intermediate results in HDFS
Tez: optimized pipeline
Tez with LLAP: resident process on nodes, with an in-memory columnar cache
"Hive can only do batch" is a thing of the past. Processing speed has improved dramatically since Hive 2.
Continuous evolution towards scalable performance
[Chart: speedup markers of 20X and 100X across the generations below]
Hive 1.0 (Hive + MR): Batch SQL
• ETL
• Reporting
• Data Mining
Hive 1.3 (Hive + Tez): Analytical SQL
• Deep analytics
• Reporting
• BI Tools: Microstrategy, Cognos
Hive 2.0 (Hive + Tez + LLAP): Interactive SQL
• Ad-Hoc
• Drill-Down
• Agile BI Tools: Tableau, Power BI
Hive 3.0 (HDP 3): Interactive SQL
• Ad-Hoc
• Result Cache
• Better workload management
• Agile BI Tools: Tableau, Power BI
New features in Hive 3
Hive 3: a major step forward
• Even more use cases
• EDW Offload
• Interactive Query
• OLAP Query
• Real-time ingestion
• Unified SQL
• Data Federation (SQLServer, Oracle, etc)
• Spark-hive Connector
• High performance
• Low latency
• Fast response time
• Cloud Native
• S3, GCS, Azure
Real-Time Data Streams+
Workload Management+
ACID Transactions+
Materialized Views+
Scales Horizontally to Petabytes+
Hive LLAP – MPP Performance at Hadoop Scale
[Diagram: SQL queries arrive over ODBC / JDBC at HiveServer2 (the query endpoint) and are handled by query coordinators. Inside the YARN cluster, each LLAP daemon hosts query executors and an in-memory cache. Resource management divides capacity into a BI pool, an ETL pool, and a background pool. Deep storage is HDFS or a compatible store (S3, WASB, Isilon).]
Batch and interactive workloads can easily run on the same cluster.
EDW analyst pipeline
[Diagram: Tableau and other BI systems query Hive through these components: query result cache, workload management, materialized views, surrogate keys, constraints, and ACID v2 (ACID on by default)]
• Results return from HDFS/cache directly
• Reduce load from repetitive queries
• Even more queries can run concurrently
• Reduce resource starvation in large clusters
• Also: Active/Passive HA
• More "tools" for optimizer to use
• More "tools" for DBAs to tune/optimize
• Invisible tuning of DB from users' perspective
• ACID v2 is as fast as regular tables
HIVE-18513: Query result cache
Returns the query result directly from storage (e.g. HDFS) without actually executing the query.
Prerequisite:
The same query has been executed before.
Dashboards and reports often issue duplicate queries, so the cache saves resources and improves processing performance.
[Chart: query execution without cache vs. with cache]
HIVE-18513: Query result cache details
⬢ hive.query.results.cache.enabled=true (on by default)
⬢ Works only with Hive managed tables
– If you JOIN an external table with a Hive managed table, Hive falls back to executing the full query, because Hive cannot know whether the external table data has changed
⬢ Works with ACID
– If a Hive table has been updated, the query is rerun automatically
⬢ Different from the LLAP cache
– The LLAP cache caches the data that is read, so multiple queries benefit by avoiding reads from disk. It speeds up the read path.
– The result cache effectively bypasses execution of the query
⬢ Stored at /tmp/hive/__resultcache__/, default space is 2GB, LRU eviction
– Configurable with hive.query.results.cache.max.size (bytes)
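The behavior described above (serve a repeated query from the cache, rerun it after the table changes, evict the least recently used entry when space runs out) can be modeled with a small Python sketch. It is not Hive's implementation. The `ResultCache` class, the per-table version numbers that stand in for whatever transactional state Hive actually compares, and the crude size estimate are all assumptions made for illustration.

```python
from collections import OrderedDict


class ResultCache:
    """Toy query result cache: LRU eviction, invalidated when a table changes."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self.entries = OrderedDict()  # query text -> (table versions, result)

    def get(self, query, table_versions):
        entry = self.entries.get(query)
        if entry is None:
            return None
        cached_versions, result = entry
        if cached_versions != table_versions:
            # A table was updated since the result was cached: drop the entry.
            self._remove(query)
            return None
        self.entries.move_to_end(query)  # mark as most recently used
        return result

    def put(self, query, table_versions, result):
        if query in self.entries:
            self._remove(query)
        size = len(repr(result))  # crude stand-in for the stored size
        if size > self.max_bytes:
            return  # too large to cache at all
        while self.used_bytes + size > self.max_bytes:
            oldest = next(iter(self.entries))
            self._remove(oldest)  # evict the least recently used entry
        self.entries[query] = (dict(table_versions), result)
        self.used_bytes += size

    def _remove(self, query):
        _, result = self.entries.pop(query)
        self.used_bytes -= len(repr(result))


def run_query(cache, query, table_versions, execute):
    """Return (result, served_from_cache)."""
    cached = cache.get(query, table_versions)
    if cached is not None:
        return cached, True
    result = execute()
    cache.put(query, table_versions, result)
    return result, False


cache = ResultCache(max_bytes=1024)
versions = {"flights": 1}
first = run_query(cache, "SELECT count(*) FROM flights", versions, lambda: [(100,)])
second = run_query(cache, "SELECT count(*) FROM flights", versions, lambda: [(100,)])
versions = {"flights": 2}  # the table was updated by a transaction
third = run_query(cache, "SELECT count(*) FROM flights", versions, lambda: [(101,)])
```

Here the first call executes the query, the second is served from the cache, and the third executes again because the table version changed.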
LLAP Resource Management in 3.x
• Resource plans
Examples:
• Daytime
• Nighttime
• EndOfQuarter
• Resource pools
• Capacity based
• Fair or FIFO scheduling
• Automatic mapping
• Map query to pool
• User | Group | Application
• Triggers to
• Move queries
• Kill queries
Trigger examples: the output is too large; the execution time is too long
Resource plan example
CREATE RESOURCE PLAN daytime;
CREATE POOL daytime.bi WITH ALLOC_FRACTION=0.8, QUERY_PARALLELISM=5;
CREATE POOL daytime.etl WITH ALLOC_FRACTION=0.2, QUERY_PARALLELISM=20;
CREATE RULE downgrade IN daytime WHEN total_runtime > 300 THEN MOVE etl;
ADD RULE downgrade TO bi;
CREATE APPLICATION MAPPING tableau in daytime TO bi;
ALTER PLAN daytime SET default pool= etl;
APPLY PLAN daytime;
[Diagram: plan daytime with pools bi (80%) and etl (20%); a query is moved to the etl queue (downgraded) when total_runtime > 300]
HIVE-17481: LLAP workload management
⬢ Share LLAP cluster resources efficiently
– Resource allocation per user policy; separate ETL and BI, etc.
⬢ Resource-based guardrails
– Protect against long running queries, high memory usage
⬢ Improved, query-aware scheduling
– Scheduler is aware of query characteristics, types, etc.
– Fragments are easy to pre-empt compared to containers
– Queries are guaranteed their configured share of cluster resources and can also use idle resources, so nothing goes to waste
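To make the moving parts concrete, here is a toy Python model of a resource plan with two pools, an application mapping, a default pool, and a move trigger. It mirrors the daytime example in shape only. The classes `Pool`, `Query`, and `ResourcePlan` and the application name "spark-etl" are invented for this sketch, and nothing here reflects how LLAP schedules work internally.

```python
from dataclasses import dataclass, field


@dataclass
class Pool:
    name: str
    alloc_fraction: float
    query_parallelism: int
    queries: list = field(default_factory=list)


@dataclass
class Query:
    query_id: str
    application: str
    total_runtime: int = 0  # same unit as the trigger threshold


class ResourcePlan:
    def __init__(self, pools, default_pool, app_mappings):
        self.pools = {pool.name: pool for pool in pools}
        self.default_pool = default_pool
        self.app_mappings = app_mappings
        self.triggers = []  # (source pool, runtime limit, target pool)

    def add_move_trigger(self, source, runtime_limit, target):
        self.triggers.append((source, runtime_limit, target))

    def submit(self, query):
        """Route a query by application mapping, else to the default pool."""
        pool_name = self.app_mappings.get(query.application, self.default_pool)
        self.pools[pool_name].queries.append(query)
        return pool_name

    def apply_triggers(self):
        """Move every query whose runtime exceeds the limit set on its pool."""
        moved = []
        for source, runtime_limit, target in self.triggers:
            pool = self.pools[source]
            for query in list(pool.queries):
                if query.total_runtime > runtime_limit:
                    pool.queries.remove(query)
                    self.pools[target].queries.append(query)
                    moved.append(query.query_id)
        return moved


plan = ResourcePlan(
    pools=[Pool("bi", 0.8, 5), Pool("etl", 0.2, 20)],
    default_pool="etl",
    app_mappings={"tableau": "bi"},
)
plan.add_move_trigger("bi", 300, "etl")

dashboard = Query("q1", application="tableau")
nightly = Query("q2", application="spark-etl")
placements = [plan.submit(dashboard), plan.submit(nightly)]

dashboard.total_runtime = 301  # the dashboard query has run too long
moved = plan.apply_triggers()
```

The dashboard query starts in the bi pool because of its application mapping and is downgraded to etl once it crosses the runtime threshold.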
Tuning for higher concurrency
▪ NVMe SSDs
• Metastore DB backend
• 50x to 60x improvements in query cache performance (hive_locks)
• Namenode
• 5-6x improvement in JDBC startup performance
• Keep Namenode edit logs on SSD
• Zookeeper
• RM State store
• HS2 active/passive info
• LLAP service registry
• HDFS
• /tmp folder
• Yarn logs and Yarn local
▪ Run multiple HS2 servers side by side
• Doesn't support Workload Management in HDP 3.0
EDW Features
Materialized Views & DW optimizations
• Accelerate aggregates and joins with MVs
• View navigation via CBO/Calcite
• Optionally allow rewrites against
out-of-date materializations
Materialized view
How many unique city-pairs are there?
SELECT count(*)/2
FROM (
SELECT dest,origin,count(*)
FROM flights_hdfs
GROUP BY dest,origin
) as T;
Sub-query can be materialized
CREATE MATERIALIZED VIEW mv1
AS
SELECT dest,origin,count(*)
FROM flights_hdfs
GROUP BY dest,origin;
Materialized view navigation
The query planner will automatically navigate to existing views
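The idea behind the rewrite can be tried out with Python's built-in sqlite3 module. This is only an analogy: SQLite has no materialized views and no automatic rewriting, so a plain table named `mv1` stands in for the view and the "rewritten" query is written by hand. The table name `flights` and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (origin TEXT, dest TEXT)")
conn.executemany(
    "INSERT INTO flights VALUES (?, ?)",
    [("SFO", "JFK"), ("JFK", "SFO"), ("SFO", "JFK"), ("SFO", "SEA"), ("SEA", "SFO")],
)

# Stand-in for CREATE MATERIALIZED VIEW mv1: the aggregate is computed once
# and stored.
conn.execute(
    "CREATE TABLE mv1 AS "
    "SELECT dest, origin, count(*) AS flight_count FROM flights GROUP BY dest, origin"
)

# The query as the user writes it, against the base table.
original_sql = (
    "SELECT count(*) FROM "
    "(SELECT dest, origin, count(*) FROM flights GROUP BY dest, origin) AS t"
)
# What a planner could run instead once it recognizes the matching sub-query.
rewritten_sql = "SELECT count(*) FROM mv1"

directed_pairs_original = conn.execute(original_sql).fetchone()[0]
directed_pairs_rewritten = conn.execute(rewritten_sql).fetchone()[0]
city_pairs = directed_pairs_rewritten / 2
```

Both queries return the same count, but the second one reads the small pre-aggregated table instead of scanning and grouping the base data again.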
Hive + Druid : One SQL Interface Across Real-Time and Historical
[Diagram labels: a Unified SQL Layer over OLAP Cubes (Streaming Data, Pre-Aggregate, Real-Time Query) and SQL Tables (Historical Data, ACID MERGE, Deep Analytics)]
• Easily ingest event data into OLAP cubes
• Keep data up-to-date with Hive MERGE
• Build OLAP Cubes from Hive
• Archive data to Hive for history
• Run OLAP queries in real-time or Deep Analytics over all history
Information schema
Question:
Can you find out which tables in which databases have a column whose name contains "ssn"?
SELECT columns.table_schema, columns.table_name
FROM information_schema.columns
WHERE column_name LIKE '%ssn%';
This is very useful for EDW offload use cases where some queries depend on
databases’ metadata information.
HIVE-1555: JDBC connector – Data federation
⬢ How did we build the information_schema?
– We basically mapped part of the metastore into Hive's table space!
⬢ Under the hood we used the Hive-JDBC connector
⬢ Read-only for now
⬢ Manual table mapping for now
JDBC Table mapping example
In Postgres:
CREATE TABLE HiveTable
(
id INT,
name varchar
);
In Hive:
CREATE EXTERNAL TABLE HiveTable
(
id INT,
name STRING
)
STORED BY
'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
"hive.sql.database.type" = "POSTGRES",
"hive.sql.jdbc.driver"="org.postgresql.Driver",
"hive.sql.jdbc.url"="jdbc:postgresql://hwx-demo-1.field.hortonworks.com:5432/jdbctest",
"hive.sql.dbcp.username"="jdbctest",
"hive.sql.dbcp.password"="",
"hive.sql.query" = "SELECT ID, NAME FROM hivetable",
"hive.sql.column.mapping" = "id=ID, name=NAME",
"hive.jdbc.update.on.duplicate" = "true"
);
Database key 101
How many types of keys can you name in the context of databases?
⬢ Primary key
⬢ Secondary key
⬢ Unique key
⬢ Foreign key
⬢ Composite key
⬢ Natural key
⬢ Surrogate key
⬢ Super key
⬢ Candidate key
Why do they matter?
⬢ EDW solutions usually depend on those features
⬢ Keys allow the database engine to make assumptions and run faster
⬢ Essential for building relational databases.
Surrogate key generation
⬢ A surrogate key can replace wide, multi-column composite keys.
⬢ JOINs on 2 integers are way faster than JOINs on 2 strings
SELECT ROW_NUMBER() OVER () as row_num, * FROM airlines;
+----------+----------------+----------------------------------------------------+
| row_num | airlines.code | airlines.description |
+----------+----------------+----------------------------------------------------+
| 1 | 02Q | Titan Airways |
| 2 | 04Q | Tradewind Aviation |
| 3 | 05Q | Comlux Aviation, AG |
| 4 | 06Q | Master Top Linhas Aereas Ltd. |
| 5 | 07Q | Flair Airlines Ltd. |
| 6 | 09Q | Swift Air, LLC |
| 7 | 0BQ | DCA |
| 8 | 0CQ | ACM AIR CHARTER GmbH |
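The same idea in a few lines of Python, as a sketch rather than anything Hive does internally: the helper `assign_surrogate_keys` and the sample fact rows are invented for the example. Each distinct natural key gets a small integer in first-seen order, and fact rows then carry that integer instead of repeating the wide string key.

```python
def assign_surrogate_keys(natural_keys):
    """Map each distinct natural key to a small integer, in first-seen order."""
    mapping = {}
    for key in natural_keys:
        if key not in mapping:
            mapping[key] = len(mapping) + 1
    return mapping


# Dimension rows identified by a wide, two-column string key.
airlines = [
    ("02Q", "Titan Airways"),
    ("04Q", "Tradewind Aviation"),
    ("05Q", "Comlux Aviation, AG"),
]
airline_ids = assign_surrogate_keys(airlines)

# Fact rows carry the integer key instead of repeating both strings.
flights = [
    {"flight": "F1", "airline": ("04Q", "Tradewind Aviation")},
    {"flight": "F2", "airline": ("02Q", "Titan Airways")},
    {"flight": "F3", "airline": ("04Q", "Tradewind Aviation")},
]
fact_rows = [(f["flight"], airline_ids[f["airline"]]) for f in flights]
```

A later join between the fact rows and the dimension then compares one integer per row rather than two strings.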
NOT NULL and constraints (CONSTRAINT)
⬢ Essential for data integrity
⬢ Only works on ACID and Append-only tables
⬢ hive.constraint.notnull.enforce = true
Example:
CREATE TABLE Persons (
ID Int NOT NULL,
Name String NOT NULL,
Age Int
);
Default value
⬢ Ensures a value exists
⬢ Can be overwritten in INSERT/UPDATE
statements
⬢ Useful in EDW offload cases
Example:
CREATE TABLE Persons_default (
ID Int NOT NULL,
Name String NOT NULL,
Age Int,
Creator String DEFAULT CURRENT_USER(),
CreateDate Date DEFAULT CURRENT_DATE()
);
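How NOT NULL and DEFAULT behave can be tried locally with Python's built-in sqlite3 module. Note the assumptions: this runs on SQLite, not Hive, so the types are SQLite's, and the literal default 'etl_user' replaces CURRENT_USER(), which SQLite does not provide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE persons_default ("
    " id INTEGER NOT NULL,"
    " name TEXT NOT NULL,"
    " age INTEGER,"
    " creator TEXT DEFAULT 'etl_user',"
    " create_date TEXT DEFAULT CURRENT_DATE)"
)

# Columns left out of the INSERT pick up their defaults.
conn.execute("INSERT INTO persons_default (id, name) VALUES (1, 'Ann')")
# An explicit value overrides the default.
conn.execute("INSERT INTO persons_default (id, name, creator) VALUES (2, 'Bo', 'admin')")

# A NULL in a NOT NULL column is rejected.
try:
    conn.execute("INSERT INTO persons_default (id, name) VALUES (3, NULL)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

creators = [
    row[0] for row in conn.execute("SELECT creator FROM persons_default ORDER BY id")
]
dates_filled = conn.execute(
    "SELECT count(*) FROM persons_default WHERE create_date IS NOT NULL"
).fetchone()[0]
```

Only the two valid rows are stored, and both have a creation date even though no INSERT supplied one.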
ACID v2
▪ Performance just as good as regular non-ACID tables
• Simpler solution, no more requirement for bucketing
▪ There are two parts to ACID v2:
• one is insert-only ACID and the other is full CRUD ACID
▪ Insert-only ACID is supported for *ALL* formats
• Parquet
• Avro
• ORC
• Text
▪ Enables new optimizations
• Incremental updates of MV & query cache
• Query cache consistency
▪ Performance may degrade when there are many delta files
– Run compaction
▪ Cannot be downgraded to ACID v1
▪ Fully compatible with native cloud storage
ACID v2
CREATE TABLE hello_acid (key int, value int)
PARTITIONED BY (load_date date)
CLUSTERED BY(key) INTO 3 BUCKETS
STORED AS ORC TBLPROPERTIES ('transactional'='true');
CREATE TABLE hello_acid_v2 (load_date date, key int, value int);
HDP3: EDW ingestion pipeline
[Diagram labels: Kafka-Druid-Hive ingest, Kafka-Hive streaming ingest, Druid, ACID tables, LLAP interface]
Real-time analytics
• Druid answers in near real-time
Easy to use
• Query any data via LLAP
• No need to de-ACID tables
• No bucketing required
• Calcite talks SQL
• Materialization just works
• Cache just works
HDP 3 – New Unified Streaming Ingest Pipeline
[Diagram labels: Unified ingestion connectors; Real-time rollup, ACID tables, Materialized views; Streaming Data, Historical Data; Hive LLAP Unified SQL; DAS | SuperSet | JDBC]
Real-time ingest
▪ Read from Kafka
▪ Dual write to Hive and Druid
Real-time analytics
• Druid answers in near real-time
• Hive ACID keeps data in sync
Unified API
• Calcite talks unified SQL
• Optimizer automatically uses pre-computed materializations
Easy to use tooling
• DAS: Manage and Optimize
• SuperSet: Dashboard and reports
• JDBC: Tableau, Excel et al
HDP – Security & Governance
Industry First: Dynamic Tag-based Security Policies
[Diagram: Ranger manages access policies and audit logs; its policies (classification, prohibition, time, location) are evaluated by a PDP with a resource cache. Atlas tracks metadata and lineage, holding tags, assets, and entities in its metastore for the entities in the data lake: Hive tables, HDFS files, HBase tables, streams, pipelines, and feeds. The Atlas client subscribes to a topic and gets metadata updates.]
Unique Security Features within HDP for SQL Users
[Logos: Ranger, Hive]
Row Level Security in Hive
▪ Control Access to Rows in Hive Tables based on Context!
▪ Improve reliability and robustness of HDP by providing Row Level Security to Hive tables and reducing surface area of security system
▪ Restrict data row access based on user characteristics (e.g. group membership) AND runtime context
▪ Use Cases:
• A hospital can create a security policy that allows doctors to view data rows only for their own patients
• A bank can create a policy to restrict access to rows of financial data based on the employee's business division, locale or based on the employee's role
• A multi-tenant application can create logical separation of each tenant's data so that each tenant can see only its own data rows.
Dynamic Data Masking of Hive Columns
▪ Protect Sensitive Data in real-time with Dynamic Data Masking/Obfuscation!
▪ Mask or anonymize sensitive columns of data (e.g. PII, PCI, PHI) from Hive query output
▪ Benefits
• Sensitive information never leaves database
• No changes are required at the application or Hive layer
• No need to produce additional protected duplicate versions of datasets
• Simple & easy to setup masking policies
Security: Dynamic Row Filtering & Column Masking
Source table ww_customers:
Country | National ID | CC No | DOB | MRN | Name | Policy ID
US | 232323233 | 4539067047629850 | 9/12/1969 | 8233054331 | John Doe | nj23j424
US | 333287465 | 5391304868205600 | 8/13/1979 | 3736885376 | Jane Doe | cadsd984
Germany | T22000129 | 4532786256545550 | 3/5/1963 | 876452830A | Ernie Schwarz | KK-2345909
Ranger Policy Enforcement: the query is rewritten based on dynamic Ranger policies, which filter rows by region and apply the relevant column masking.
User 1: Joe (Location: US, Group: Analyst)
Original Query:
SELECT country, nationalid, ccnumber, mrn, name FROM ww_customers
Users from the US Analyst group see data for US persons, with CC and National ID (SSN) as masked values and MRN nullified:
Country | National ID | CC No | MRN | Name
US | xxxxx3233 | 4539 xxxx xxxx xxxx | null | John Doe
US | xxxxx7465 | 5391 xxxx xxxx xxxx | null | Jane Doe
User 2: Ivanna (Location: EU, Group: HR)
Original Query:
SELECT country, nationalid, name, mrn FROM ww_customers
EU HR policy admins can see unmasked values but are restricted by row filtering policies to data for EU persons only:
Country | National ID | Name | MRN
Germany | T22000129 | Ernie Schwarz | 876452830A
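The effect of the two policies can be reproduced with a short Python sketch. It is a model of the outcome only, not of how Ranger rewrites queries: the `POLICIES` table, the mask helpers, and the one-country stand-in for "EU" are assumptions made for the example, while the customer rows are the ones shown in the example.

```python
CUSTOMERS = [
    {"country": "US", "nationalid": "232323233", "ccnumber": "4539067047629850",
     "mrn": "8233054331", "name": "John Doe"},
    {"country": "US", "nationalid": "333287465", "ccnumber": "5391304868205600",
     "mrn": "3736885376", "name": "Jane Doe"},
    {"country": "Germany", "nationalid": "T22000129", "ccnumber": "4532786256545550",
     "mrn": "876452830A", "name": "Ernie Schwarz"},
]


def show_last4(value):
    """Keep the last four characters and mask the rest."""
    return "x" * (len(value) - 4) + value[-4:]


def show_first4(value):
    """Keep the first four characters and mask the rest in groups of four."""
    return value[:4] + " xxxx" * ((len(value) - 4) // 4)


def nullify(value):
    return None


# Hypothetical policy table keyed by (location, group).
POLICIES = {
    ("US", "Analyst"): {
        "row_filter": lambda row: row["country"] == "US",
        "masks": {"nationalid": show_last4, "ccnumber": show_first4, "mrn": nullify},
    },
    ("EU", "HR"): {
        "row_filter": lambda row: row["country"] in {"Germany"},  # stand-in for EU
        "masks": {},
    },
}


def run_select(columns, location, group):
    """Apply the caller's row filter, then mask each selected column."""
    policy = POLICIES[(location, group)]
    result = []
    for row in CUSTOMERS:
        if not policy["row_filter"](row):
            continue
        masked = []
        for column in columns:
            mask = policy["masks"].get(column)
            masked.append(mask(row[column]) if mask else row[column])
        result.append(tuple(masked))
    return result


joe_rows = run_select(["country", "nationalid", "ccnumber", "mrn", "name"], "US", "Analyst")
ivanna_rows = run_select(["country", "nationalid", "name", "mrn"], "EU", "HR")
```

Both users query the same table, and neither query mentions the policy; the filtering and masking happen on the way out.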
Data Analytics Studio
Hortonworks Data Analytics Studio
Available both in the cloud and on-premises as part of DataPlane Services
Reached GA in September 2018
[Diagram: Hortonworks DataPlane Service]
Extensible services: Data Lifecycle Manager, Data Steward Studio, Data Analytics Studio, IBM DSX*, + other (partner), …
Core capabilities: data source integration, data services catalog, security controls
Multiple clusters and sources; multi / hybrid
*not yet available, coming soon
Why is my query slow?
Noisy neighbors: Expensive Query log
Poor schema: Storage Optimizations
Inefficient queries: Query Optimizations
Unstable demand: Demand Shifting
Hortonworks Data Analytics Studio
Optimize Your Hive Workloads
Part of the Hortonworks DataPlane Services
Data Analytics Studio (DAS)
SOLUTIONS: Full-featured auto-complete, direct download of results, quick data preview, and many other quality-of-life improvements.
Data Analytics Studio (DAS)
SOLUTIONS: Data Analytics Studio provides a database heatmap to quickly discover which parts of your cluster are being utilized more
Data Analytics Studio (DAS)
SOLUTIONS: Pre-defined searches to quickly narrow down problematic queries in a large cluster
Data Analytics Studio (DAS)
SOLUTIONS: Heuristic recommendation engine
Fully self-serviced query and storage optimization
Data Analytics Studio (DAS)
SOLUTIONS: Built-in batch operations
No more scripting needed for day-to-day operations
Summary
Hive 3 - Scalable Data Warehousing on Hadoop
Capabilities and applications:
Batch SQL
• ETL
• Reporting
• Data Mining
• Deep Analytics
OLAP / Cube
• Multidimensional Analytics
• MDX Tools
• Excel
Interactive SQL
• Reporting
• BI Tools: Tableau, Microstrategy, Cognos
Sub-Second SQL (Hive LLAP)
• Ad-Hoc
• Drill-Down
• BI Tools: Tableau, Excel
ACID / MERGE
• Continuous Ingestion from Operational DBMS
• Slowly Changing Dimensions
Core Platform
• Scale-Out Storage
• Petabyte Scale Processing
• Core SQL Engine
• Apache Tez: Scalable Distributed Processing
• Advanced Cost-Based Optimizer
• Connectivity
• Advanced Security
• JDBC / ODBC
• Comprehensive SQL:2011 Coverage
Summary
• Our vision of the future EDW is a unified, open source data access layer that works across technologies and in a hybrid model
• Druid, Kafka and Hive integration enables real-time analytics on event streams
• Offloading is still the primary use case, and Hive is becoming a full-featured database
• ACID on by default enables data change at scale, which is key to supporting GDPR
• Usability and visibility with the release of Data Analytics Studio (DAS)
Questions?
Thank you

More Related Content

What's hot

Dataflow with Apache NiFi - Crash Course - HS16SJ
Dataflow with Apache NiFi - Crash Course - HS16SJDataflow with Apache NiFi - Crash Course - HS16SJ
Dataflow with Apache NiFi - Crash Course - HS16SJ
DataWorks Summit/Hadoop Summit
 
Dataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San Jose
Dataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San JoseDataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San Jose
Dataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San Jose
Aldrin Piri
 
Apache Hadoop YARN: State of the Union
Apache Hadoop YARN: State of the UnionApache Hadoop YARN: State of the Union
Apache Hadoop YARN: State of the Union
DataWorks Summit
 
Deep learning on HDP 2018 Prague
Deep learning on HDP 2018 PragueDeep learning on HDP 2018 Prague
Deep learning on HDP 2018 Prague
Timothy Spann
 
Intelligently Collecting Data at the Edge – Intro to Apache MiNiFi
Intelligently Collecting Data at the Edge – Intro to Apache MiNiFiIntelligently Collecting Data at the Edge – Intro to Apache MiNiFi
Intelligently Collecting Data at the Edge – Intro to Apache MiNiFi
DataWorks Summit
 
The Avant-garde of Apache NiFi
The Avant-garde of Apache NiFiThe Avant-garde of Apache NiFi
The Avant-garde of Apache NiFi
DataWorks Summit/Hadoop Summit
 
Intelligently Collecting Data at the Edge - Intro to Apache MiNiFi
Intelligently Collecting Data at the Edge - Intro to Apache MiNiFiIntelligently Collecting Data at the Edge - Intro to Apache MiNiFi
Intelligently Collecting Data at the Edge - Intro to Apache MiNiFi
DataWorks Summit
 
Apache NiFi: Ingesting Enterprise Data At Scale
Apache NiFi:   Ingesting Enterprise Data At Scale Apache NiFi:   Ingesting Enterprise Data At Scale
Apache NiFi: Ingesting Enterprise Data At Scale
Timothy Spann
 
Apache NiFi Crash Course - San Jose Hadoop Summit
Apache NiFi Crash Course - San Jose Hadoop SummitApache NiFi Crash Course - San Jose Hadoop Summit
Apache NiFi Crash Course - San Jose Hadoop Summit
Aldrin Piri
 
Revolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and ZeppelinRevolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and Zeppelin
DataWorks Summit/Hadoop Summit
 
Connect Data and Devices with Apache NiFi
Connect Data and Devices with Apache NiFiConnect Data and Devices with Apache NiFi
Connect Data and Devices with Apache NiFi
Data Works MD
 
Future of Data New Jersey - HDF 3.0 Deep Dive
Future of Data New Jersey - HDF 3.0 Deep DiveFuture of Data New Jersey - HDF 3.0 Deep Dive
Future of Data New Jersey - HDF 3.0 Deep Dive
Aldrin Piri
 
Open Source Predictive Analytics Pipeline with Apache NiFi and MiniFi Princeton
Open Source Predictive Analytics Pipeline with Apache NiFi and MiniFi PrincetonOpen Source Predictive Analytics Pipeline with Apache NiFi and MiniFi Princeton
Open Source Predictive Analytics Pipeline with Apache NiFi and MiniFi Princeton
Timothy Spann
 
REAL-TIME INGESTING AND TRANSFORMING SENSOR DATA & SOCIAL DATA w/ NIFI + TENS...
REAL-TIME INGESTING AND TRANSFORMING SENSOR DATA & SOCIAL DATA w/ NIFI + TENS...REAL-TIME INGESTING AND TRANSFORMING SENSOR DATA & SOCIAL DATA w/ NIFI + TENS...
REAL-TIME INGESTING AND TRANSFORMING SENSOR DATA & SOCIAL DATA w/ NIFI + TENS...
Timothy Spann
 
Introduction to HDF 3.0
Introduction to HDF 3.0Introduction to HDF 3.0
Introduction to HDF 3.0
Timothy Spann
 
Real-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFiReal-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFi
Manish Gupta
 
Data at Scales and the Values of Starting Small with Apache NiFi & MiNiFi
Data at Scales and the Values of Starting Small with Apache NiFi & MiNiFiData at Scales and the Values of Starting Small with Apache NiFi & MiNiFi
Data at Scales and the Values of Starting Small with Apache NiFi & MiNiFi
Aldrin Piri
 
Enterprise Data Science at Scale @ Princeton, NJ 14-Nov-2017
Enterprise Data Science at Scale @ Princeton, NJ 14-Nov-2017Enterprise Data Science at Scale @ Princeton, NJ 14-Nov-2017
Enterprise Data Science at Scale @ Princeton, NJ 14-Nov-2017
Timothy Spann
 
IoT Crash Course Hadoop Summit SJ
IoT Crash Course Hadoop Summit SJIoT Crash Course Hadoop Summit SJ
IoT Crash Course Hadoop Summit SJ
Daniel Madrigal
 
Enabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical EnterpriseEnabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical Enterprise
Hortonworks
 

What's hot (20)

Dataflow with Apache NiFi - Crash Course - HS16SJ
Dataflow with Apache NiFi - Crash Course - HS16SJDataflow with Apache NiFi - Crash Course - HS16SJ
Dataflow with Apache NiFi - Crash Course - HS16SJ
 
Dataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San Jose
Dataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San JoseDataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San Jose
Dataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San Jose
 
Apache Hadoop YARN: State of the Union
Apache Hadoop YARN: State of the UnionApache Hadoop YARN: State of the Union
Apache Hadoop YARN: State of the Union
 
Deep learning on HDP 2018 Prague
Deep learning on HDP 2018 PragueDeep learning on HDP 2018 Prague
Deep learning on HDP 2018 Prague
 
Intelligently Collecting Data at the Edge – Intro to Apache MiNiFi
Intelligently Collecting Data at the Edge – Intro to Apache MiNiFiIntelligently Collecting Data at the Edge – Intro to Apache MiNiFi
Intelligently Collecting Data at the Edge – Intro to Apache MiNiFi
 
The Avant-garde of Apache NiFi
The Avant-garde of Apache NiFiThe Avant-garde of Apache NiFi
The Avant-garde of Apache NiFi
 
Intelligently Collecting Data at the Edge - Intro to Apache MiNiFi
Intelligently Collecting Data at the Edge - Intro to Apache MiNiFiIntelligently Collecting Data at the Edge - Intro to Apache MiNiFi
Intelligently Collecting Data at the Edge - Intro to Apache MiNiFi
 
Apache NiFi: Ingesting Enterprise Data At Scale
Apache NiFi:   Ingesting Enterprise Data At Scale Apache NiFi:   Ingesting Enterprise Data At Scale
Apache NiFi: Ingesting Enterprise Data At Scale
 
Apache NiFi Crash Course - San Jose Hadoop Summit
Apache NiFi Crash Course - San Jose Hadoop SummitApache NiFi Crash Course - San Jose Hadoop Summit
Apache NiFi Crash Course - San Jose Hadoop Summit
 
Revolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and ZeppelinRevolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and Zeppelin
 
Connect Data and Devices with Apache NiFi
Connect Data and Devices with Apache NiFiConnect Data and Devices with Apache NiFi
Connect Data and Devices with Apache NiFi
 
Future of Data New Jersey - HDF 3.0 Deep Dive
Future of Data New Jersey - HDF 3.0 Deep DiveFuture of Data New Jersey - HDF 3.0 Deep Dive
Future of Data New Jersey - HDF 3.0 Deep Dive
 
Open Source Predictive Analytics Pipeline with Apache NiFi and MiniFi Princeton
Open Source Predictive Analytics Pipeline with Apache NiFi and MiniFi PrincetonOpen Source Predictive Analytics Pipeline with Apache NiFi and MiniFi Princeton
Open Source Predictive Analytics Pipeline with Apache NiFi and MiniFi Princeton
 
REAL-TIME INGESTING AND TRANSFORMING SENSOR DATA & SOCIAL DATA w/ NIFI + TENS...
REAL-TIME INGESTING AND TRANSFORMING SENSOR DATA & SOCIAL DATA w/ NIFI + TENS...REAL-TIME INGESTING AND TRANSFORMING SENSOR DATA & SOCIAL DATA w/ NIFI + TENS...
REAL-TIME INGESTING AND TRANSFORMING SENSOR DATA & SOCIAL DATA w/ NIFI + TENS...
 
Introduction to HDF 3.0
Introduction to HDF 3.0Introduction to HDF 3.0
Introduction to HDF 3.0
 
Real-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFiReal-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFi
 
Data at Scales and the Values of Starting Small with Apache NiFi & MiNiFi
Data at Scales and the Values of Starting Small with Apache NiFi & MiNiFiData at Scales and the Values of Starting Small with Apache NiFi & MiNiFi
Data at Scales and the Values of Starting Small with Apache NiFi & MiNiFi
 
Enterprise Data Science at Scale @ Princeton, NJ 14-Nov-2017
Enterprise Data Science at Scale @ Princeton, NJ 14-Nov-2017Enterprise Data Science at Scale @ Princeton, NJ 14-Nov-2017
Enterprise Data Science at Scale @ Princeton, NJ 14-Nov-2017
 
IoT Crash Course Hadoop Summit SJ
IoT Crash Course Hadoop Summit SJIoT Crash Course Hadoop Summit SJ
IoT Crash Course Hadoop Summit SJ
 
Enabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical EnterpriseEnabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical Enterprise
 

Similar to Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善

What's new in apache hive
What's new in apache hive What's new in apache hive
What's new in apache hive
DataWorks Summit
 
SoCal BigData Day
Hive 3.0: New Features and Performance Improvements in the Latest HDP Release

  • 1. © Hortonworks Inc. 2011–2018. All rights reserved. Hive 3.0: New Features and Performance Improvements in the Latest HDP Release. Zhen Zeng, Solution Engineer, 2018/09/21
  • 2. Agenda: About the speaker; Hive basics; What's new in Hive 3; Summary
  • 3. About the speaker
  • 4. About the speaker: Zhen Zeng (曾臻), Solution Engineer at Hortonworks. Previously worked as an engineer at Yahoo, an IT consulting firm, and a systems integrator. Has experience as an architect, designer, and implementer in big data, data governance, PaaS, and web applications.
  • 5. Hive basics
  • 6. Questions: Who is using Hive today? Who thinks Hive can only be used for batch processing?
  • 7. SQL Is King: Powerful; Familiar; Flexible; Widely adopted; Rich surrounding tooling (deep ecosystem)
  • 8. RDBMS vs SQL on Hadoop. RDBMS (one logical system): Client, SQL Engine, Meta Data, Data (tables). SQL on Hadoop (could be in three separate systems): Client, SQL Engine, Meta Data (e.g. MySQL), Data in HDFS.
  • 9. SQL on Hadoop – Schema on Read. RDBMS (Schema on Write): 1. CREATE TABLE …; 2. INSERT INTO … (schema is checked here); 3. SELECT …. SQL on Hadoop (Schema on Read): 1. Ingest data into HDFS (e.g. with Flume); 2. CREATE TABLE …; 3. SELECT … (schema is checked here). Components: Client, SQL Engine, HDFS, Meta Data (e.g. MySQL).
  • 10. SQL on Hadoop – Distributed processing. RDBMS: Client, RDBMS on machine 1. Hive: Client, Hive Server on machine 1, Data Nodes on machine 2, machine 3, machine 4, machine 5, ..., machine n.
  • 11. What is Apache Hive? Apache Hive: SQL gateway to Hadoop. Features: Extensive SQL:2011 support; ACID transactions; In-memory caching; Cost-based optimizer; User-based dynamic security; Replication and disaster recovery; JDBC and ODBC support; Compatible with every major BI tool; Proven on data at 300+ PB scale
  • 12. Hive LLAP: Hadoop-native solution for interactive analytics. Open source; Hadoop-native integration; Security. Diagram components: Hive Server 2, LLAP, Spark, MR, Pig, YARN, HDFS. Workloads shown: ETL, interactive queries.
  • 13. Apache Hive: Modern Architecture. Storage: columnar storage (ORCFile, Parquet); unstructured data (JSON, CSV, Text, Avro, Custom, Weblog). Engine: SQL engines (row engine, vector engine). SQL: SQL support (SQL:2011), optimizer, HCatalog, HiveServer2. Cache: block cache (Linux cache), vector cache. Distributed execution: Hadoop 1 (MapReduce), Hadoop 2 (Tez, Spark), LLAP persistent server. Clients: BI tools over JDBC/ODBC. Legend: historical, current.
  • 14. Apache Hive: Fast Facts. Most queries per hour: 100,000 queries per hour (Yahoo Japan). Analytics performance: 100 million rows per second per node (with Hive LLAP). Largest Hive warehouse: 300+ PB raw storage (Facebook). Largest cluster: 4,500+ nodes (Yahoo).
  • 15. The evolution of Hive: MR, Tez, Tez + LLAP. Map-Reduce: intermediate results in HDFS. Tez: optimized pipeline. Tez with LLAP: resident process on nodes, in-memory columnar cache; map tasks read HDFS. "Hive can only do batch" is a thing of the past: processing speed has improved dramatically since Hive 2.
  • 16. Continuous evolution towards scalable performance. Stages: Hive 1.0 (Hive + MR), Batch SQL; Hive 1.3 (Hive + Tez), Analytical SQL; Hive 2.0 (Hive + Tez + LLAP), Interactive SQL; Hive 3.0 (HDP 3), Interactive SQL. Speedup labels on the chart: 20X and 100X. Workloads for the first three stages, in slide order: ETL; reporting; data mining; deep analytics; reporting; BI tools (Microstrategy, Cognos); ad-hoc; drill-down; agile BI tools (Tableau, Power BI). Hive 3.0: ad-hoc; result cache; better workload management; agile BI tools (Tableau, Power BI).
  • 17. What's new in Hive 3
  • 18. A greatly evolved Hive 3. More use cases: EDW offload; interactive query; OLAP query; real-time ingestion. Unified SQL: data federation (SQL Server, Oracle, etc.); Spark-Hive connector. High performance: low latency; fast response time. Cloud native: S3, GCS, Azure. Highlights: real-time data streams; workload management; ACID transactions; materialized views; scales horizontally to petabytes.
  • 19. Hive LLAP – MPP Performance at Hadoop Scale. SQL queries arrive over ODBC / JDBC at HiveServer2 (query endpoint). Query coordinators hand work to LLAP daemons, each with query executors and an in-memory cache. YARN provides cluster resource management. Deep storage: HDFS and compatible stores (S3, WASB, Isilon). Pools: BI pool, ETL pool, background pool. Batch and interactive workloads can easily run on the same cluster.
  • 20. EDW analyst pipeline (Tableau, BI systems). Query result cache: results return from HDFS/cache directly; reduces load from repetitive queries. Workload management: more queries can run concurrently; reduces resource starvation in large clusters; also active/passive HA. Materialized views, surrogate keys, constraints: more "tools" for the optimizer to use; more "tools" for DBAs to tune/optimize; invisible tuning of the DB from the users' perspective. ACID v2 and ACID on by default: ACID v2 is as fast as regular tables.
  • 21. HIVE-18513: Query result cache. Returns query results directly from storage (e.g. HDFS) without actually executing the query. Prerequisite: the same query has been executed before. Dashboards and reports often issue duplicate queries, so the cache saves resources and improves performance. Figure: without cache vs. with cache.
  • 22. HIVE-18513: Query result cache details. hive.query.results.cache.enabled=true (on by default). Works only with Hive managed tables: if you JOIN an external table with a Hive managed table, Hive falls back to executing the full query, because Hive cannot know whether the external table data has changed. Works with ACID: if a Hive table has been updated, the query is rerun automatically. Different from the LLAP cache: the LLAP cache caches the data that was read, so multiple queries benefit by avoiding reads from disk (it speeds up the read path), whereas the result cache effectively bypasses execution of the query. Stored at /tmp/hive/__resultcache__/; default space is 2 GB with LRU eviction; the size can be changed with hive.query.results.cache.max.size (bytes).
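The behaviour described on slides 21 and 22 (serve a repeated query from the cache, rerun it when the table has changed, evict least recently used entries once the size limit is reached) can be sketched as a small Python model. This is an illustration only, not Hive's implementation: `ResultCache`, `run_query`, and the integer `table_version` standing in for "the table has been updated" are hypothetical names introduced here.

```python
from collections import OrderedDict


class ResultCache:
    """Toy model of a size-bounded query result cache with LRU eviction."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used = 0
        # (query text, table version) -> (rows, size in bytes)
        self.entries = OrderedDict()

    def get(self, query, table_version):
        key = (query, table_version)
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key][0]

    def put(self, query, table_version, rows, size):
        if size > self.max_bytes:
            return  # result is too large to cache at all
        key = (query, table_version)
        if key in self.entries:
            self.used -= self.entries.pop(key)[1]
        while self.used + size > self.max_bytes:
            # Evict the least recently used entry.
            _, (_, old_size) = self.entries.popitem(last=False)
            self.used -= old_size
        self.entries[key] = (rows, size)
        self.used += size


def run_query(cache, query, table_version, execute):
    """Return (rows, served_from_cache); execute() is only called on a miss."""
    rows = cache.get(query, table_version)
    if rows is not None:
        return rows, True
    rows = execute()
    cache.put(query, table_version, rows, size=len(repr(rows)))
    return rows, False
```

Because the table version is part of the key, an update to the table makes the next identical query miss the cache and run again, which mirrors the "works with ACID" point on the slide.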
  • 23. LLAP Resource Management in 3.x. Resource plans, e.g. Daytime, Nighttime, EndOfQuarter. Resource pools: capacity based; fair or FIFO scheduling. Automatic mapping: map query to pool by user, group, or application. Triggers to move queries or kill queries, e.g. when the output is too large or the run time is too long.
  • 24. Resource plan example:
      CREATE RESOURCE PLAN daytime;
      CREATE POOL daytime.bi WITH ALLOC_FRACTION=0.8, QUERY_PARALLELISM=5;
      CREATE POOL daytime.etl WITH ALLOC_FRACTION=0.2, QUERY_PARALLELISM=20;
      CREATE RULE downgrade IN daytime WHEN total_runtime > 300 THEN MOVE etl;
      ADD RULE downgrade TO bi;
      CREATE APPLICATION MAPPING tableau IN daytime TO bi;
      ALTER PLAN daytime SET DEFAULT POOL = etl;
      APPLY PLAN daytime;
    Diagram: plan daytime with pools bi: 75% and etl: 25%; downgrade when total_runtime > 300 (the query moves to another queue).
  • 25. HIVE-17481: LLAP workload management. Share LLAP cluster resources efficiently: resource allocation per user policy; separate ETL and BI, etc. Resource-based guardrails: protect against long-running queries and high memory usage. Improved, query-aware scheduling: the scheduler is aware of query characteristics, types, etc.; fragments are easier to pre-empt than containers; queries are guaranteed a fixed share of cluster resources and can also use idle resources so nothing is wasted.
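To make the resource plan on slide 24 concrete, the following Python sketch models pools, an application mapping, a default pool, and a "move" trigger. It is a toy model under stated assumptions, not the LLAP scheduler: the class and method names (`Pool`, `Query`, `ResourcePlan`, `admit`, `apply_move_trigger`) are hypothetical, and the numbers are the ones in the slide's statements.

```python
from dataclasses import dataclass, field


@dataclass
class Query:
    app: str
    total_runtime: int = 0  # same abstract unit as the slide's total_runtime


@dataclass
class Pool:
    name: str
    alloc_fraction: float
    query_parallelism: int
    queries: list = field(default_factory=list)


class ResourcePlan:
    def __init__(self, pools, default_pool, app_mappings):
        self.pools = {p.name: p for p in pools}
        self.default_pool = default_pool
        self.app_mappings = app_mappings  # application name -> pool name

    def admit(self, query):
        """Place a query in its mapped pool, or in the default pool."""
        name = self.app_mappings.get(query.app, self.default_pool)
        self.pools[name].queries.append(query)
        return name

    def apply_move_trigger(self, source, target, runtime_limit):
        """Move queries whose runtime exceeds the limit from source to target."""
        src, dst = self.pools[source], self.pools[target]
        moved = [q for q in src.queries if q.total_runtime > runtime_limit]
        src.queries = [q for q in src.queries if q.total_runtime <= runtime_limit]
        dst.queries.extend(moved)
        return moved


daytime = ResourcePlan(
    pools=[Pool("bi", 0.8, 5), Pool("etl", 0.2, 20)],
    default_pool="etl",
    app_mappings={"tableau": "bi"},
)
```

A Tableau query admitted to `bi` that runs past the limit would be moved by `daytime.apply_move_trigger("bi", "etl", 300)`, which is the "downgrade" rule from the slide expressed as a function call.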
  • 26. Tuning for higher concurrency. NVMe SSDs for: the Metastore DB backend (50x to 60x improvement in query cache performance (hive_locks)); the NameNode (5-6x improvement in JDBC startup performance; keep NameNode edit logs on SSD); ZooKeeper (RM state store, HS2 active/passive info, LLAP service registry); HDFS (/tmp folder, YARN logs and YARN local). Running multiple HS2 servers together: does not support workload management in HDP 3.0.
  • 27. EDW Features
  • 28. Materialized Views & DW optimizations. MVs accelerate aggregates and joins; view navigation via CBO/Calcite; optionally allow rewrites against out-of-date materializations.
  • 29. Materialized view. How many unique city-pairs are there?
      SELECT count(*)/2 FROM (SELECT dest, origin, count(*) FROM flights_hdfs GROUP BY dest, origin) AS T;
    The sub-query can be materialized:
      CREATE MATERIALIZED VIEW mv1 AS SELECT dest, origin, count(*) FROM flights_hdfs GROUP BY dest, origin;
  • 30. Materialized view navigation. The query planner will automatically navigate to existing views.
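The rewrite that slides 29 and 30 describe can be checked by hand on a tiny data set. The sketch below uses Python's built-in sqlite3 module purely as a stand-in engine: SQLite has no materialized views, so `mv1` is an ordinary table created from the sub-query, and the "navigation" to it is done manually rather than by a planner. The sample flights and the `cnt` column alias are assumptions made for the example.

```python
import sqlite3


def city_pairs_demo():
    """Return the city-pair count computed directly and via the precomputed table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE flights_hdfs (origin TEXT, dest TEXT)")
    conn.executemany(
        "INSERT INTO flights_hdfs VALUES (?, ?)",
        [("SFO", "LAX"), ("LAX", "SFO"), ("SFO", "LAX"),
         ("JFK", "BOS"), ("BOS", "JFK")],
    )

    # Original query from the slide: aggregate the base table every time.
    direct = conn.execute(
        "SELECT count(*)/2 FROM ("
        "  SELECT dest, origin, count(*) FROM flights_hdfs GROUP BY dest, origin"
        ") AS T"
    ).fetchone()[0]

    # Stand-in for the materialized view: store the sub-query result once.
    conn.execute(
        "CREATE TABLE mv1 AS "
        "SELECT dest, origin, count(*) AS cnt FROM flights_hdfs GROUP BY dest, origin"
    )

    # Rewritten query: read the precomputed rows instead of the base table.
    via_mv = conn.execute("SELECT count(*)/2 FROM mv1").fetchone()[0]
    conn.close()
    return direct, via_mv
```

Both queries return the same number; the second one scans one row per distinct city pair instead of one row per flight, which is where the saving comes from on a large fact table.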
  • 31. Hive + Druid: One SQL Interface Across Real-Time and Historical OLAP Cubes. A unified SQL layer over SQL tables, streaming data, and historical data (pre-aggregate, ACID MERGE). Easily ingest event data into OLAP cubes; keep data up-to-date with Hive MERGE; build OLAP cubes from Hive; archive data to Hive for history; run OLAP queries in real time (real-time query) or deep analytics over all history.
  • 32. Information schema. Question: can we find out which table in which database has a column whose name contains "ssn"?
      SELECT columns.table_schema, columns.table_name FROM information_schema.columns WHERE column_name LIKE '%ssn%';
    This is very useful for EDW offload use cases where some queries depend on databases' metadata information.
  • 33. HIVE-1555: JDBC connector – Data federation. How did we build the information_schema? We basically mapped part of the metastore into Hive's table space, using the Hive JDBC connector under the hood. Read-only for now; manual table mapping for now.
  • 34. JDBC table mapping example.
    In Postgres:
      CREATE TABLE HiveTable ( id INT, name varchar );
    In Hive:
      CREATE EXTERNAL TABLE HiveTable ( id INT, name STRING )
      STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
      TBLPROPERTIES (
        "hive.sql.database.type" = "POSTGRES",
        "hive.sql.jdbc.driver" = "org.postgresql.Driver",
        "hive.sql.jdbc.url" = "jdbc:postgresql://hwx-demo-1.field.hortonworks.com:5432/jdbctest",
        "hive.sql.dbcp.username" = "jdbctest",
        "hive.sql.dbcp.password" = "",
        "hive.sql.query" = "SELECT ID, NAME FROM hivetable",
        "hive.sql.column.mapping" = "id=ID, name=NAME",
        "hive.jdbc.update.on.duplicate" = "true"
      );
  • 35. Database key 101. How many types of keys can you name in the context of databases? Primary key; secondary key; unique key; foreign key; composite key; natural key; surrogate key; super key; candidate key. Why do they matter? EDW solutions usually depend on these features; keys allow the database engine to make assumptions and run faster; they are essential for building relational databases.
  • 36. Surrogate key generation. A surrogate key can replace wide, multi-column composite keys. A JOIN on 2 integers is far faster than 2 JOINs on 2 strings.
      SELECT ROW_NUMBER() OVER () AS row_num, * FROM airlines;
      +----------+----------------+--------------------------------+
      | row_num  | airlines.code  | airlines.description           |
      +----------+----------------+--------------------------------+
      | 1        | 02Q            | Titan Airways                  |
      | 2        | 04Q            | Tradewind Aviation             |
      | 3        | 05Q            | Comlux Aviation, AG            |
      | 4        | 06Q            | Master Top Linhas Aereas Ltd.  |
      | 5        | 07Q            | Flair Airlines Ltd.            |
      | 6        | 09Q            | Swift Air, LLC                 |
      | 7        | 0BQ            | DCA                            |
      | 8        | 0CQ            | ACM AIR CHARTER GmbH           |
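The idea on slide 36, replacing a wide composite key with a compact integer, can be sketched in a few lines of Python. The function name `assign_surrogate_keys` and the dictionary-based rows are hypothetical; the slide itself generates the numbers with ROW_NUMBER() in SQL, and this sketch only mirrors the result for the first rows of the sample output.

```python
def assign_surrogate_keys(rows, natural_key):
    """Map each distinct natural key (a tuple of column values) to an integer.

    Keys are handed out in first-seen order, starting at 1, so repeated
    natural keys always resolve to the same surrogate key.
    """
    keys = {}
    for row in rows:
        nk = tuple(row[col] for col in natural_key)
        keys.setdefault(nk, len(keys) + 1)
    return keys


airlines = [
    {"code": "02Q", "description": "Titan Airways"},
    {"code": "04Q", "description": "Tradewind Aviation"},
    {"code": "05Q", "description": "Comlux Aviation, AG"},
]

# Treat (code, description) as the wide composite natural key.
surrogate = assign_surrogate_keys(airlines, ["code", "description"])
```

Fact rows would then carry only the integer from `surrogate`, so joins compare one small number instead of several strings.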
  • 37. NOT NULL and constraints (CONSTRAINT). Essential for data integrity; only works on ACID and append-only tables; hive.constraint.notnull.enforce = true. Example:
      CREATE TABLE Persons ( ID Int NOT NULL, Name String NOT NULL, Age Int );
  • 38. Default value. Ensures a value exists; can be overwritten in INSERT/UPDATE statements; useful in EDW offload cases. Example:
      CREATE TABLE Persons_default ( ID Int NOT NULL, Name String NOT NULL, Age Int, Creator String DEFAULT CURRENT_USER(), CreateDate Date DEFAULT CURRENT_DATE() );
  • 39. ACID v2. Performance is just as good as regular non-ACID tables; a simpler solution, with no more requirement for bucketing. There are two parts to ACID v2: insert-only ACID and full CRUD ACID. Insert-only ACID is supported for all formats: Parquet, Avro, ORC, Text. Enables new optimizations: incremental updates of MVs and the query cache; query cache consistency. Performance may degrade when there are many delta files; run compaction in that case. Cannot be downgraded to ACID v1. Fully compatible with native cloud storage.
  • 40. ACID v2.
      CREATE TABLE hello_acid (key int, value int) PARTITIONED BY (load_date date) CLUSTERED BY(key) INTO 3 BUCKETS STORED AS ORC TBLPROPERTIES ('transactional'='true');
      CREATE TABLE hello_acid_v2 (load_date date, key int, value int);
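Slide 39 notes that performance may drop when many delta files accumulate and that compaction addresses this. The Python sketch below is a deliberately simplified model of that idea, not Hive's on-disk format: a table is a base snapshot plus an ordered list of deltas, every read replays all deltas, and compaction folds them into a new base. The names `read_table` and `compact` and the `("upsert" | "delete", key, value)` tuples are assumptions made for illustration.

```python
def read_table(base, deltas):
    """Merge a base snapshot with its deltas, applied in order.

    base   : dict mapping row key -> row value
    deltas : list of deltas; each delta is a list of (op, key, value)
             tuples where op is "upsert" or "delete"
    """
    rows = dict(base)
    for delta in deltas:
        for op, key, value in delta:
            if op == "delete":
                rows.pop(key, None)
            else:
                rows[key] = value
    return rows


def compact(base, deltas):
    """Fold all deltas into a new base and return it with an empty delta list."""
    return read_table(base, deltas), []


base = {1: "a", 2: "b"}
deltas = [
    [("upsert", 3, "c")],
    [("delete", 1, None), ("upsert", 2, "B")],
]
```

Before compaction, each read walks every delta; after `base, deltas = compact(base, deltas)` the same rows come straight from the base, which is why the cost of a read stops growing with the number of past writes.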
  • 41. HDP3: EDW ingestion pipeline. Kafka-Druid-Hive ingest and Kafka-Hive streaming ingest feed Druid and ACID tables behind the LLAP interface. Real-time analytics: Druid answers in near real time. Easy to use: query any data via LLAP; no need to de-ACID tables; no bucketing required; Calcite talks SQL; materialization just works; cache just works.
  • 42. HDP 3 – New Unified Streaming Ingest Pipeline. Unified ingestion connectors bring streaming data and historical data into ACID tables, materialized views, and real-time rollup, served by Hive LLAP unified SQL to DAS, SuperSet, and JDBC clients. Real-time ingest: read from Kafka; dual write to Hive and Druid. Real-time analytics: Druid answers in near real time; Hive ACID keeps data in sync. Unified API: Calcite talks unified SQL; the optimizer automatically uses pre-computed materializations. Easy-to-use tooling: DAS (manage and optimize); SuperSet (dashboards and reports); JDBC (Tableau, Excel, et al.).
  • 43. HDP – Security & Governance. Ranger: manage access policies and audit logs; policies by classification, prohibition, time, and location; PDP; resource cache. Atlas: track metadata and lineage; the Atlas metastore holds tags, assets, and entities; the Atlas client subscribes to a topic and gets metadata updates. Entities in the data lake: streams, pipelines, feeds, Hive tables, HDFS files, HBase tables. Industry first: dynamic tag-based security policies.
  • 44. Unique Security Features within HDP for SQL Users (Ranger and Hive).
    Row-level security in Hive: control access to rows in Hive tables based on context. Improves the reliability and robustness of HDP by providing row-level security for Hive tables and reducing the surface area of the security system. Restricts data row access based on user characteristics (e.g. group membership) and runtime context. Use cases: a hospital can create a security policy that allows doctors to view data rows only for their own patients; a bank can create a policy to restrict access to rows of financial data based on the employee's business division, locale, or role; a multi-tenant application can create logical separation of each tenant's data so that each tenant can see only its own data rows.
    Dynamic data masking of Hive columns: protect sensitive data in real time with dynamic data masking/obfuscation. Masks or anonymizes sensitive columns of data (e.g. PII, PCI, PHI) in Hive query output. Benefits: sensitive information never leaves the database; no changes are required at the application or Hive layer; no need to produce additional protected duplicate versions of datasets; masking policies are simple and easy to set up.
  • 45. Security: Dynamic Row Filtering & Column Masking. Ranger policy enforcement: the query is rewritten based on dynamic Ranger policies, filtering rows by region and applying the relevant column masking.
    Source table ww_customers:
      Country | National ID | CC No            | DOB       | MRN        | Name          | Policy ID
      US      | 232323233   | 4539067047629850 | 9/12/1969 | 8233054331 | John Doe      | nj23j424
      US      | 333287465   | 5391304868205600 | 8/13/1979 | 3736885376 | Jane Doe      | cadsd984
      Germany | T22000129   | 4532786256545550 | 3/5/1963  | 876452830A | Ernie Schwarz | KK-2345909
    User 1: Joe (Location: US, Group: Analyst). Original query: SELECT country, nationalid, ccnumber, mrn, name FROM ww_customers. Users from the US Analyst group see data for US persons, with CC and National ID (SSN) as masked values and MRN nullified:
      Country | National ID | CC No               | MRN  | Name
      US      | xxxxx3233   | 4539 xxxx xxxx xxxx | null | John Doe
      US      | xxxxx7465   | 5391 xxxx xxxx xxxx | null | Jane Doe
    User 2: Ivanna (Location: EU, Group: HR). Original query: SELECT country, nationalid, name, mrn FROM ww_customers. EU HR policy admins can see unmasked values but are restricted by row filtering policies to data for EU persons only:
      Country | National ID | Name          | MRN
      Germany | T22000129   | Ernie Schwarz | 876452830A
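The two result sets on slide 45 can be reproduced with a short Python sketch of row filtering followed by column masking. This models the effect of the policies only; it is not how Ranger rewrites queries. The helper names, the `REGION` mapping from country to region, and the `MASKS_BY_GROUP` table are assumptions introduced for the example, while the rows and expected outputs come from the slide (DOB and Policy ID are omitted because neither query selects them).

```python
# Sample rows from the slide.
CUSTOMERS = [
    {"country": "US", "nationalid": "232323233",
     "ccnumber": "4539067047629850", "mrn": "8233054331", "name": "John Doe"},
    {"country": "US", "nationalid": "333287465",
     "ccnumber": "5391304868205600", "mrn": "3736885376", "name": "Jane Doe"},
    {"country": "Germany", "nationalid": "T22000129",
     "ccnumber": "4532786256545550", "mrn": "876452830A", "name": "Ernie Schwarz"},
]

# Hypothetical mapping from a row's country to the region used for filtering.
REGION = {"US": "US", "Germany": "EU"}


def show_last4(value):
    """Replace everything except the last four characters with 'x'."""
    return "x" * (len(value) - 4) + value[-4:]


def show_first4(value):
    """Keep the first four characters, mask the rest in groups of four."""
    return value[:4] + " xxxx" * ((len(value) - 4) // 4)


def nullify(value):
    return None


# Column masks per user group; HR sees unmasked values.
MASKS_BY_GROUP = {
    "Analyst": {"nationalid": show_last4, "ccnumber": show_first4, "mrn": nullify},
    "HR": {},
}


def run_query(user, columns):
    """Apply the row filter for the user's location, then the group's masks."""
    masks = MASKS_BY_GROUP[user["group"]]
    result = []
    for row in CUSTOMERS:
        if REGION[row["country"]] != user["location"]:
            continue  # row filter: only rows from the user's own region
        result.append(tuple(
            masks[col](row[col]) if col in masks else row[col]
            for col in columns
        ))
    return result


joe = {"location": "US", "group": "Analyst"}
ivanna = {"location": "EU", "group": "HR"}
```

Calling `run_query(joe, ["country", "nationalid", "ccnumber", "mrn", "name"])` yields the two masked US rows, and `run_query(ivanna, ["country", "nationalid", "name", "mrn"])` yields the single unmasked German row, matching the tables on the slide.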
  • 46. 46 © Hortonworks Inc. 2011–2018. All rights reserved Data analytics studio data plane servicesの一部として、 クラウドでもOn-Premでも利用可能 2018年9月GAになりました Hortonworks Data Analytics Studio HORTONWORKS DATAPLANE SERVICE DATA SOURCE INTEGRATION DATA SERVICES CATALOG …DATA LIFECYCLE MANAGER DATA STEWARD STUDIO +OTHER (partner) SECURITY CONTROLS CORE CAPABILITIES MULTIPLE CLUSTERS AND SOURCES MULTIHYBRID *not yet available, coming soon EXTENSIBLE SERVICES IBM DSX* DATA ANALYTICS STUDIO
  • 47. © Hortonworks Inc. 2011- 2017. All rights reserved | 47 Why is my query slow? Noisy neighbors Poor schema Inefficient queries Unstable demand Expensive Query log Storage Optimizations Query Optimizations Demand Shifting Hortonworks Data Analytics Studio Optimize Your Hive Workloads Part of the Hortonworks DataPlane Services
  • 48. Data Analytics Studio (DAS)
      Solutions: full-featured auto-complete, direct download of results, quick data preview, and many other quality-of-life improvements.
  • 49. Data Analytics Studio (DAS)
      Solutions: Data Analytics Studio provides a database heatmap that lets you quickly discover which parts of your cluster are being utilized the most.
  • 50. Data Analytics Studio (DAS)
      Solutions: pre-defined searches to quickly narrow down problematic queries in a large cluster.
  • 51. Data Analytics Studio (DAS)
      Solutions: heuristic recommendation engine; fully self-service query and storage optimization.
  • 52. Data Analytics Studio (DAS)
      Solutions: built-in batch operations; no more scripting needed for day-to-day operations.
  • 53. Summary
  • 54. Hive 3 - Scalable Data Warehousing on Hadoop
      Capabilities and applications:
        Capability | Applications
        Batch SQL | ETL; Reporting; Data Mining; Deep Analytics
        OLAP / Cube | Multidimensional Analytics; MDX Tools; Excel
        Interactive SQL | Reporting; BI Tools: Tableau, Microstrategy, Cognos
        Sub-Second SQL (Hive LLAP) | Ad-Hoc; Drill-Down; BI Tools: Tableau, Excel
        ACID / MERGE | Continuous Ingestion from Operational DBMS; Slowly Changing Dimensions
      Core Platform:
      • Scale-Out Storage
      • Petabyte Scale Processing
      Core SQL Engine:
      • Apache Tez: Scalable Distributed Processing
      • Advanced Cost-Based Optimizer
      • Connectivity
      • Advanced Security
      • JDBC / ODBC
      • Comprehensive SQL:2011 Coverage
  • 55. Summary
      • Our vision of the future EDW is a unified, open source data access layer that works across technologies and in a hybrid model
      • Druid, Kafka and Hive integration enables real-time analytics on event streams
      • Offloading is still the primary use case, but Hive is becoming a full-featured database
      • ACID on by default enables data change at scale, which is key to supporting GDPR
      • Usability and visibility improve with the release of Data Analytics Studio (DAS)
  • 56. Questions?
  • 57. Thank you