An Apache Hive Based Data Warehouse

Scalable Data Warehousing on Hadoop
Alan F. Gates, Co-founder, Hortonworks

© Hortonworks Inc. 2011 – 2016. All Rights Reserved
What Do You Expect in a Hadoop Data Warehouse?
Benchmarks focus on two questions:
– How much of the TPC-DS query set can it run?
– How fast can it run it?

What Do You Expect in a Data Warehouse?
High Performance
SQL 2011
High Storage Capacity
Security
Support for BI, Cubes, Data Science
Monitoring & Management
Governance
Data Lifecycle Management
Replication & D/R
Workload Management
Data Ingestion

So, back to TPC-DS...
High Performance
SQL 2011

Apache Hive Overview
Apache Hive is a SQL data warehouse engine that
delivers fast, scalable SQL processing on Hadoop and
in the Cloud.
Features:
• Extensive SQL:2011 Support
• ACID Transactions
• In-Memory Caching
• Cost-Based Optimizer
• User-Based Dynamic Security
• JDBC and ODBC Support
• Compatible with every major BI Tool
• Proven at 300+ PB Scale

Apache Hive: Fast Facts
• Most Queries Per Hour: 100,000 Queries Per Hour (Yahoo Japan)
• Analytics Performance: 100 Million rows/s Per Node (with Hive LLAP)
• Largest Hive Warehouse: 300+ PB Raw Storage (Facebook)
• Largest Cluster: 4,500+ Nodes (Yahoo)

Apache Hive: Journey to SQL:2011 Analytics
Data Types
• Numeric: FLOAT, DOUBLE; DECIMAL; INT, TINYINT, SMALLINT, BIGINT; BOOLEAN
• String: CHAR, VARCHAR; BLOB (BINARY), CLOB (String)
• Date, Time: DATE, TIMESTAMP, Interval Types
• Complex Types: ARRAY / MAP / STRUCT / UNION
• Nested Data Analytics: Nested Data Traversal; Lateral Views
• Procedural Extensions: HPL/SQL
SQL Features
• Core SQL Features: Date, Time and Arithmetical Functions; INNER, OUTER, CROSS and SEMI Joins; Derived Table Subqueries; Correlated + Uncorrelated Subqueries; UNION ALL; UDFs, UDAFs, UDTFs; Common Table Expressions; UNION DISTINCT
• Advanced Analytics: OLAP and Windowing Functions; OLAP: Partition, Order by UDAF; CUBE and Grouping Sets
• ACID Transactions: INSERT / UPDATE / DELETE
• Constraints: Primary / Foreign Key (Non Validated)
File Formats
• Columnar: ORCFile; Parquet
• Text: CSV; Logfile
• Nested / Complex: Avro; JSON; XML
• Custom Formats
• Other Features: XPath Analytics
Futures
• ACID MERGE
• Multi Subquery
• Scalar Subqueries
• Non-Equijoins
• INTERSECT / EXCEPT
• Recursive CTEs
• NOT NULL Constraints
• Default Values
• Multi-statement Transactions
Legend: New; Future work
Hive 2
Track Hive SQL:2011 Complete: HIVE-13554

Hive 2 with LLAP: Architecture Overview
[Architecture diagram labels]
• ODBC / JDBC: SQL Queries
• HiveServer2 (Query Endpoint)
• Query Coordinators (three Coordinators shown)
• YARN Cluster: LLAP Daemons (four shown), each with Query Executors; In-Memory Cache (Shared Across All Users)
• Deep Storage: HDFS and Compatible, S3, WASB, Isilon

Hive 2 with LLAP: 25+x Performance Boost: Interactive / 1TB Scale
[Chart: Query Time (s) (Lower is Better) for Hive 1 / Tez and Hive 2 / LLAP, with Speedup (x Factor)]
Hive 2 with LLAP averages 26x faster than Hive 1

Apache Hive vs. Apache Impala at 10TB
Highlights
• 10TB scale on 10 identical AWS nodes.
• Hive and Impala showed similar times on most smaller queries.
• Hive scaled better, with many queries completing in <2m where Impala ran to timeout (3000s).

Apache Hive vs. Presto on a partitioned 1TB dataset
Highlights
• Presto lacks basic performance optimizations like dynamic partition pruning.
• On a real dataset / workload Presto performs poorly without full re-writes.
• Example: Query 55 without re-writes = 185.17s, with re-writes = 16s. LLAP = 1.37s.
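The Query 55 numbers quoted for Presto and LLAP can be turned into relative factors with a few lines of arithmetic. This is only a sketch over the three timings from the slide; the dictionary keys and the `speedup` helper are names made up for illustration.

```python
# Query 55 timings in seconds, copied from the slide.
timings_s = {
    "presto_without_rewrites": 185.17,
    "presto_with_rewrites": 16.0,
    "hive_llap": 1.37,
}


def speedup(slow_s: float, fast_s: float) -> float:
    """Return how many times longer slow_s is than fast_s."""
    return slow_s / fast_s


llap_s = timings_s["hive_llap"]
# Ratio of each Presto timing to the LLAP timing.
ratios = {
    name: speedup(t, llap_s)
    for name, t in timings_s.items()
    if name != "hive_llap"
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x the LLAP runtime")
```

On these three numbers the unmodified Presto query takes roughly 135 times as long as LLAP, and the rewritten one roughly 12 times as long. This is a single query, so it says nothing about the workload average.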

Hive LLAP: Stable Performance under High Concurrency
4x Queries, 2.8x Runtime Difference
5x Queries, 4.6x Runtime Difference
Mark
Concurrent Queries   Average Runtime
5                    7.76s
25                   36.24s
100                  102.89s
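The "queries vs. runtime" factors on this slide follow directly from the three measurements. A small sketch of that arithmetic, where the list layout and the `scaling_steps` helper are illustrative names rather than anything from the benchmark harness:

```python
# (concurrent queries, average runtime in seconds), copied from the slide.
measurements = [(5, 7.76), (25, 36.24), (100, 102.89)]


def scaling_steps(points):
    """For each consecutive pair, return (load factor, runtime factor)."""
    steps = []
    for (q0, t0), (q1, t1) in zip(points, points[1:]):
        steps.append((q1 / q0, t1 / t0))
    return steps


steps = scaling_steps(measurements)
for load_x, runtime_x in steps:
    print(f"{load_x:.0f}x queries -> {runtime_x:.2f}x runtime")
```

Going from 5 to 25 concurrent queries gives about 4.67x the runtime (shown as 4.6x on the slide), and going from 25 to 100 gives about 2.84x. In both steps the runtime grows more slowly than the load.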

How Much Can it Hold, and Where?
High Storage Capacity

Storage
• Of course HDFS, default in the Hadoop world
• More and more cloud
• Move is copy in S3, but current implementation assumes move is atomic and nearly free
– modifying Hadoop (HADOOP-11694) and Hive (HIVE-14535)
• ACID in the cloud
– Compactor moves a lot of files around, need to optimize
– Need to figure out how streaming ingest works in the cloud
• LLAP, caching much more valuable in the cloud
– Looking at flushing cache to SSD so misses are less costly
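To make the "move is copy" point concrete, here is a toy in-memory model of a directory rename on an object store. It is not an S3 client and not how Hadoop or Hive implement it; `ToyObjectStore`, `move_prefix` and the key names are hypothetical. It only shows that the cost grows with the bytes moved and that the operation happens object by object.

```python
class ToyObjectStore:
    """In-memory stand-in for an object store (hypothetical, not a real client)."""

    def __init__(self):
        self.objects = {}      # key -> bytes
        self.bytes_copied = 0  # total bytes rewritten by "moves"

    def put(self, key, data):
        self.objects[key] = data

    def move_prefix(self, src_prefix, dst_prefix):
        """Emulate a directory rename: copy each object, then delete the original."""
        keys = sorted(k for k in self.objects if k.startswith(src_prefix))
        for key in keys:
            data = self.objects[key]
            # Copy to the new key. A reader looking at the store at this point
            # would see a partially moved directory.
            self.objects[dst_prefix + key[len(src_prefix):]] = data
            self.bytes_copied += len(data)
            del self.objects[key]
        return len(keys)


store = ToyObjectStore()
store.put("staging/query1/part-0", b"a" * 1000)
store.put("staging/query1/part-1", b"b" * 2000)

moved = store.move_prefix("staging/query1/", "warehouse/sales/")
print(moved, store.bytes_copied, sorted(store.objects))
```

On HDFS a rename is a metadata operation, so a job that writes to a staging directory and then renames it pays almost nothing for the final step. In the toy model the same step rewrites all 3,000 bytes and is visible half-done.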

Is My Data Safe?
Security

Security today in Hadoop
• Authentication (Who am I/prove it?): Kerberos; API security with Apache Knox
• Authorization (What can I do?): Fine grain access control with Apache Ranger
• Audit (What did I do?): Centralized audit reporting w/ Apache Ranger
• Data Protection (Can data be encrypted at rest and over the wire?): Wire encryption; HDFS encryption + Ranger KMS
Centralized Security Administration w/ Ranger & Knox

Authentication—API Security with Knox
• Eliminates SSH “edge node”
• Central API management
• Central audit control
• Service level authorization
• SSO - SAMLv2, Siteminder and OAM
• LDAP and AD integration
• SSO for Hadoop UIs (Ranger, Ambari..)
Apache Knox extends the reach of Hadoop REST API without Kerberos complexities
Integrated with existing IdM systems
Single, simple point of access for a cluster
Centralized and consistent secure API across one or more clusters
• Kerberos Encapsulation
• Single Hadoop access point
• REST API hierarchy
• Consolidated API calls
• Multi-cluster support

Apache Ranger: Per-User Row Filtering by Region in Hive
LLAP Data Access
User ID   Region   Total Spend
1         East     5,131
2         East     27,828
3         West     55,493
4         West     7,193
5         East     18,193
Original Query:
SELECT * from CUSTOMERS
WHERE total_spend > 10000
Query Rewrites based on Dynamic Ranger Policies
User 2 (East Region), Dynamic Rewrite:
AND region = 'east'
User 1 (West Region), Dynamic Rewrite:
AND region = 'west'
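The rewrite on this slide can be mimicked in a few lines. The sketch below is a naive string-level illustration, not Ranger's implementation or API: `ROW_FILTERS`, `rewrite` and `visible_rows` are hypothetical names, the policy table is hard-coded, and appending `AND ...` is only safe here because the example query has a single predicate with no `OR`.

```python
import re

# Rows from the slide's table.
CUSTOMERS = [
    {"user_id": 1, "region": "east", "total_spend": 5131},
    {"user_id": 2, "region": "east", "total_spend": 27828},
    {"user_id": 3, "region": "west", "total_spend": 55493},
    {"user_id": 4, "region": "west", "total_spend": 7193},
    {"user_id": 5, "region": "east", "total_spend": 18193},
]

# Hypothetical per-user row filters standing in for Ranger policies.
ROW_FILTERS = {
    "user1": ("region", "west"),
    "user2": ("region", "east"),
}


def rewrite(query, user):
    """Append the user's row-filter predicate to the query text."""
    column, value = ROW_FILTERS[user]
    has_where = re.search(r"\bwhere\b", query, re.IGNORECASE) is not None
    joiner = " AND " if has_where else " WHERE "
    return f"{query}{joiner}{column} = '{value}'"


def visible_rows(user, min_spend):
    """Rows the user would get back for total_spend > min_spend."""
    column, value = ROW_FILTERS[user]
    return [
        row for row in CUSTOMERS
        if row["total_spend"] > min_spend and row[column] == value
    ]


original = "SELECT * from CUSTOMERS WHERE total_spend > 10000"
print(rewrite(original, "user2"))
print([row["user_id"] for row in visible_rows("user2", 10000)])
print([row["user_id"] for row in visible_rows("user1", 10000)])
```

Both users submit the same query text; the East-region user gets customers 2 and 5, the West-region user gets customer 3. A production system would rewrite the parsed query plan rather than the SQL string.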

Apache Ranger: Dynamic Data Masking of Hive Columns
Protect Sensitive Data in real-time with Dynamic Data Masking/Obfuscation!
Goal: Mask or anonymize sensitive columns of data (e.g. PII, PCI, PHI) from Hive query output
⬢ Benefits
– Sensitive information never leaves database
– No changes are required at the application or Hive layer
– No need to produce additional protected duplicate versions of datasets
– Simple & easy to set up masking policies
⬢ Core Technologies: Ranger, Hive
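As a rough illustration of masking applied at query-output time, the sketch below transforms sensitive columns while the rows are being returned and leaves the stored rows untouched. The column names, the sample row, and the `MASK_POLICY` mapping are assumptions for the example; this is not Ranger's policy format or API.

```python
import hashlib


def mask_show_last4(value):
    """Replace everything except the last four characters with 'x'."""
    keep = value[-4:] if len(value) > 4 else ""
    return "x" * (len(value) - len(keep)) + keep


def mask_hash(value):
    """Replace the value with its SHA-256 hex digest."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Hypothetical masking policy: column name -> masking function.
MASK_POLICY = {"ssn": mask_show_last4, "email": mask_hash}


def masked_output(rows):
    """Return masked copies of the rows; the input rows are not modified."""
    return [
        {col: MASK_POLICY.get(col, lambda v: v)(val) for col, val in row.items()}
        for row in rows
    ]


stored = [{"name": "Ann", "ssn": "123-45-6789", "email": "ann@example.com"}]
result = masked_output(stored)
print(result[0]["ssn"])
print(stored[0]["ssn"])
```

Because the mask is applied on the way out, there is no second, protected copy of the dataset to maintain, which is the benefit the slide lists.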

Dynamic Tag-based Access Policies with Apache Atlas
• Basic Tag policy – PII example. Access and entitlements must be tag-based ABAC and scalable in implementation.
• Geo-based policy – Policy based on IP address; proxy IP substitution may be required. The rule enforcement must be geo-aware.
• Time-based policy – Timer for data access, decoupled from deletion of data.
• Prohibitions – Prevention of combination of Hive tables that may pose a risk together.
Key Benefits:
• New scalable metadata-based security paradigm
• Dynamic, real-time policy
• Active protection – fast updates to changes
• Centralized and simple to manage policy
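A tag-based check with a time limit can be sketched as follows. Everything here is hypothetical: the tag assignments stand in for what a metadata catalog would hold, the policy table stands in for centrally managed policies, and the group names and expiry date are invented for the example.

```python
from datetime import date

# Hypothetical tag assignments (column -> set of tags).
COLUMN_TAGS = {
    "customers.ssn": {"PII"},
    "customers.region": set(),
}

# Hypothetical tag policies: which groups may read a tag, and until when.
TAG_POLICIES = {
    "PII": {"allowed_groups": {"compliance"}, "expires": date(2017, 1, 1)},
}


def can_read(column, groups, today):
    """Allow access only if every tagged policy on the column permits it."""
    for tag in COLUMN_TAGS.get(column, set()):
        policy = TAG_POLICIES.get(tag)
        if policy is None:
            continue  # tag without a policy imposes no restriction here
        if today >= policy["expires"]:
            return False  # time-based policy: the access window has closed
        if not (groups & policy["allowed_groups"]):
            return False  # no group of the user is entitled to this tag
    return True


print(can_read("customers.ssn", {"compliance"}, date(2016, 6, 1)))
print(can_read("customers.ssn", {"analysts"}, date(2016, 6, 1)))
print(can_read("customers.ssn", {"compliance"}, date(2017, 6, 1)))
```

The policy is attached to the tag, not to the table or column, so tagging a new column as PII brings it under the same rule without writing a new policy. The expiry check also shows how access can end while the data itself stays in place.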

What’s There and Where Did It Come From?
Governance

Apache Atlas: Cross-Component Dataset Lineage
[Lineage diagram labels: Sqoop, Teradata Connector, Apache Kafka, Custom Activity Reporter, Metadata Repository, RDBMS]
• Any process using Sqoop is covered
• No other tool tracks IOT out of the box

Apache Atlas Enables Business Catalog for Ease of Use
• Organize data assets along business terms
– Authoritative: Hierarchical Taxonomy Creation
– Agile modeling: Model Conceptual, Logical, Physical assets
– Definition and assignment of tags like PII (Personally Identifiable Information)
• Comprehensive features for compliance
– Multiple user profiles including Data Steward and Business Analysts
– Object auditing to track “Who did it”
– Metadata Versioning to track “what did they do”
• Faster Insight:
– Data Quality tab for profiling and sampling
– User Comments
Key Benefits:
• Organize data assets along business terms
• Compliance Features
• Faster Insight

How Will My Users Interact With It?
Support for BI, Cubes, Data Science

Druid: Deep Multidimensional Analytics
[Diagram labels]
• Sources: Events, Logs, Transactions, Sensors
• Streaming Sources: Storm, Kafka, Spark
• Historical Sources: HDFS, S3 (Scalably Ingest Historical Data from Transactional and Web Systems)
• Druid Data Cubes: Ultra-Fast Analytics, Slice-and-Dice; Deep, Fast Drilldown Across Any Dimension
• Real-Time Analytics: Hive / Spark, BI Tools, REST API, Superset UI
• Legend: = Future

Druid’s Role in Scalable Data Warehousing
[Diagram labels]
• UI: Superset UI (Fast Exploration); SQL BI Tools; MDX Tools
• Unified SQL and MDX Layer: HiveServer2 (Hive SQL); Thrift Server (SparkSQL); MDX (Fast SQL MDX)
• Core Platform: Hive; Druid (OLAP Indexes); Realtime Feeds (Kafka, Storm, etc.); S3 or HDFS
• Management: Ranger, Atlas, Ambari

Analytics at Scale with No Data Movement
• Source Data Systems
• Syncsort: High-Performance Data Movement. High performance data import from all major EDW platforms
• Hadoop: Scalable Storage and Compute
• Hive LLAP: High Performance SQL. Fast, scalable SQL analytics; intelligent in-memory caching
• AtScale Intelligence Platform: OLAP Cubes for Higher Performance. Define OLAP cubes for 10x faster queries; unified semantic layer for all BI tools
• Pre-aggregated data ... Or, full-fidelity data

Spark Column Security with LLAP
• Fine-Grained Column Level Access Control for SparkSQL.
• Fully dynamic policies per user. Doesn’t require views.
• Use Standard Ranger policies and tools to control access and masking policies.
Flow:
1. SparkSQL gets data locations known as “splits” from HiveServer2 and plans query.
2. HiveServer2 authorizes access using Ranger. Per-user policies like row filtering are applied.
3. Spark gets a modified query plan based on dynamic security policy.
4. Spark reads data from LLAP. Filtering / masking guaranteed by LLAP server.
[Diagram labels, annotated with flow steps 1 to 4]
• Spark Client
• HiveServer2: Authorization
• Hive Metastore: Data Locations, View Definitions
• Ranger Server: Dynamic Policies
• LLAP: Data Read, Filter Pushdown

Apache Zeppelin, Attaches to Hive and Spark

But Wait, There’s More
Monitoring & Management
Data Lifecycle Management
Replication & D/R
Data Ingestion

Scalable Data Warehousing on Hadoop
Capabilities and Applications
• Batch SQL: ETL, Reporting, Data Mining, Deep Analytics
• OLAP / Cube: Multidimensional Analytics, MDX Tools, Excel
• Interactive SQL: Reporting, BI Tools: Tableau, Microstrategy, Cognos
• Sub-Second SQL: Ad-Hoc, Drill-Down, BI Tools: Tableau, Excel
• ACID / MERGE: Continuous Ingestion from Operational DBMS, Slowly Changing Dimensions
Legend: Existing, Development, Emerging
Core Platform
• Scale-Out Storage
• Petabyte Scale Processing
• Core SQL Engine
• Apache Tez: Scalable Distributed Processing
• Advanced Cost-Based Optimizer
• Connectivity
• Advanced Security
• JDBC / ODBC
• Comprehensive SQL:2011 Coverage
• MDX

For More Details
• Today
– LLAP: Building Cloud First BI – 5:50pm
• Wednesday
– An Overview of Optimization in Apache Hive: Past, Present, Future – 5:00pm
• Thursday
– Transactional SQL in Apache Hive – 3:00pm
