Scalable Data Warehousing on
Hadoop
Alan F. Gates, Co-founder, Hortonworks
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
What Do You Expect in a Hadoop Data Warehouse?
Benchmarks focus on two questions:
– How much of the TPC-DS query set can it run?
– How fast can it run it?
3
What Do You Expect in a Data Warehouse?
High Performance
SQL 2011
High Storage Capacity
Security
Support for BI,
Cubes, Data Science
Monitoring & Management
Governance
Data Lifecycle Management
Replication & D/R
Workload Management
Data Ingestion
4
So, back to TPC-DS...
High Performance
SQL 2011
5
Apache Hive Overview
Apache Hive is a SQL data warehouse engine that
delivers fast, scalable SQL processing on Hadoop and
in the Cloud.
Features:
• Extensive SQL:2011 Support
• ACID Transactions
• In-Memory Caching
• Cost-Based Optimizer
• User-Based Dynamic Security
• JDBC and ODBC Support
• Compatible with every major BI Tool
• Proven at 300+ PB Scale
6
Apache Hive: Fast Facts
• Most Queries Per Hour: 100,000 queries per hour (Yahoo Japan)
• Analytics Performance: 100 million rows/s per node (with Hive LLAP)
• Largest Hive Warehouse: 300+ PB raw storage (Facebook)
• Largest Cluster: 4,500+ nodes (Yahoo)
7
Apache Hive: Journey to SQL:2011 Analytics

Data Types
• Numeric: FLOAT, DOUBLE; DECIMAL; INT, TINYINT, SMALLINT, BIGINT; BOOLEAN
• String: CHAR, VARCHAR; BLOB (BINARY), CLOB (String)
• Date, Time: DATE, TIMESTAMP, Interval Types
• Complex Types: ARRAY / MAP / STRUCT / UNION; Nested Data; Nested Data Traversal; Lateral Views
• Procedural Extensions: HPL/SQL

SQL Features
• Core SQL Features: Date, Time and Arithmetical Functions; INNER, OUTER, CROSS and SEMI Joins; Derived Table Subqueries; Correlated + Uncorrelated Subqueries; UNION ALL; UNION DISTINCT; UDFs, UDAFs, UDTFs; Common Table Expressions
• Advanced Analytics: OLAP and Windowing Functions (Partition, Order By, UDAF); CUBE and Grouping Sets; ACID Transactions: INSERT / UPDATE / DELETE; Constraints: Primary / Foreign Key (Non Validated)

File Formats
• Columnar: ORCFile, Parquet
• Text: Text, CSV, Logfile
• Nested / Complex: Avro, JSON, XML, Custom Formats

Other Features
• XPath Analytics

Futures (Hive 2.next)
• ACID MERGE; Multi Subquery; Scalar Subqueries; Non-Equijoins; INTERSECT / EXCEPT; Recursive CTEs; NOT NULL Constraints; Default Values; Multi-statement Transactions

Track SQL:2011 completeness: HIVE-13554
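A taste of the SQL:2011 analytics listed above, sketched in HiveQL. The `sales` table and its columns are invented for illustration; CTEs, windowing functions, and grouping sets are all part of the supported feature set.

```sql
-- Hypothetical schema: sales(region STRING, store STRING, amount DECIMAL(10,2))
WITH store_totals AS (          -- common table expression
  SELECT region, store, SUM(amount) AS total
  FROM sales
  GROUP BY region, store
)
SELECT region, store, total,
       RANK() OVER (PARTITION BY region ORDER BY total DESC) AS rank_in_region
FROM store_totals;

-- Grouping sets: subtotals by (region, store), by region, and overall in one pass
SELECT region, store, SUM(amount)
FROM sales
GROUP BY region, store
GROUPING SETS ((region, store), (region), ());
```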
8
Hive 2 with LLAP: Architecture Overview

(Diagram: SQL queries arrive over ODBC / JDBC at HiveServer2, the query endpoint. Query coordinators hand work to a pool of LLAP daemons running query executors inside a YARN cluster, backed by an in-memory cache shared across all users. Deep storage is HDFS and compatible systems: S3, WASB, Isilon.)
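As a sketch, a Hive session can be pointed at the LLAP daemons with a couple of settings. The property names are from Hive 2.x; exact values vary by distribution and should be treated as assumptions here.

```sql
-- Illustrative session settings for running queries in LLAP (Hive 2.x era).
SET hive.execution.engine=tez;       -- LLAP executes on top of Tez
SET hive.execution.mode=llap;        -- run in LLAP daemons rather than containers
SET hive.llap.execution.mode=all;    -- push all eligible work into LLAP
```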
9
Hive 2 with LLAP: 25+x Performance Boost: Interactive / 1TB Scale

(Chart: per-query times for Hive 1 / Tez vs. Hive 2 / LLAP, 0–250s on the query-time axis — lower is better — with the speedup factor on a secondary 0–50x axis.) Hive 2 with LLAP averages 26x faster than Hive 1.
10
Apache Hive vs. Apache Impala at 10TB

Highlights:
• 10TB scale on 10 identical AWS nodes.
• Hive and Impala showed similar times on most smaller queries.
• Hive scaled better, with many queries completing in under 2 minutes where Impala ran to timeout (3,000s).
11
Apache Hive vs. Presto on a Partitioned 1TB Dataset

Highlights:
• Presto lacks basic performance optimizations like dynamic partition pruning.
• On a real dataset and workload, Presto performs poorly without full query re-writes.
• Example: Query 55 takes 185.17s without re-writes and 16s with re-writes; Hive LLAP runs it in 1.37s.
12
Hive LLAP: Stable Performance under High Concurrency

Concurrent Queries | Average Runtime
                 5 |   7.76s
                25 |  36.24s
               100 | 102.89s

Going from 5 to 25 concurrent queries (5x the load) increases average runtime only 4.6x; from 25 to 100 (4x the load), only 2.8x.
13
How Much Can it Hold, and Where?
High Storage Capacity
14
Storage
• Of course HDFS, the default in the Hadoop world
• More and more cloud
• A move is a copy in S3, but the current implementation assumes a move is atomic and nearly free
– being addressed in Hadoop (HADOOP-11694) and Hive (HIVE-14535)
• ACID in the cloud
– The compactor moves a lot of files around; this needs optimizing
– Need to figure out how streaming ingest works in the cloud
• LLAP: caching is much more valuable in the cloud
– Looking at flushing the cache to SSD so misses are less costly
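The ACID support mentioned above can be sketched in HiveQL. The table and column names are illustrative; transactional tables in this era require ORC, bucketing, and the transaction manager enabled on the cluster.

```sql
-- Illustrative ACID table (assumes Hive's transaction manager is enabled).
CREATE TABLE customer_dim (
  id     INT,
  name   STRING,
  region STRING
)
CLUSTERED BY (id) INTO 8 BUCKETS   -- bucketed ORC is required for ACID here
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- Row-level mutations become possible on transactional tables:
UPDATE customer_dim SET region = 'west' WHERE id = 42;
DELETE FROM customer_dim WHERE id = 17;
```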
15
Is My Data Safe?
Security
16
Security Today in Hadoop

Authentication – Who am I / prove it?
• Kerberos
• API security with Apache Knox

Authorization – What can I do?
• Fine-grained access control with Apache Ranger

Audit – What did I do?
• Centralized audit reporting with Apache Ranger

Data Protection – Can data be encrypted at rest and over the wire?
• Wire encryption
• HDFS encryption + Ranger KMS

Centralized security administration with Ranger & Knox
17
Authentication: API Security with Knox

Apache Knox extends the reach of Hadoop's REST APIs without Kerberos complexities.

Single, simple point of access for a cluster:
• Kerberos encapsulation
• Single Hadoop access point
• REST API hierarchy
• Consolidated API calls
• Multi-cluster support

Centralized and consistent secure API across one or more clusters:
• Eliminates the SSH “edge node”
• Central API management
• Central audit control
• Service-level authorization

Integrated with existing IdM systems:
• SSO – SAMLv2, Siteminder and OAM
• LDAP and AD integration
• SSO for Hadoop UIs (Ranger, Ambari, ...)
18
Apache Ranger: Per-User Row Filtering by Region in Hive

LLAP data (CUSTOMERS):
User ID | Region | Total Spend
      1 | East   |       5,131
      2 | East   |      27,828
      3 | West   |      55,493
      4 | West   |       7,193
      5 | East   |      18,193

Original query (any user):
SELECT * FROM CUSTOMERS
WHERE total_spend > 10000

Ranger rewrites the query dynamically, based on per-user row-filter policies.

User 2 (East Region) effectively runs:
SELECT * FROM CUSTOMERS
WHERE total_spend > 10000
AND region = 'east'

User 1 (West Region) effectively runs:
SELECT * FROM CUSTOMERS
WHERE total_spend > 10000
AND region = 'west'
19
Apache Ranger: Dynamic Data Masking of Hive Columns

Protect sensitive data in real time with dynamic data masking / obfuscation.

Goal: mask or anonymize sensitive columns of data (e.g. PII, PCI, PHI) in Hive query output.

⬢ Benefits
– Sensitive information never leaves the database
– No changes are required at the application or Hive layer
– No need to produce additional protected duplicate versions of datasets
– Masking policies are simple and easy to set up

⬢ Core Technologies: Ranger, Hive
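Ranger applies masks through policies rather than query changes, but the effect can be previewed with Hive's built-in masking UDFs (shipped from Hive 2.1). The column names here are illustrative.

```sql
-- Hive 2.1+ masking UDFs, the same primitives Ranger's masking policies use.
SELECT
  mask(ssn),                      -- replace letters and digits with mask chars
  mask_show_last_n(card_no, 4),   -- keep only the last 4 characters visible
  mask_hash(email)                -- one-way hash of the value
FROM customers;
```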
20
Dynamic Tag-Based Access Policies with Apache Atlas

• Basic tag policy – e.g. PII. Access and entitlements must be tag-based (ABAC) and scalable in implementation.
• Geo-based policy – Policy based on IP address; proxy IP substitution may be required. Rule enforcement must be geo-aware.
• Time-based policy – A timer on data access, de-coupled from deletion of the data.
• Prohibitions – Preventing combinations of Hive tables that pose a risk together.

Key Benefits:
– A new, scalable, metadata-based security paradigm
– Dynamic, real-time policy
– Active protection – fast updates to changes
– Centralized, simple-to-manage policy
21
What’s There and Where Did It Come From?
Governance
22
Apache Atlas: Cross-Component Dataset Lineage

(Diagram: Sqoop with the Teradata Connector, Apache Kafka, and a custom activity reporter all feed lineage from RDBMS and other sources into the Atlas metadata repository.)

• Any process using Sqoop is covered.
• No other tool tracks IoT data out of the box.
23
Apache Atlas Enables Business Catalog for Ease of Use
 Organize data assets along business terms
– Authoritative: Hierarchical Taxonomy Creation
– Agile modeling: Model Conceptual, Logical, Physical assets
– Definition and assignment of tags like PII (Personally
Identifiable Information)
 Comprehensive features for compliance
– Multiple user profiles including Data Steward and Business
Analysts
– Object auditing to track “who did it”
– Metadata versioning to track “what did they do”
 Faster Insight:
– Data Quality tab for profiling and sampling
– User Comments
Key Benefits:
Organize data assets along
business terms
Compliance Features
Faster Insight
24
How Will My Users Interact With It?
Support for BI,
Cubes, Data Science
25
Druid: Deep Multidimensional Analytics

(Diagram: streaming sources – events, logs, transactions, sensors – flow in through Kafka, Storm, and Spark for real-time analytics; historical sources on HDFS and S3 are ingested scalably from transactional and web systems. Druid data cubes then serve ultra-fast slice-and-dice analytics – deep, fast drilldown across any dimension – to Hive / Spark, BI tools, a REST API, and the Superset UI. Some integrations are marked as future work.)
26
Druid’s Role in Scalable Data Warehousing

(Diagram: a unified SQL and MDX layer. SQL BI tools reach HiveServer2 for Hive SQL and a Thrift server for fast SparkSQL; MDX tools reach an MDX layer on HiveServer2; the Superset UI provides fast exploration. Druid’s OLAP indexes are fed by realtime feeds (Kafka, Storm, etc.). The core platform is Hive over S3 or HDFS, with Ranger, Atlas and Ambari for management.)
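Hive can expose a Druid datasource as a table through the Druid storage handler that ships with the Hive–Druid integration. The datasource name below is illustrative.

```sql
-- Query an existing Druid datasource from Hive (Hive–Druid integration).
CREATE EXTERNAL TABLE druid_pageviews
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ('druid.datasource' = 'pageviews');

SELECT * FROM druid_pageviews LIMIT 10;
```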
27
Analytics at Scale with No Data Movement

• Syncsort – high-performance data movement: high-performance data import from all major EDW platforms, delivering either pre-aggregated or full-fidelity data.
• Hadoop – scalable storage and compute.
• Hive LLAP – high-performance SQL: fast, scalable SQL analytics with intelligent in-memory caching.
• AtScale Intelligence Platform – OLAP cubes for higher performance: define OLAP cubes for 10x faster queries; a unified semantic layer for all BI tools.
28
Spark Column Security with LLAP
• Fine-grained column-level access control for SparkSQL.
• Fully dynamic per-user policies; doesn’t require views.
• Uses standard Ranger policies and tools to control access and masking.
Flow:
1. SparkSQL gets data locations
known as “splits” from HiveServer
and plans query.
2. HiveServer2 authorizes access
using Ranger. Per-user policies
like row filtering are applied.
3. Spark gets a modified query plan
based on dynamic security policy.
4. Spark reads data from LLAP.
Filtering / masking guaranteed by
LLAP server.
(Diagram: the Spark client talks to HiveServer2 for authorization; HiveServer2 consults the Ranger server for dynamic policies and the Hive metastore for data locations and view definitions; Spark then reads data from LLAP, which enforces filtering via filter pushdown.)
29
Apache Zeppelin Attaches to Hive and Spark
30
But Wait, There’s More
Monitoring & Management
Data Lifecycle Management
Replication & D/R
Data Ingestion
31
Scalable Data Warehousing on Hadoop

Capabilities: Batch SQL | Interactive SQL | Sub-Second SQL | OLAP / Cube | ACID / MERGE

Applications by capability:
• Batch SQL: ETL; Reporting; Data Mining; Deep Analytics
• Interactive / Sub-Second SQL: Ad-Hoc; Drill-Down; BI Tools: Tableau, Excel
• OLAP / Cube: Multidimensional Analytics; MDX Tools; Excel; Reporting; BI Tools: Tableau, Microstrategy, Cognos
• ACID / MERGE: Continuous Ingestion from Operational DBMS; Slowly Changing Dimensions

Core Platform:
• Scale-Out Storage: petabyte-scale processing
• Core SQL Engine: Apache Tez (scalable distributed processing); advanced cost-based optimizer; comprehensive SQL:2011 coverage
• Connectivity: JDBC / ODBC; MDX
• Advanced Security
32
For More Details
• Today:
– Interactive Analytics at Scale in Apache Hive Using Druid – 12:20
– Information is Beautiful: Apache Zeppelin Edition – 14:10
– LLAP: Sub-Second Analytical Queries in Hive – 15:00
– Apache Atlas: Governance for Your Data – 16:10
– An Overview on Optimization in Apache Hive: Past, Present, Future – 16:10
– An Approach for Multi-Tenancy Through Apache Knox – 17:00
• Tomorrow:
– Cloudy with a Chance of Hadoop – Real World Considerations – 11:30
– Row/Column-Level Security in SQL for Apache Spark – 14:10
– Unleashing the Power of Apache Atlas with Apache Ranger – 15:00
– Birds of a Feather sessions – 17:50

Hive EDW – DataWorks Summit EU, April 2017


Editor's Notes

  • #4 Notes: This being Apache, there are 50 ways to assemble this; I’m going to cover one. There are a lot of parts in the picture; I won’t be able to cover them all. For several of these I want to look at what’s there now, and what communities are working on to improve the experience.
  • #8 What tools/processes have you tried to attach to Hadoop and have been unable to do so? Why?
  • #12 Hive version = Hive 2.1; Presto version = Presto 0.163
  • #13 Additional details: the queries are 20 interactive queries taken from TPC-DS, using 1 TB of data in the TPC-DS schema. Queries run at random using a JMeter test harness.
  • #18 Extend the reach of Hadoop APIs: a gateway for Hadoop’s REST APIs. Different REST APIs have varying levels of AuthN, AuthZ, SSL, and SSO capability. Enterprise authentication: apply enterprise capabilities to all REST APIs – IdM integration, SSO, OAuth, SAML. Avoid exposing cluster ports and hostnames to all users.
  • #23 Show – clearly identify customer metadata. Change – add a customer classification example (Aetna) so the use-case story has continuity. Use DX procedures for diagnosis; bring metadata from external systems into Hadoop – keep it together.
  • #26 Maps well to Yahoo case
  • #31 Notes: This being Apache, there are 50 ways to assemble this; I’m going to cover one. There are a lot of parts in the picture; I won’t be able to cover them all. For several of these I want to look at what’s there now, and what communities are working on to improve the experience.