SlideShare a Scribd company logo
Scalable Data Warehousing on
Hadoop
Alan F. Gates, Co-founder, Hortonworks
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
What Do You Expect in a Hadoop Data Warehouse?
Benchmarks focus on two questions:
– How much of the TPC-DS query set can it run?
– How fast can it run it?
3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
What You Expect in a Data Warehouse?
High Performance
SQL 2011
High Storage Capacity
Security
Support for BI,
Cubes, Data Science
Monitoring & Management
Governance
Data Lifecycle Management
Replication & D/R
Workload Management
Data Ingestion
4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
So, back to TPC-DS...
High Performance
SQL 2011
5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Hive Overview
Apache Hive is a SQL data warehouse engine that
delivers fast, scalable SQL processing on Hadoop and
in the Cloud.
Features:
• Extensive SQL:2011 Support
• ACID Transactions
• In-Memory Caching
• Cost-Based Optimizer
• User-Based Dynamic Security
• JDBC and ODBC Support
• Compatible with every major BI Tool
• Proven at 300+ PB Scale
6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Hive: Fast Facts
Most Queries Per Hour
100,000 Queries Per Hour
(Yahoo Japan)
Analytics Performance
100 Million rows/s Per Node
(with Hive LLAP)
Largest Hive Warehouse
300+ PB Raw Storage
(Facebook)
Largest Cluster
4,500+ Nodes
(Yahoo)
7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Data Types SQL Features File Formats Futures
Numeric Core SQL Features Columnar ACID MERGE
FLOAT, DOUBLE Date, Time and Arithmetical Functions ORCFile Multi Subquery
DECIMAL INNER, OUTER, CROSS and SEMI Joins Parquet Scalar Subqueries
INT, TINYINT, SMALLINT, BIGINT Derived Table Subqueries Text Non-Equijoins
BOOLEAN Correlated + Uncorrelated Subqueries CSV INTERSECT / EXCEPT
String UNION ALL Logfile
CHAR, VARCHAR UDFs, UDAFs, UDTFs Nested / Complex Recursive CTEs
BLOB (BINARY), CLOB (String) Common Table Expressions Avro NOT NULL Constraints
Date, Time UNION DISTINCT JSON Default Values
DATE, TIMESTAMP, Interval Types Advanced Analytics XML Multi-statement Transactions
Complex Types OLAP and Windowing Functions Custom Formats
ARRAY / MAP / STRUCT / UNION OLAP: Partition, Order by UDAF Other Features
Nested Data Analytics CUBE and Grouping Sets XPath Analytics
Nested Data Traversal ACID Transactions
Lateral Views INSERT / UPDATE / DELETE
Procedural Extensions Constraints
HPL/SQL Primary / Foreign Key (Non Validated)
Apache Hive: Journey to SQL:2011 Analytics
Legend
New
Future work
Hive 2
Track Hive SQL:2011 Complete: HIVE-13554
8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Hive 2 with LLAP: Architecture Overview
Deep
Storage
YARN Cluster
LLAP Daemon
Query
Executors
LLAP Daemon
Query
Executors
LLAP Daemon
Query
Executors
LLAP Daemon
Query
Executors
Query
Coordinators
Coord-
inator
Coord-
inator
Coord-
inator
HiveServer2
(Query
Endpoint)
ODBC /
JDBC
SQL
Queries In-Memory Cache
(Shared Across All Users)
HDFS and
Compatible
S3 WASB Isilon
9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
0
5
10
15
20
25
30
35
40
45
50
0
50
100
150
200
250
Speedup(xFactor)
QueryTime(s)(LowerisBetter)
Hive 2 with LLAP averages 26x faster than Hive 1
Hive 1 / Tez Time (s) Hive 2 / LLAP Time(s) Speedup (x Factor)
Hive 2 with LLAP: 25+x Performance Boost: Interactive / 1TB Scale
10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Hive vs. Apache Impala at 10TB
 10TB scale on 10 identical
AWS nodes.
 Hive and Impala showed
similar times on most
smaller queries.
 Hive scaled better, with
many queries completing in
<2m where Impala ran to
timeout (3000s).
Highlights
11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Hive vs. Presto on a partitioned 1TB dataset.
 Presto lacks basic
performance optimizations
like dynamic partition
pruning.
 On a real dataset / workload
Presto perform poorly
without full re-writes.
 Example: Query 55 without
re-writes = 185.17s, with re-
writes = 16s. LLAP = 1.37s.
Highlights
12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Hive LLAP: Stable Performance under High Concurrency
4x Queries,
2.8x
Runtime
Difference
5x Queries,
4.6x
Runtime
Difference
Mark
Concurrent
Queries
Average
Runtime
5 7.76s
25 36.24s
100 102.89s
13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
How Much Can it Hold, and Where?
High Storage Capacity
14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Storage
 Of course HDFS, default in the Hadoop world
 More and more cloud
 Move is copy in S3, but current implementation assumes move is atomic and nearly free
– modifying Hadoop (HADOOP-11694) and Hive (HIVE-14535)
 ACID in the cloud
– Compactor moves a lot of files around, need to optimize
– Need to figure out how streaming ingest works in the cloud
 LLAP, caching much more valuable in the cloud
– Looking at flushing cache to SSD so misses are less costly
15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Is My Data Safe?
Security
16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
• Wire
encryption
• HDFS
encryption +
Ranger KMS
• Centralized
audit
reporting w/
Apache
Ranger
• Fine grain
access control
with Apache
Ranger
Security today in Hadoop
Authorization
What can I do?
Audit
What did I do?
Data Protection
Can data be encrypted at
rest and over the wire?
• Kerberos
• API security
with Apache
Knox
Authentication
Who am I/prove
it?
Centralized Security Administration w/ Ranger & Knox
17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Authentication—API Security with Knox
• Eliminates SSH “edge node”
• Central API management
• Central audit control
• Service level authorization
• SSO - SAMLv2, Siteminder
and OAM
• LDAP and AD integration
• SSO for Hadoop UIs (Ranger,
Ambari..)
Apache Knox extends the reach of Hadoop REST API without
Kerberos complexities
Integrated with existing IdM
systems
Single, simple point of
access for a cluster
Centralized and consistent
secure API across one or
more clusters
• Kerberos Encapsulation
• Single Hadoop access point
• REST API hierarchy
• Consolidated API calls
• Multi-cluster support
18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
LLAP Data Access
User ID Region Total Spend
1 East 5,131
2 East 27,828
3 West 55,493
4 West 7,193
5 East 18,193
Apache Ranger: Per-User Row Filtering by Region in Hive
User 2
(East Region)
User 1
(West Region)
Original Query:
SELECT * from CUSTOMERS
WHERE total_spend > 10000
Query Rewrites based on
Dynamic Ranger Policies
Dynamic Rewrite:
SELECT * from CUSTOMERS
WHERE total_spend > 10000
AND region = “east”
Dynamic Rewrite:
SELECT * from CUSTOMERS
WHERE total_spend > 10000
AND region = “west”
19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Ranger: Dynamic Data Masking of Hive Columns
R A N G E R
Protect Sensitive Data in real-time with Dynamic Data Masking/Obfuscation!
Goal: Mask or anonymize sensitive columns of data
(e.g. PII, PCI, PHI) from Hive query output
⬢ Benefits
– Sensitive information never leaves database
– No changes are required at the application or Hive layer
– No need to produce additional protected duplicate
versions of datasets
– Simple & easy to setup masking policies
⬢ Core Technologies: Ranger, Hive
AT L A S
H I V E
20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Dynamic Tag-based Access Policies with Apache Atlas
• Basic Tag policy – PII example. Access and
entitlements must be tag based ABAC and scalable in
implementation.
• Geo-based policy – Policy based on IP address, proxy
IP substitution maybe required. The rule
enforcement must be geo aware.
• Time-based policy – Timer for data access, de-
coupled from deletion of data.
• Prohibitions – Prevention of combination of Hive
tables that may pose a risk together.
Key Benefits:
New scalable metadata
based security paradigm
Dynamic, real-time policy
Active protection – fast
updates to changes
Centralized and simple to
manage policy
21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
What’s There and Where Did It Come From?
Governance
22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Sqoop
Teradata
Connector
Apache
Kafka
Apache Atlas: Cross-Component Dataset Lineage
Custom
Activity
Reporter
Metadata
Repository
RDBMS
Any process
using Sqoop is
covered
No other tool
tracks IOT out
of the box
23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Atlas Enables Business Catalog for Ease of Use
 Organize data assets along business terms
– Authoritative: Hierarchical Taxonomy Creation
– Agile modeling: Model Conceptual, Logical, Physical assets
– Definition and assignment of tags like PII (Personally
Identifiable Information)
 Comprehensive features for compliance
– Multiple user profiles including Data Steward and Business
Analysts
– Object auditing to track “Who did it”
– Metadata Versioning to track ”what did they do”
 Faster Insight:
– Data Quality tab for profiling and sampling
– User Comments
Key Benefits:
Organize data assets along
business terms
Compliance Features
Faster Insight
24 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
How Will My Users Interact With It?
Support for BI,
Cubes, Data Science
25 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Druid: Deep Multidimensional Analytics
Real-Time
Analytics
Hive /
Spark
BI Tools
REST
API
Superset
UI
Events
Logs
Trans-
actions
Sensors
Historical
Sources
HDFS S3
Druid Data Cubes
Ultra-Fast Analytics
Slice-and-Dice
Streaming
Sources
Storm
Kafka Spark
Deep, Fast Drilldown
Across Any Dimension
Scalably Ingest Historical Data from
Transactional and Web Systems
= Future
26 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Druid’s Role in Scalable Data Warehousing
UI
Core Platform
S3 or HDFS
HiveServer2
MDX
Unified SQL and MDX Layer
SQL BI Tools MDX Tools
Hive
Realtime Feeds
(Kafka, Storm, etc.)
Druid
OLAP Indexes
HiveServer2
Hive SQL
Thrift Server
SparkSQL
Fast SQL MDX
Superset UI
Fast Exploration
Ranger
Atlas
Ambari
Management
27 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Analytics at Scale with No Data Movement
Syncsort
High-Performance
Data Movement
Hadoop
Scalable Storage and Compute
Hive LLAP
High Performance SQL
AtScale Intelligence Platform
OLAP Cubes for Higher Performance
Source Data
Systems
Fast, scalable SQL analytics
Intelligent in-memory caching
Define OLAP cubes for 10x faster queries
Unified semantic layer for all BI tools
High performance data import
from all major EDW platforms
Pre-aggregated
data
... Or, full-fidelity
data
28 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark Column Security with LLAP
 Fine-Grained Column Level Access Control for SparkSQL.
 Fully dynamic policies per user. Doesn’t require views.
 Use Standard Ranger policies and tools to control access and masking policies.
Flow:
1. SparkSQL gets data locations
known as “splits” from HiveServer
and plans query.
2. HiveServer2 authorizes access
using Ranger. Per-user policies
like row filtering are applied.
3. Spark gets a modified query plan
based on dynamic security policy.
4. Spark reads data from LLAP.
Filtering / masking guaranteed by
LLAP server.
HiveServer2
Authorization
Hive Metastore
Data Locations
View Definitions
LLAP
Data Read
Filter Pushdown
Ranger Server
Dynamic Policies
Spark Client
1
2
4
3
29 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Zeppelin, Attaches to Hive and Spark
30 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
But Wait, There’s More
Monitoring & Management
Data Lifecycle Management
Replication & D/R
Data Ingestion
31 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Scalable Data Warehousing on Hadoop
Capabilities
Batch SQL OLAP / CubeInteractive SQL
Sub-Second
SQL
ACID / MERGE
Applications
• ETL
• Reporting
• Data Mining
• Deep Analytics
• Multidimensional
Analytics
• MDX Tools
• Excel
• Reporting
• BI Tools: Tableau,
Microstrategy,
Cognos
• Ad-Hoc
• Drill-Down
• BI Tools: Tableau,
Excel
• Continuous
Ingestion from
Operational DBMS
• Slowly Changing
Dimensions
Existing
Development
Emerging
Legend
Core
Platform
Scale-Out Storage
Petabyte Scale
Processing
Core SQL Engine
Apache Tez: Scalable
Distributed Processing
Advanced Cost-Based
Optimizer
Connectivity
Advanced Security
JDBC / ODBC
Comprehensive
SQL:2011 Coverage
MDX
32 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
For More Details
 Today
– LLAP: Building Cloud First BI – 5:50pm
 Wednesday
– An Overview of Optimization in Apache Hive: Past, Present, Future – 5:00pm
 Thursday
– Transactional SQL in Apache Hive – 3:00pm

More Related Content

What's hot

Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
Cloudera, Inc.
 
Cloud-Native Apache Spark Scheduling with YuniKorn Scheduler
Cloud-Native Apache Spark Scheduling with YuniKorn SchedulerCloud-Native Apache Spark Scheduling with YuniKorn Scheduler
Cloud-Native Apache Spark Scheduling with YuniKorn Scheduler
Databricks
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
Alluxio, Inc.
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
ScyllaDB
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to RedisDvir Volk
 
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Flink Forward
 
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
HostedbyConfluent
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
Databricks
 
Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013
Julien Le Dem
 
Apache Arrow Flight Overview
Apache Arrow Flight OverviewApache Arrow Flight Overview
Apache Arrow Flight Overview
Jacques Nadeau
 
Building real time analytics applications using pinot : A LinkedIn case study
Building real time analytics applications using pinot : A LinkedIn case studyBuilding real time analytics applications using pinot : A LinkedIn case study
Building real time analytics applications using pinot : A LinkedIn case study
Kishore Gopalakrishna
 
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Julien Le Dem
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
Spark Summit
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta Lake
Databricks
 
Understanding the architecture of MariaDB ColumnStore
Understanding the architecture of MariaDB ColumnStoreUnderstanding the architecture of MariaDB ColumnStore
Understanding the architecture of MariaDB ColumnStore
MariaDB plc
 
Battle of the Stream Processing Titans – Flink versus RisingWave
Battle of the Stream Processing Titans – Flink versus RisingWaveBattle of the Stream Processing Titans – Flink versus RisingWave
Battle of the Stream Processing Titans – Flink versus RisingWave
Yingjun Wu
 
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
Databricks
 

What's hot (20)

Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
 
Cloud-Native Apache Spark Scheduling with YuniKorn Scheduler
Cloud-Native Apache Spark Scheduling with YuniKorn SchedulerCloud-Native Apache Spark Scheduling with YuniKorn Scheduler
Cloud-Native Apache Spark Scheduling with YuniKorn Scheduler
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
 
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013
 
Apache Arrow Flight Overview
Apache Arrow Flight OverviewApache Arrow Flight Overview
Apache Arrow Flight Overview
 
HBase Low Latency
HBase Low LatencyHBase Low Latency
HBase Low Latency
 
Building real time analytics applications using pinot : A LinkedIn case study
Building real time analytics applications using pinot : A LinkedIn case studyBuilding real time analytics applications using pinot : A LinkedIn case study
Building real time analytics applications using pinot : A LinkedIn case study
 
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta Lake
 
Understanding the architecture of MariaDB ColumnStore
Understanding the architecture of MariaDB ColumnStoreUnderstanding the architecture of MariaDB ColumnStore
Understanding the architecture of MariaDB ColumnStore
 
Battle of the Stream Processing Titans – Flink versus RisingWave
Battle of the Stream Processing Titans – Flink versus RisingWaveBattle of the Stream Processing Titans – Flink versus RisingWave
Battle of the Stream Processing Titans – Flink versus RisingWave
 
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
 

Similar to An Apache Hive Based Data Warehouse

Hive acid and_2.x new_features
Hive acid and_2.x new_featuresHive acid and_2.x new_features
Hive acid and_2.x new_features
Alberto Romero
 
An Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseAn Apache Hive Based Data Warehouse
An Apache Hive Based Data Warehouse
DataWorks Summit
 
Big data spain keynote nov 2016
Big data spain keynote nov 2016Big data spain keynote nov 2016
Big data spain keynote nov 2016
alanfgates
 
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...
Big Data Spain
 
Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics...
Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics...Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics...
Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics...
VMware Tanzu
 
Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30
Ashish Narasimham
 
Future of Data New Jersey - HDF 3.0 Deep Dive
Future of Data New Jersey - HDF 3.0 Deep DiveFuture of Data New Jersey - HDF 3.0 Deep Dive
Future of Data New Jersey - HDF 3.0 Deep Dive
Aldrin Piri
 
Paris FOD Meetup #5 Hortonworks Presentation
Paris FOD Meetup #5 Hortonworks PresentationParis FOD Meetup #5 Hortonworks Presentation
Paris FOD Meetup #5 Hortonworks Presentation
Abdelkrim Hadjidj
 
Dynamic Column Masking and Row-Level Filtering in HDP
Dynamic Column Masking and Row-Level Filtering in HDPDynamic Column Masking and Row-Level Filtering in HDP
Dynamic Column Masking and Row-Level Filtering in HDP
Hortonworks
 
Hortonworks and Platfora in Financial Services - Webinar
Hortonworks and Platfora in Financial Services - WebinarHortonworks and Platfora in Financial Services - Webinar
Hortonworks and Platfora in Financial Services - Webinar
Hortonworks
 
Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善
Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善
Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善
HortonworksJapan
 
Moving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloudMoving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloud
DataWorks Summit/Hadoop Summit
 
SoCal BigData Day
SoCal BigData DaySoCal BigData Day
SoCal BigData Day
John Park
 
Treat your enterprise data lake indigestion: Enterprise ready security and go...
Treat your enterprise data lake indigestion: Enterprise ready security and go...Treat your enterprise data lake indigestion: Enterprise ready security and go...
Treat your enterprise data lake indigestion: Enterprise ready security and go...
DataWorks Summit
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - Tokyo
DataWorks Summit
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?
DataWorks Summit
 
HDF Powered by Apache NiFi Introduction
HDF Powered by Apache NiFi IntroductionHDF Powered by Apache NiFi Introduction
HDF Powered by Apache NiFi Introduction
Milind Pandit
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
DataWorks Summit/Hadoop Summit
 
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterpriseUsing Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
DataWorks Summit
 
Curing the Kafka blindness—Streams Messaging Manager
Curing the Kafka blindness—Streams Messaging ManagerCuring the Kafka blindness—Streams Messaging Manager
Curing the Kafka blindness—Streams Messaging Manager
DataWorks Summit
 

Similar to An Apache Hive Based Data Warehouse (20)

Hive acid and_2.x new_features
Hive acid and_2.x new_featuresHive acid and_2.x new_features
Hive acid and_2.x new_features
 
An Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseAn Apache Hive Based Data Warehouse
An Apache Hive Based Data Warehouse
 
Big data spain keynote nov 2016
Big data spain keynote nov 2016Big data spain keynote nov 2016
Big data spain keynote nov 2016
 
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...
 
Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics...
Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics...Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics...
Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics...
 
Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30
 
Future of Data New Jersey - HDF 3.0 Deep Dive
Future of Data New Jersey - HDF 3.0 Deep DiveFuture of Data New Jersey - HDF 3.0 Deep Dive
Future of Data New Jersey - HDF 3.0 Deep Dive
 
Paris FOD Meetup #5 Hortonworks Presentation
Paris FOD Meetup #5 Hortonworks PresentationParis FOD Meetup #5 Hortonworks Presentation
Paris FOD Meetup #5 Hortonworks Presentation
 
Dynamic Column Masking and Row-Level Filtering in HDP
Dynamic Column Masking and Row-Level Filtering in HDPDynamic Column Masking and Row-Level Filtering in HDP
Dynamic Column Masking and Row-Level Filtering in HDP
 
Hortonworks and Platfora in Financial Services - Webinar
Hortonworks and Platfora in Financial Services - WebinarHortonworks and Platfora in Financial Services - Webinar
Hortonworks and Platfora in Financial Services - Webinar
 
Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善
Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善
Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善
 
Moving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloudMoving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloud
 
SoCal BigData Day
SoCal BigData DaySoCal BigData Day
SoCal BigData Day
 
Treat your enterprise data lake indigestion: Enterprise ready security and go...
Treat your enterprise data lake indigestion: Enterprise ready security and go...Treat your enterprise data lake indigestion: Enterprise ready security and go...
Treat your enterprise data lake indigestion: Enterprise ready security and go...
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - Tokyo
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?
 
HDF Powered by Apache NiFi Introduction
HDF Powered by Apache NiFi IntroductionHDF Powered by Apache NiFi Introduction
HDF Powered by Apache NiFi Introduction
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterpriseUsing Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
 
Curing the Kafka blindness—Streams Messaging Manager
Curing the Kafka blindness—Streams Messaging ManagerCuring the Kafka blindness—Streams Messaging Manager
Curing the Kafka blindness—Streams Messaging Manager
 

More from DataWorks Summit

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
DataWorks Summit
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
DataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
DataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
DataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
DataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
DataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
DataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
DataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
DataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
DataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
DataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
DataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 

Recently uploaded (20)

De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 

An Apache Hive Based Data Warehouse

  • 1. Scalable Data Warehousing on Hadoop Alan F. Gates, Co-founder, Hortonworks
  • 2. 2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved What Do You Expect in a Hadoop Data Warehouse? Benchmarks focus on two questions: – How much of the TPC-DS query set can it run? – How fast can it run it?
  • 3. 3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved What You Expect in a Data Warehouse? High Performance SQL 2011 High Storage Capacity Security Support for BI, Cubes, Data Science Monitoring & Management Governance Data Lifecycle Management Replication & D/R Workload Management Data Ingestion
  • 4. 4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved So, back to TPC-DS... High Performance SQL 2011
  • 5. 5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache Hive Overview Apache Hive is a SQL data warehouse engine that delivers fast, scalable SQL processing on Hadoop and in the Cloud. Features: • Extensive SQL:2011 Support • ACID Transactions • In-Memory Caching • Cost-Based Optimizer • User-Based Dynamic Security • JDBC and ODBC Support • Compatible with every major BI Tool • Proven at 300+ PB Scale
  • 6. 6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache Hive: Fast Facts Most Queries Per Hour 100,000 Queries Per Hour (Yahoo Japan) Analytics Performance 100 Million rows/s Per Node (with Hive LLAP) Largest Hive Warehouse 300+ PB Raw Storage (Facebook) Largest Cluster 4,500+ Nodes (Yahoo)
  • 7. 7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Data Types SQL Features File Formats Futures Numeric Core SQL Features Columnar ACID MERGE FLOAT, DOUBLE Date, Time and Arithmetical Functions ORCFile Multi Subquery DECIMAL INNER, OUTER, CROSS and SEMI Joins Parquet Scalar Subqueries INT, TINYINT, SMALLINT, BIGINT Derived Table Subqueries Text Non-Equijoins BOOLEAN Correlated + Uncorrelated Subqueries CSV INTERSECT / EXCEPT String UNION ALL Logfile CHAR, VARCHAR UDFs, UDAFs, UDTFs Nested / Complex Recursive CTEs BLOB (BINARY), CLOB (String) Common Table Expressions Avro NOT NULL Constraints Date, Time UNION DISTINCT JSON Default Values DATE, TIMESTAMP, Interval Types Advanced Analytics XML Multi-statement Transactions Complex Types OLAP and Windowing Functions Custom Formats ARRAY / MAP / STRUCT / UNION OLAP: Partition, Order by UDAF Other Features Nested Data Analytics CUBE and Grouping Sets XPath Analytics Nested Data Traversal ACID Transactions Lateral Views INSERT / UPDATE / DELETE Procedural Extensions Constraints HPL/SQL Primary / Foreign Key (Non Validated) Apache Hive: Journey to SQL:2011 Analytics Legend New Future work Hive 2 Track Hive SQL:2011 Complete: HIVE-13554
  • 8. 8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Hive 2 with LLAP: Architecture Overview Deep Storage YARN Cluster LLAP Daemon Query Executors LLAP Daemon Query Executors LLAP Daemon Query Executors LLAP Daemon Query Executors Query Coordinators Coord- inator Coord- inator Coord- inator HiveServer2 (Query Endpoint) ODBC / JDBC SQL Queries In-Memory Cache (Shared Across All Users) HDFS and Compatible S3 WASB Isilon
  • 9. 9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved 0 5 10 15 20 25 30 35 40 45 50 0 50 100 150 200 250 Speedup(xFactor) QueryTime(s)(LowerisBetter) Hive 2 with LLAP averages 26x faster than Hive 1 Hive 1 / Tez Time (s) Hive 2 / LLAP Time(s) Speedup (x Factor) Hive 2 with LLAP: 25+x Performance Boost: Interactive / 1TB Scale
  • 10. 10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache Hive vs. Apache Impala at 10TB  10TB scale on 10 identical AWS nodes.  Hive and Impala showed similar times on most smaller queries.  Hive scaled better, with many queries completing in <2m where Impala ran to timeout (3000s). Highlights
  • 11. 11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache Hive vs. Presto on a partitioned 1TB dataset.  Presto lacks basic performance optimizations like dynamic partition pruning.  On a real dataset / workload Presto perform poorly without full re-writes.  Example: Query 55 without re-writes = 185.17s, with re- writes = 16s. LLAP = 1.37s. Highlights
  • 12. 12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Hive LLAP: Stable Performance under High Concurrency 4x Queries, 2.8x Runtime Difference 5x Queries, 4.6x Runtime Difference Mark Concurrent Queries Average Runtime 5 7.76s 25 36.24s 100 102.89s
  • 13. 13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved How Much Can it Hold, and Where? High Storage Capacity
  • 14. 14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Storage  Of course HDFS, default in the Hadoop world  More and more cloud  Move is copy in S3, but current implementation assumes move is atomic and nearly free – modifying Hadoop (HADOOP-11694) and Hive (HIVE-14535)  ACID in the cloud – Compactor moves a lot of files around, need to optimize – Need to figure out how streaming ingest works in the cloud  LLAP, caching much more valuable in the cloud – Looking at flushing cache to SSD so misses are less costly
  • 15. 15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Is My Data Safe? Security
  • 16. 16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved • Wire encryption • HDFS encryption + Ranger KMS • Centralized audit reporting w/ Apache Ranger • Fine grain access control with Apache Ranger Security today in Hadoop Authorization What can I do? Audit What did I do? Data Protection Can data be encrypted at rest and over the wire? • Kerberos • API security with Apache Knox Authentication Who am I/prove it? Centralized Security Administration w/ Ranger & Knox
  • 17. 17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Authentication—API Security with Knox • Eliminates SSH “edge node” • Central API management • Central audit control • Service level authorization • SSO - SAMLv2, Siteminder and OAM • LDAP and AD integration • SSO for Hadoop UIs (Ranger, Ambari..) Apache Knox extends the reach of Hadoop REST API without Kerberos complexities Integrated with existing IdM systems Single, simple point of access for a cluster Centralized and consistent secure API across one or more clusters • Kerberos Encapsulation • Single Hadoop access point • REST API hierarchy • Consolidated API calls • Multi-cluster support
  • 18. 18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved LLAP Data Access User ID Region Total Spend 1 East 5,131 2 East 27,828 3 West 55,493 4 West 7,193 5 East 18,193 Apache Ranger: Per-User Row Filtering by Region in Hive User 2 (East Region) User 1 (West Region) Original Query: SELECT * from CUSTOMERS WHERE total_spend > 10000 Query Rewrites based on Dynamic Ranger Policies Dynamic Rewrite: SELECT * from CUSTOMERS WHERE total_spend > 10000 AND region = “east” Dynamic Rewrite: SELECT * from CUSTOMERS WHERE total_spend > 10000 AND region = “west”
  • 19. 19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache Ranger: Dynamic Data Masking of Hive Columns R A N G E R Protect Sensitive Data in real-time with Dynamic Data Masking/Obfuscation! Goal: Mask or anonymize sensitive columns of data (e.g. PII, PCI, PHI) from Hive query output ⬢ Benefits – Sensitive information never leaves database – No changes are required at the application or Hive layer – No need to produce additional protected duplicate versions of datasets – Simple & easy to setup masking policies ⬢ Core Technologies: Ranger, Hive AT L A S H I V E
  • 20. 20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Dynamic Tag-based Access Policies with Apache Atlas • Basic Tag policy – PII example. Access and entitlements must be tag based ABAC and scalable in implementation. • Geo-based policy – Policy based on IP address, proxy IP substitution maybe required. The rule enforcement must be geo aware. • Time-based policy – Timer for data access, de- coupled from deletion of data. • Prohibitions – Prevention of combination of Hive tables that may pose a risk together. Key Benefits: New scalable metadata based security paradigm Dynamic, real-time policy Active protection – fast updates to changes Centralized and simple to manage policy
  • 21. 21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved What’s There and Where Did It Come From? Governance
  • 22. 22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Sqoop Teradata Connector Apache Kafka Apache Atlas: Cross-Component Dataset Lineage Custom Activity Reporter Metadata Repository RDBMS Any process using Sqoop is covered No other tool tracks IOT out of the box
  • 23. 23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache Atlas Enables Business Catalog for Ease of Use  Organize data assets along business terms – Authoritative: Hierarchical Taxonomy Creation – Agile modeling: Model Conceptual, Logical, Physical assets – Definition and assignment of tags like PII (Personally Identifiable Information)  Comprehensive features for compliance – Multiple user profiles including Data Steward and Business Analysts – Object auditing to track “Who did it” – Metadata Versioning to track ”what did they do”  Faster Insight: – Data Quality tab for profiling and sampling – User Comments Key Benefits: Organize data assets along business terms Compliance Features Faster Insight
  • 24. 24 © Hortonworks Inc. 2011 – 2016. All Rights Reserved How Will My Users Interact With It? Support for BI, Cubes, Data Science
  • 25. 25 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Druid: Deep Multidimensional Analytics Real-Time Analytics Hive / Spark BI Tools REST API Superset UI Events Logs Trans- actions Sensors Historical Sources HDFS S3 Druid Data Cubes Ultra-Fast Analytics Slice-and-Dice Streaming Sources Storm Kafka Spark Deep, Fast Drilldown Across Any Dimension Scalably Ingest Historical Data from Transactional and Web Systems = Future
  • 26. 26 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Druid’s Role in Scalable Data Warehousing UI Core Platform S3 or HDFS HiveServer2 MDX Unified SQL and MDX Layer SQL BI Tools MDX Tools Hive Realtime Feeds (Kafka, Storm, etc.) Druid OLAP Indexes HiveServer2 Hive SQL Thrift Server SparkSQL Fast SQL MDX Superset UI Fast Exploration Ranger Atlas Ambari Management
  • 27. 27 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Analytics at Scale with No Data Movement Syncsort High-Performance Data Movement Hadoop Scalable Storage and Compute Hive LLAP High Performance SQL AtScale Intelligence Platform OLAP Cubes for Higher Performance Source Data Systems Fast, scalable SQL analytics Intelligent in-memory caching Define OLAP cubes for 10x faster queries Unified semantic layer for all BI tools High performance data import from all major EDW platforms Pre-aggregated data ... Or, full-fidelity data
  • 28. 28 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark Column Security with LLAP  Fine-Grained Column Level Access Control for SparkSQL.  Fully dynamic policies per user. Doesn’t require views.  Use Standard Ranger policies and tools to control access and masking policies. Flow: 1. SparkSQL gets data locations known as “splits” from HiveServer and plans query. 2. HiveServer2 authorizes access using Ranger. Per-user policies like row filtering are applied. 3. Spark gets a modified query plan based on dynamic security policy. 4. Spark reads data from LLAP. Filtering / masking guaranteed by LLAP server. HiveServer2 Authorization Hive Metastore Data Locations View Definitions LLAP Data Read Filter Pushdown Ranger Server Dynamic Policies Spark Client 1 2 4 3
  • 29. 29 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache Zeppelin, Attaches to Hive and Spark
  • 30. 30 © Hortonworks Inc. 2011 – 2016. All Rights Reserved But Wait, There’s More Monitoring & Management Data Lifecycle Management Replication & D/R Data Ingestion
  • 31. 31 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Scalable Data Warehousing on Hadoop Capabilities Batch SQL OLAP / CubeInteractive SQL Sub-Second SQL ACID / MERGE Applications • ETL • Reporting • Data Mining • Deep Analytics • Multidimensional Analytics • MDX Tools • Excel • Reporting • BI Tools: Tableau, Microstrategy, Cognos • Ad-Hoc • Drill-Down • BI Tools: Tableau, Excel • Continuous Ingestion from Operational DBMS • Slowly Changing Dimensions Existing Development Emerging Legend Core Platform Scale-Out Storage Petabyte Scale Processing Core SQL Engine Apache Tez: Scalable Distributed Processing Advanced Cost-Based Optimizer Connectivity Advanced Security JDBC / ODBC Comprehensive SQL:2011 Coverage MDX
  • 32. 32 © Hortonworks Inc. 2011 – 2016. All Rights Reserved For More Details  Today – LLAP: Building Cloud First BI – 5:50pm  Wednesday – An Overview of Optimization in Apache Hive: Past, Present, Future – 5:00pm  Thursday – Transactional SQL in Apache Hive – 3:00pm

Editor's Notes

  1. Notes: This being Apache there are 50 ways to assemble this, I’m going to cover one There are a lot of parts in the picture, I won’t be able to cover them all For several of these I want to look at what’s there now, and what communities are working on to improve this experience
  2. What tools/processes have you tried to attach to Hadoop and have been unable to do so? Why?
  3. Hive version = Hive 2.1 Presto version = Presto 0.163
  4. Additional details: Queries are 20 interactive queries taken from TPC-DS using 1 TB of data from TPC-DS schema. Queries run at random using a jmeter test harness
  5. Extend reach of Hadoop APIs Gateway for Hadoop’s REST API Different REST APIs have varying levels of AuthN, AuthZ, SSL, SSO Capability Enterprise authentication Apply Enterprise capabilities to All REST APIs – IdM Integration, SSO, OAuth, SAML Avoid exposing Cluster port, hostnames to all users
  6. Show – clearly identify customer metadata. Change Add customer classification example – Aetna – make the use case story have continuity. Use DX procedures to diagnosis ** bring meta from external systems into hadoop – keep it together
  7. Maps well to Yahoo case
  8. Notes: This being Apache there are 50 ways to assemble this, I’m going to cover one There are a lot of parts in the picture, I won’t be able to cover them all For several of these I want to look at what’s there now, and what communities are working on to improve this experience