SoCal BigData Day

© Hortonworks Inc. 2011 – 2017 All Rights Reserved
SQL on Hadoop: Batch, Interactive and Beyond
SoCal Big Data Day
John Park
Solution Engineer, Hortonworks
Rm 138-140

Disclaimer
This document may contain product features and technology directions that are under development, may be
under development in the future or may ultimately not be developed.
Project capabilities are based on information that is publicly available within the Apache Software Foundation
project websites ("Apache"). Progress of the project capabilities can be tracked from inception to release
through Apache; however, technical feasibility, market demand, user feedback and the overarching Apache
Software Foundation community development process can all affect timing and final delivery.
This document’s description of these features and technology directions does not represent a contractual
commitment, promise or obligation from Hortonworks to deliver these features in any generally available
product.
Product features and technology directions are subject to change, and must not be included in contracts,
purchase orders, or sales agreements of any kind.
Since this document contains an outline of general product development plans, customers should not rely upon
it when making purchasing decisions.

Presenter: John Park
• Solution Engineer, SoCal
• Data science, ETL, data warehousing, software design, architecture
• Previous – Various startups, Qlik, DW consultant, NCR
• Current – Helping customers implement and understand open source big data platforms
• Twitter: @jpark328
• Email: jpark@hortonworks.com

Before We Begin
• We have a raffle
• 2 winners at the end of the presentation
• Prize – Amazon Echo Dot
• Ask questions
• Survey link: https://www.surveymonkey.com/r/940amSQLHadoopBatch

SQL is King
Why?
– Familiarity
  • Primary technical language for business analysts
– Powerful
  • Maturation of RDBMS, EDW, OLTP
  • ACID compliant
– Flexible
  • Covers transactional processes to analytics
– Pervasive
  • Emergence of BI tools (Tableau, BOBJ, Cognos)
  • Deep ecosystem of tools

Overview of SQL on Hadoop Solutions
• SparkSQL: Spark's module for working with structured data. Run SQL queries alongside complex analytic algorithms.
• Apache Hive: A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
• Apache Phoenix: High-performance relational database layer over HBase for low-latency applications.
• Traditional MPP on Hadoop: Many traditionally architected MPP solutions have been ported to Hadoop, and some new ones have been developed from scratch.

SQL on Hadoop: Vitals
Project | First GA Release | Lines of Code (June 2015)(*) | Most Typical Use
Apache Hive | April 2009 (7 years) | 1 million | EDW / ETL Offload
SparkSQL | March 2015 (2 years) | 56.6k | Exploratory Analytics
Apache Phoenix | March 2014 (3 years) | 200k | Low-Latency Dashboards

Apache Hive: Fast Facts
• Most Queries Per Hour: 100,000 queries per hour (Yahoo Japan)
• Analytics Performance: 100 million rows/s per node (with Hive LLAP)
• Largest Hive Warehouse: 300+ PB raw storage (Facebook)
• Largest Cluster: 4,500+ nodes (Yahoo)

Phoenix and HBase: Fast Facts
• Largest Database: 5 petabytes (Flurry)
• Best Known App: Facebook Messages (Facebook)
• Fastest Ingestion: 10 million events/s (Yahoo)
• Biggest SQL App: Real-time SQL on 140m+ records (PubMatic)

Apache Hive: Strengths and Cautions
Strengths:
• Huge Datasets
• Deep SQL Analytics
• EDW Offload
• BI Integration
Cautions:
• Near-Real-Time

SparkSQL: Strengths and Cautions
Strengths:
• Language-Integrated Query
• Exploratory Analytics
Cautions:
• Large Datasets
• High Concurrency
• EDW Offload

Apache Phoenix: Strengths and Cautions
Strengths:
• Near-Real-Time Query
• Fast Updates
Cautions:
• Deep SQL Analytics
• Full-Table Scans / Scaled Analytics
• Existing BI Integrations

SQL on Hadoop - Good to know
• No one-size-fits-all solution
• Use cases and query patterns are important
• Prototype and fail fast
• Define scalability and performance criteria

Hive: Analytics Use Case
Financial Services Company:
– Analyze a large dataset to identify potential fraud.
– Replatformed from a mature EDW platform.
– Selection drivers: Breadth of SQL support, query performance, cloud consumption.
Use Case Vitals:
– Analyze > 25 billion transactions per week.
– More than 1.5 TB new data per day.
– > 4 PB historical data available for analysis through cloud infrastructure.
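
For a sense of scale, the vitals above can be turned into rough average rates. This is a back-of-the-envelope sketch rather than customer data: it treats the ">" figures as exact lower bounds, assumes decimal units (1 PB = 1,000 TB) and a uniform arrival rate, and the variable names are illustrative only.

```python
# Rough rates implied by the use case vitals (lower bounds, since the slide says ">").
SECONDS_PER_WEEK = 7 * 24 * 60 * 60

transactions_per_week = 25e9   # "> 25 billion transactions per week"
new_tb_per_day = 1.5           # "More than 1.5 TB new data per day"
historical_pb = 4.0            # "> 4PB historical data"

# Assumption: transactions arrive uniformly over the week.
tx_per_second = transactions_per_week / SECONDS_PER_WEEK

# Assumption: decimal units (1 PB = 1,000 TB) and a constant ingest rate.
days_to_accumulate = historical_pb * 1000 / new_tb_per_day

print(f"~{tx_per_second:,.0f} transactions/s on average")
print(f"~{days_to_accumulate / 365:.1f} years of ingest at 1.5 TB/day to reach 4 PB")
```

Under these assumptions the workload averages roughly 41,000 transactions per second, and 4 PB corresponds to about seven years of ingest at the stated daily rate.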

Hive Performance with Scaling - Customer results on HDP 2.2
[Chart: Scalability on Hive. Elapsed time (seconds) for the Multi join - Allocation, Aggregation and Total workloads at 5, 10, 20, 40 and 60 nodes]
Benchmark test | 5 nodes | 10 nodes | 20 nodes | 40 nodes* | 60 nodes*
Multi join | 24:02 | 14:33 | 10:32 | 06:54 | 05:49
Aggregation | 21:59 | 12:20 | 07:55 | 05:16 | 02:38
Total | 46:02 | 26:53 | 18:27 | 12:10 | 08:27
Same workload on EDW (full rack): 8:00
(*) Projected times based on 5, 10 and 20 node results.
Aggregation Workload
• 5% more time required on Hive.
• < 50% solution cost versus traditional EDW.
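
The scaling behavior in the benchmark table can be quantified as speedup and scaling efficiency relative to the 5-node run. The sketch below is only an illustration built on the table's Total row: it uses the three measured cluster sizes (the 40- and 60-node figures are projections), and the function and variable names are hypothetical. The final comparison assumes the "5% more time" callout refers to the projected 60-node total versus the EDW full-rack run.

```python
def to_seconds(mmss):
    """Convert an 'mm:ss' string from the table into seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)

# Total elapsed times from the table. Only 5, 10 and 20 nodes were measured;
# the projected 40- and 60-node figures are left out of the efficiency calculation.
measured_totals = {5: "46:02", 10: "26:53", 20: "18:27"}

def scaling_report(totals):
    """Return {nodes: (speedup, efficiency)} relative to the smallest cluster."""
    base_nodes = min(totals)
    base_time = to_seconds(totals[base_nodes])
    report = {}
    for nodes in sorted(totals):
        speedup = base_time / to_seconds(totals[nodes])
        efficiency = speedup / (nodes / base_nodes)  # 1.0 would be perfectly linear
        report[nodes] = (round(speedup, 2), round(efficiency, 2))
    return report

report = scaling_report(measured_totals)
for nodes, (speedup, efficiency) in report.items():
    print(f"{nodes} nodes: {speedup}x speedup, {efficiency:.0%} scaling efficiency")

# Projected 60-node total versus the EDW full-rack run (assumed comparison).
hive_overhead = to_seconds("08:27") / to_seconds("8:00") - 1
print(f"Hive (60 nodes, projected) vs EDW: {hive_overhead:.1%} more time")
```

On the measured runs this gives roughly 1.7x at 10 nodes and 2.5x at 20 nodes (about 86% and 62% efficiency), so scaling is sub-linear but still substantial. The projected 60-node total is about 5.6% slower than the EDW run, in line with the roughly 5% quoted above.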

SparkSQL Use Case: Medical
[Diagram components: Sensor Data, HDFS, Aggregations (Hive), HCatalog, SparkSQL, JDBC Connector, Analytical Tools]
- Sensor data streamed into HDFS
- Large-scale pre-aggregations done using Hive
- SparkSQL powered dashboard for fast analytics.

Phoenix at PubMatic
Near-Real-Time SQL over >15 TB of Data Using Apache Phoenix

Apache Phoenix at PubMatic
PubMatic offers marketing automation with real-time analytics that enable publishers to make smarter and faster decisions.
Key Concerns:
• To empower publishers to make real-time decisions, PubMatic needs a SQL solution that scales to terabytes of data yet can process hundreds of thousands of queries daily with near-real-time SLAs.
Solution:
• Phoenix is the only Open Source SQL Solution for Hadoop designed for near-real-time querying, giving PubMatic’s publishers the timely insight they need to optimize their advertising strategies.
• Phoenix’s linear scalability enables PubMatic to offer real-time query over more than 15 terabytes of data using commodity hardware.
• Phoenix’s ANSI SQL interface makes it easy for publishers to slice and dice data the way they want.
Read more at http://phoenix.apache.org/who_is_using.html

Evolution of Hive
Phase 1: Batch/ETL (Delivered: HDP 2.2)
• Transactions with ACID allowing insert, update and delete
• Temporary Tables
• Cost Based Optimizer optimizes complex join queries well.
Phase 2: Faster SQL (Delivered: HDP 2.5)
• Tech Preview: Sub-5-Second queries with LLAP
• Usability: SQL Query Editor, Visual Explain and Debugging
• Transparent Data Encryption
• Cross-Site Replication
• SQL, Performance Improvements
• Hive-on-Spark (Alpha / Beta)
Phase 3: Sub-Second with Rich Analytics (Planned: HDP 2.6*)
• Rich SQL:2011 Analytics
• Tech Preview: Druid OLAP Index for Hive
• GA: Sub-Second queries with LLAP
• Transaction Improvements (BEGIN/COMMIT/ROLLBACK, MERGE)

Apache Hive: Modern Architecture
[Architecture diagram, by layer]
• Storage: Columnar Storage (ORCFile, Parquet); Unstructured Data (JSON, CSV, Text, Avro, Custom, Weblog)
• Engine: SQL Engines (Row Engine, Vector Engine)
• SQL: SQL Support (SQL:2011, Optimizer, HCatalog, HiveServer2)
• Cache: Block Cache, Linux Cache, Vector Cache
• Distributed Execution: Hadoop 1 (MapReduce); Hadoop 2 (Tez, Spark); LLAP (Persistent Server)
• Legend: Historical, Current, In Development

Sub-Second Hive with LLAP
Sub Second:
• LLAP: Persistent server to instantly execute SQL queries.
• Caches hottest data in RAM.
• Overcomes latencies associated with Hive on Tez or Hive on Spark.
SQL Compatibility:
• 100% Compatible with Hive SQL.
• Compatible with existing tools (BI, ETL, etc.)
Security:
• Security via HiveServer2.
• Integrates with Apache Ranger.
[Diagram components: Hive SQL, HiveServer2, and three Hadoop Nodes, each running an LLAP Server with a Vector Cache. LLAP Servers: 1 per Hadoop Node]

Hive 2 with LLAP: Architecture Overview
[Architecture diagram components]
• SQL Queries arrive via ODBC / JDBC at HiveServer2 (Query Endpoint)
• Query Coordinators (three shown)
• YARN Cluster: LLAP Daemons (four shown), each with Query Executors
• In-Memory Cache (Shared Across All Users)
• Deep Storage: HDFS, S3 + Other HDFS Compatible Filesystems

Apache Hive: Journey to SQL:2011 Analytics
Data Types
– Numeric: FLOAT, DOUBLE; DECIMAL; INT, TINYINT, SMALLINT, BIGINT; BOOLEAN
– String: CHAR, VARCHAR; BLOB (BINARY), CLOB (String)
– Date, Time: DATE, TIMESTAMP, Interval Types
– Complex Types: ARRAY / MAP / STRUCT / UNION
– Nested Data Analytics: Nested Data Traversal; Lateral Views
– Procedural Extensions: HPL/SQL
SQL Features
– Core SQL Features: Date, Time and Arithmetical Functions; INNER, OUTER, CROSS and SEMI Joins; Derived Table Subqueries; Correlated + Uncorrelated Subqueries; UNION ALL; UDFs, UDAFs, UDTFs; Common Table Expressions; UNION DISTINCT
– Advanced Analytics: OLAP and Windowing Functions; OLAP: Partition, Order by UDAF; CUBE and Grouping Sets
– ACID Transactions: INSERT / UPDATE / DELETE
– Constraints: Primary / Foreign Key (Non Validated)
File Formats
– Columnar: ORCFile; Parquet
– Text: CSV; Logfile
– Nested / Complex: Avro; JSON; XML
– Custom Formats
– Other Features: XPath Analytics
Futures
– ACID MERGE; Multi Subquery; Scalar Subqueries; Non-Equijoins; INTERSECT / EXCEPT; Recursive CTEs; NOT NULL Constraints; Default Values; Multi Table Transactions
Legend: New; Projected: HDP 3.0; HDP 2.6
Track Hive SQL:2011 Complete: HIVE-13554

Phoenix SQL: Today and Tomorrow
Phoenix: SQL for HBase
SQL Datatypes (VARCHAR, INTEGER, etc.) | UNION ALL
JOINs: Inner, Left/Right Outer, Cross | Functional Indexes
UPSERT / DELETE | Date / Time Functions
Derived Tables | UDFs
GROUP BY, ORDER BY, HAVING | Multi Table Transactions
AVG, COUNT, MIN, MAX, SUM | SQL GRANT / REVOKE
Primary keys, NOT NULL constraints | Replication Management
CASE, COALESCE | Column Constraints and Defaults
VIEWs | OLAP, Cubing, Rollup
Secondary Indexes | UNION
Flexible Schema |
Legend: Current, Future, Phoenix 4.4

Looking forward - What Is Druid?
Druid is a distributed, real-time, column-oriented datastore
designed to quickly ingest and index large amounts of data
and make it available for real-time query.
Features:
• Streaming Data Ingestion
• Sub-Second Queries
• Merge Historical and Real-Time Data
• Approximate Computation

Druid’s Role in Scalable Data Warehousing
[Architecture diagram, by layer]
• UI: Superset UI (Fast Exploration), Builder UI, SQL BI Tools, MDX Tools
• Unified SQL and MDX Layer: HiveServer2 (Hive SQL), Thrift Server (SparkSQL), HiveServer2 (Fast SQL), MDX
• Core Platform: Hive, Druid (OLAP Indexes), Realtime Feeds (Kafka, Storm, etc.), S3 or HDFS
• Management: SmartSense, Ranger, Atlas, Ambari

Hive 2 with LLAP: 26x Performance Boost
[Chart: Query Time (s) (Lower is Better) and Speedup (x Factor); series: Hive 1 / Tez Time (s), Hive 2 / LLAP Time (s), Speedup (x Factor)]
Hive 2 with LLAP averages 26x faster than Hive 1

SQL on Hadoop: Investment Areas
Interactive Performance
• Caching in Flash / SSD
• Fast Analytics on Raw Text
• Materialized Views
SQL Compliance
• Comprehensive SQL:2011 Support
• SQL ACID
• SQL Standard MERGE
EDW Integrations
• Joint AtScale / Syncsort Roadmap
• OLAP Indexes with Druid

SQL on Hadoop Summary
Project | Strengths | Use Cases | Unique Capabilities
Apache Hive | Most Comprehensive SQL; Scale; Maturity | ETL Offload; Reporting; Large-Scale Aggregations | Robust Cost-Based Optimizer; Mature Ecosystem (BI, Backup, Security, Replication)
SparkSQL | In-Memory; Low Latency | Exploratory Analytics; Dashboards | Language-Integrated Query
Apache Phoenix | Real-Time Read/Write; Transactions | Dashboards; System-of-Engagement; Drill-Down / Drill-Up | Real-Time Read/Write

Scalable Data Warehousing on Hadoop: Overview
[Architecture diagram]
• Stages: Ingest and Store; ETL, Data Mining, Advanced Analytics; Interactive SQL, Reporting, OLAP
• Components: Kafka, NiFi, HDFS, Other ETL Tools, Syncsort DMX-h ETL, Spark, Hive (HPL/SQL, ACID), Hive LLAP, Druid (Future), HAWQ, AtScale, Zeppelin, Ambari Hive View, BI Tools, Reporting Tools
• Governance and security: Atlas (Governance and Lineage), Ranger (Advanced Security)

Thank You
https://www.surveymonkey.com/r/940amSQLHadoopBatch
