SQL-H a new way to enable SQL analytics

SQL-H: A New Way to Enable SQL Analytics on Hadoop
Sushil Thomas
June 2012

Outline

•  HCatalog primer
•  Aster primer
•  SQL-H definition and features
•  SQL-H example usage

Confidential and proprietary. Copyright © 2011 Teradata Corporation.

HCatalog Primer
•  HCatalog provides table management and storage
management for Apache Hadoop
-  Provides a shared schema and data type mechanism
-  Provides a table abstraction so that users need not be concerned
with where or how their data is stored
-  Provides interoperability across data processing tools such as Pig,
MapReduce, Streaming, and Hive

•  Uses Hive-like DDL commands. Supports tables, views, and
partitions.

•  Provides parallel load and store interfaces

•  Agnostic to file format of stored data
-  Currently supports RCFile, CSV text, JSON text, and SequenceFile


HCatalog Primer: Example Syntax

CREATE EXTERNAL TABLE apachelog (
  host STRING, identity STRING, user STRING,
  time STRING, request STRING, status STRING,
  size STRING, referer STRING, agent STRING)
ROW FORMAT
SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "([^]*) …")
STORED AS TEXTFILE
LOCATION 'hdfs://data/apachelogs';

Note: This is run via HCatalog interfaces to record the format of data
stored in HDFS for later use by Hive, Pig, etc. This is not run on the Aster
system.

HCatalog Primer: Read Flow (Hadoop Job Submission)

[Diagram: the Job Controller sends a table name and partitions to the HCatalog Server on the HCatalog Server Node, and receives splits in return.]


HCatalog Primer: Read Flow (Hadoop Job Execution)

[Diagram: on the processing nodes (running Hive, Pig or MR jobs), each map task reads one split of the source data and produces tuples.]
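
The two-phase read flow above (resolve a table name and partitions into splits, then read each split in parallel into tuples) can be sketched in a few lines of Python. This is only a toy model under stated assumptions: `STORAGE`, `plan_splits`, `read_split`, and `read_table` are hypothetical names invented for illustration, the sample rows are made up, and nothing here calls the real HCatalog interfaces.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for files in HDFS:
# (table, partition) -> list of splits, each split a list of raw lines.
STORAGE = {
    ("apachelog", "ds=2012-06-09"): [["a 200", "b 404"], ["c 200"]],
    ("apachelog", "ds=2012-06-10"): [["d 200"], ["e 500", "f 200"]],
}

def plan_splits(table, partitions):
    """Submission phase: turn a table name and partitions into splits."""
    return [
        (table, part, idx)
        for part in partitions
        for idx in range(len(STORAGE[(table, part)]))
    ]

def read_split(split):
    """Execution phase: one task turns one split into tuples."""
    table, part, idx = split
    return [tuple(line.split()) for line in STORAGE[(table, part)][idx]]

def read_table(table, partitions, workers=3):
    """Plan the splits once, then read them in parallel."""
    splits = plan_splits(table, partitions)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_split = list(pool.map(read_split, splits))  # keeps split order
    return [row for rows in per_split for row in rows]

rows = read_table("apachelog", ["ds=2012-06-09", "ds=2012-06-10"])
print(len(rows))  # 6 tuples read from 4 splits
```

The point of the split is that planning is a single metadata lookup, while execution has no shared state between tasks, so the readers can run on separate nodes.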


Aster Primer

[Diagram: Aster architecture. Labels shown: Queen Node, Parser, Optimizer, Executor, SQL-MapReduce, SQL Engine, Worker Nodes, ARC Engine and Data Partition (repeated per worker), Inter Cluster Express.]


Aster SQL-H

•  Direct access to HCatalog data within AsterDB
-  HCatalog tables available without duplicating DDL commands on
the Aster side

•  HCatalog tables are first class objects within AsterDB
-  Full support for all SQL operators

•  We use the HCatalog interfaces to read tuples in parallel on all
data nodes


Aster Reads From HCatalog (Planning)

[Diagram: query planning phase. The Aster Optimizer sends a table name and partitions to the HCatalog Server on the HCatalog Server Node, and receives splits in return.]


Aster Reads From HCatalog (Execution)

[Diagram: execution phase on a single worker node. Three parallel paths, each labeled: HDFS Data Nodes, Split, Tuples, ARC Engine, Data Partition.]


Features – Simple and Comprehensive Support

•  Interactions with HCatalog master server and HDFS only
-  No MapReduce slots used
-  Hadoop system can be used for other activity simultaneously

•  Aster runs native HCatalog InputReader code for translating
HCatalog table names into input splits, and then getting data
from input splits
-  No impedance mismatch between the two systems
-  Everything supported by HCatalog interfaces is supported in Aster

•  Changes made on HCatalog are reflected immediately on the
Aster side
-  New tables, modified schemas, new partitions etc. are available
immediately. No extra steps required.


Features - Usability

•  Full integration with BI tools
-  Tableau, MicroStrategy (MSTR), etc. now work seamlessly with data in Hadoop

•  Data in Hadoop can now be joined with relational data in your
Aster system
-  Previously, using data from multiple systems involved complex ETL
tasks

•  Full SQL support
-  HCatalog table data can be inserted into a SQL flow just like native
table data

•  If desired, provides a load pipeline into Aster from Hadoop


Features – Teradata Aster Analytical Foundation

•  Full suite of Aster Analytical Foundation functions available for
data in Hadoop
-  Time-Series/Path Analysis
-  Statistical Analysis
-  Relational Analysis
-  Text Analysis
-  Clustering Analysis
-  Data Transformations

•  Makes users productive faster

•  Spend time analyzing data, not building functionality and tools


Features - Performance

•  Partition pruning is transparently supported
-  select * from hadoop_weblogs where ds='2012-06-10'
•  If "hadoop_weblogs" is partitioned on 'ds', then this query will only
scan data in this particular partition

•  Performance Notes
-  Data transfer is required, but the network may not be your
bottleneck. Time taken for the initial data read may be a small part
of overall query performance
-  Aster’s native SQL execution engine is a lot faster than Hive’s MR
based execution engine
-  As queries get complex, performance advantage increases
-  If required, the impact on the Hadoop system and on network bandwidth
usage can be tuned down
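
The partition pruning bullet on this slide can be illustrated with a small Python sketch. It assumes a hypothetical layout in which each value of `ds` maps to its own partition; `PARTITIONS` and `scan` are invented for illustration and the rows are made-up sample data, so this shows the idea rather than how Aster or HCatalog implement it.

```python
# Hypothetical partition layout: value of 'ds' -> rows stored in that partition.
PARTITIONS = {
    "2012-06-09": [{"userid": 1, "page": "/home"}],
    "2012-06-10": [{"userid": 2, "page": "/cart"}, {"userid": 3, "page": "/home"}],
    "2012-06-11": [{"userid": 4, "page": "/checkout"}],
}

def scan(partitions, ds=None):
    """Return (rows, partitions_scanned).

    With a filter on the partition column, only the matching partition
    is opened; without one, every partition is read.
    """
    if ds is None:
        selected = list(partitions)
    else:
        selected = [p for p in partitions if p == ds]
    rows = [dict(row, ds=p) for p in selected for row in partitions[p]]
    return rows, selected

pruned_rows, scanned = scan(PARTITIONS, ds="2012-06-10")
print(scanned)           # ['2012-06-10']
print(len(pruned_rows))  # 2
```

Because the filter is applied to partition names before any rows are read, the other two partitions are never touched.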


Example SQL Syntax – Remote Catalog
beehive=> extl host=hcatalog1.asterdata.com
List of databases
Name
----------
prod
testdb
(2 rows)

beehive=> extd host=hcatalog1.asterdata.com database=prod
List of tables
Name
---------
apachelogs
movieratings
(2 rows)


Example SQL Syntax – HCatalog Data Access

SELECT * FROM load_from_hcatalog(
  ON mr_driver
  server('hcatalog1.asterdata.com')
  dbname('prod')
  tablename('student')
  columns('userid', 'movieid', 'rating'));

CREATE VIEW hadoop_weblogs AS
SELECT * FROM load_from_hcatalog(
  ON mr_driver
  . . .);


Example SQL Syntax – Data Load From HCatalog

CREATE TABLE aster_weblogs DISTRIBUTE BY HASH(userid) AS
SELECT * FROM hadoop_weblogs;


Example SQL Join Syntax – Complex Queries

-- Join example

select t1.name, t2.page_url, t1.price
from
  aster_product t1,
  hadoop_weblogs t2
where t1.product_id = t2.product_id;
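
To see the shape of the result this join produces, the same query can be run against two ordinary tables in an in-memory SQLite database. This is only a stand-in under stated assumptions: SQLite is not Aster, `hadoop_weblogs` here is a plain table rather than a view over HCatalog, and the column types and sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Stand-in for a native relational table.
    CREATE TABLE aster_product (product_id INTEGER, name TEXT, price REAL);
    -- Stand-in for the view over Hadoop data (a plain table here).
    CREATE TABLE hadoop_weblogs (product_id INTEGER, page_url TEXT);

    INSERT INTO aster_product VALUES (1, 'widget', 9.5);
    INSERT INTO aster_product VALUES (2, 'gadget', 20.0);
    INSERT INTO hadoop_weblogs VALUES (1, '/p/widget');
    INSERT INTO hadoop_weblogs VALUES (1, '/p/widget?ref=ad');
    INSERT INTO hadoop_weblogs VALUES (3, '/p/unknown');
""")

# Same join as on the slide, with an ORDER BY added for a stable result.
join_rows = conn.execute("""
    SELECT t1.name, t2.page_url, t1.price
    FROM aster_product t1, hadoop_weblogs t2
    WHERE t1.product_id = t2.product_id
    ORDER BY t2.page_url
""").fetchall()

for row in join_rows:
    print(row)
```

Only log rows whose `product_id` exists in the product table survive the join, so the unmatched log entry and the product with no page views are both dropped.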


Example SQL-MapReduce Syntax
-- Find all the sessions with a particular page visit pattern where
-- at least 3 products have been checked out during the session

SELECT * FROM npath(
  ON hadoop_weblogs
  PARTITION BY sessionid ORDER BY clicktime
  MODE(nonoverlapping)
  PATTERN('h.h*.d*.c{3,}.d')
  SYMBOLS(pagetype = 'home' as h, pagetype = 'checkout' as c,
          pagetype <> 'home' and pagetype <> 'checkout' as d)
  RESULT(first(sessionid of c) as sessionid,
         max_choose(productprice, productname of c) as most_expensive,
         max(productprice of c) as max_price,
         min_choose(productprice, productname of c) as least_expensive,
         min(productprice of c) as min_price))
ORDER BY sessionid;
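
The pattern in the query above can be approximated in plain Python to show what is being matched. The sketch below maps each click to one of the symbols `h`, `c`, or `d`, concatenates them per session in click order, and applies an ordinary regular expression that mirrors `h.h*.d*.c{3,}.d`. It is an assumption-laden approximation: it treats the `.` in the pattern as simple sequencing, it only reports which sessions match, it does not compute the `RESULT` aggregates, and the click data is invented.

```python
import re
from itertools import groupby

def symbol(pagetype):
    """Map a page type to the symbol used in the pattern."""
    if pagetype == "home":
        return "h"
    if pagetype == "checkout":
        return "c"
    return "d"  # anything that is neither 'home' nor 'checkout'

# Regular-expression equivalent of 'h.h*.d*.c{3,}.d' with '.' read as sequencing.
PATTERN = re.compile(r"hh*d*c{3,}d")

def matching_sessions(clicks):
    """clicks: iterable of (sessionid, clicktime, pagetype) tuples."""
    ordered = sorted(clicks, key=lambda c: (c[0], c[1]))  # partition, then order
    matches = []
    for sessionid, group in groupby(ordered, key=lambda c: c[0]):
        path = "".join(symbol(c[2]) for c in group)
        if PATTERN.search(path):
            matches.append(sessionid)
    return matches

# Made-up sample clicks: session 1 has three checkouts, session 2 only two.
CLICKS = [
    (1, 1, "home"), (1, 2, "product"), (1, 3, "checkout"),
    (1, 4, "checkout"), (1, 5, "checkout"), (1, 6, "product"),
    (2, 1, "home"), (2, 2, "checkout"), (2, 3, "checkout"), (2, 4, "product"),
]
print(matching_sessions(CLICKS))  # [1]
```

Session 1 produces the symbol path `hdcccd`, which matches; session 2 produces `hccd`, which fails the requirement of at least three checkouts.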


Example BI Tool Usage – Path Analysis on Data Stored in Aster and Hadoop

