SQL-H a new way to enable SQL analytics
- 1. SQL-H: A New Way to Enable SQL Analytics on Hadoop
Sushil Thomas
June 2012
- 2. Outline
• HCatalog primer
• Aster primer
• SQL-H definition and features
• SQL-H example usage
2 Confidential and proprietary. Copyright © 2011 Teradata Corporation.
- 3. HCatalog Primer
• HCatalog provides table management and storage
management for Apache Hadoop
- Provides a shared schema and data type mechanism
- Provides a table abstraction so that users need not be concerned
with where or how their data is stored
- Provides interoperability across data processing tools such as Pig,
MapReduce, Streaming, and Hive
• Uses Hive-like DDL commands. Supports tables, views,
partitions.
• Provides parallel load and store interfaces
• Agnostic to file format of stored data
- Currently supports RCFile, CSV text, JSON text, and SequenceFile
- 4. HCatalog Primer: Example Syntax
CREATE EXTERNAL TABLE apachelog (
  host STRING, identity STRING, user STRING,
  time STRING, request STRING, status STRING,
  size STRING, referer STRING, agent STRING)
ROW FORMAT
  SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
  WITH SERDEPROPERTIES ("input.regex" = "([^ ]*) …")
STORED AS TEXTFILE
LOCATION 'hdfs://data/apachelogs';

Note: This is run via HCatalog interfaces to record the format of data
stored in HDFS for later use by Hive, Pig, etc. This is not run on the Aster
system.
- 5. HCatalog Primer: Read Flow (Hadoop Job Submission)
[Diagram: the Job Controller sends the table name and partitions to the HCatalog Server on the HCatalog server node, which returns splits.]
- 6. HCatalog Primer: Read Flow (Hadoop Job Execution)
[Diagram: processing nodes (running Hive, Pig or MR jobs). Each map task reads one split of the source data and produces tuples.]
- 7. Aster Primer
[Diagram labels: Queen Node (Parser, Optimizer, Executor); Worker Nodes (ARC Engine, Data Partition); SQL-MapReduce; SQL Engine; Inter Cluster Express.]
- 8. Aster SQL-H
• Direct access to HCatalog data within AsterDB
- HCatalog tables available without duplicating DDL commands on
the Aster side
• HCatalog tables are first class objects within AsterDB
- Full support for all SQL operators
• We use the HCatalog interfaces to read tuples in parallel on all
data nodes
- 9. Aster Reads From HCatalog (Planning)
[Diagram: query planning phase. The Aster Optimizer sends the table name and partitions to the HCatalog Server on the HCatalog server node, which returns splits.]
- 10. Aster Reads From HCatalog (Execution)
[Diagram: execution phase on a single worker node. Each ARC engine reads splits from the HDFS data nodes and passes tuples to its data partition.]
- 11. Features – Simple and Comprehensive Support
• Interactions with HCatalog master server and HDFS only
- No MapReduce slots used
- Hadoop system can be used for other activity simultaneously
• Aster runs native HCatalog InputReader code for translating
HCatalog table names into input splits, and then getting data
from input splits
- No impedance mismatch between the two systems
- Everything supported by HCatalog interfaces is supported in Aster
• Changes made on HCatalog are reflected immediately on the
Aster side
- New tables, modified schemas, new partitions etc. are available
immediately. No extra steps required.
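
The planning and read steps described on this slide can be pictured with a small sketch. It is illustrative only: `Split`, `plan_splits`, and `assign_splits` are hypothetical names, the file paths and sizes are made up, and none of this is the HCatalog or Aster API. It assumes a table resolves to a set of files, that each file is cut into fixed-size byte ranges, and that splits are dealt out round-robin to workers.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Split:
    path: str    # file the split belongs to
    offset: int  # starting byte within the file
    length: int  # number of bytes in the split


def plan_splits(file_sizes: Dict[str, int], split_size: int) -> List[Split]:
    """Cut each file into contiguous byte ranges of at most split_size bytes."""
    splits = []
    for path, size in sorted(file_sizes.items()):
        offset = 0
        while offset < size:
            length = min(split_size, size - offset)
            splits.append(Split(path, offset, length))
            offset += length
    return splits


def assign_splits(splits: List[Split], n_workers: int) -> List[List[Split]]:
    """Deal splits out round-robin so that all workers can read in parallel."""
    assignment: List[List[Split]] = [[] for _ in range(n_workers)]
    for i, split in enumerate(splits):
        assignment[i % n_workers].append(split)
    return assignment


# Hypothetical files behind one table (sizes in bytes).
files = {"/data/apachelogs/part-0": 250, "/data/apachelogs/part-1": 100}
splits = plan_splits(files, split_size=100)
per_worker = assign_splits(splits, n_workers=2)
```

Because the readers pull splits directly, a scheme like this needs only the catalog lookup and the file reads, which matches the slide's point that no MapReduce slots are used.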
- 12. Features - Usability
• Full integration with BI tools
- Tableau, MicroStrategy (MSTR), etc. now work seamlessly with data in Hadoop
• Data in Hadoop can now be joined with relational data in your
Aster system
- Previously, using data from multiple systems involved complex ETL
tasks
• Full SQL support
- HCatalog table data can be inserted into a SQL flow just like native
table data
• If desired, provides a load pipeline into Aster from Hadoop
- 13. Features – Teradata Aster Analytical Foundation
• Full suite of Aster Analytical Foundation functions available for
data in Hadoop
- Time-Series/Path Analysis
- Statistical Analysis
- Relational Analysis
- Text Analysis
- Clustering Analysis
- Data Transformations
• Makes users productive faster
• Spend time analyzing data, not building functionality and tools
- 14. Features - Performance
• Partition pruning is transparently supported
- select * from hadoop_weblogs where ds='2012-06-10'
• If "hadoop_weblogs" is partitioned on 'ds', then this query will only
scan data in this particular partition
• Performance Notes
- Data transfer is required, but the network may not be your
bottleneck. Time taken for the initial data read may be a small part
of overall query performance
- Aster's native SQL execution engine is a lot faster than Hive's
MapReduce-based execution engine
- As queries get more complex, the performance advantage increases
- If required, the impact on the Hadoop system and network bandwidth
usage can be tuned down
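
A minimal sketch of the partition pruning idea, assuming partitions are tracked as a mapping from partition value to the files stored under it. The function name `files_to_scan` and the file names are hypothetical and are not part of Aster or HCatalog.

```python
from typing import Dict, List


def files_to_scan(
    partitions: Dict[str, List[str]],
    partition_column: str,
    filter_column: str,
    filter_value: str,
) -> List[str]:
    """Return only the files that an equality filter can possibly match."""
    if filter_column == partition_column:
        # The filter is on the partition column: read one partition at most.
        return list(partitions.get(filter_value, []))
    # The filter is on an ordinary column: every partition must be scanned.
    return [f for files in partitions.values() for f in files]


# Hypothetical layout of a table partitioned on 'ds'.
partitions = {
    "2012-06-09": ["ds=2012-06-09/part-0", "ds=2012-06-09/part-1"],
    "2012-06-10": ["ds=2012-06-10/part-0"],
}
pruned = files_to_scan(partitions, "ds", "ds", "2012-06-10")
full = files_to_scan(partitions, "ds", "userid", "u42")
```

With the filter on `ds`, `pruned` holds a single file, while a filter on a non-partition column leaves all three files in `full`.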
- 15. Example SQL Syntax – Remote Catalog
beehive=> extl host=hcatalog1.asterdata.com
List of databases
  Name
----------
 prod
 testdb
(2 rows)

beehive=> extd host=hcatalog1.asterdata.com database=prod
List of tables
     Name
--------------
 apachelogs
 movieratings
(2 rows)
- 16. Example SQL Syntax – Remote Catalog
beehive=> extd host=hcatalog1.asterdata.com database=prod table=movieratings
     Table "prod"."movieratings"
  Name   |  Type  | Partitioned Column
---------+--------+--------------------
 userid  | string | f
 movieid | int    | f
 rating  | double | f
 ds      | string | t
(4 rows)
- 17. Example SQL Syntax – HCatalog Data Access
SELECT * FROM load_from_hcatalog(
  ON mr_driver
  server('hcatalog1.asterdata.com')
  dbname('prod')
  tablename('student')
  columns('userid', 'movieid', 'rating'));

CREATE VIEW hadoop_weblogs AS
SELECT * FROM load_from_hcatalog(
  ON mr_driver
  . . .);
- 18. Example SQL Syntax – Data Load From HCatalog
CREATE TABLE aster_weblogs DISTRIBUTE BY HASH(userid) AS
SELECT * FROM hadoop_weblogs;
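
To picture what distributing by a hash of `userid` means, here is a rough sketch. It assumes rows are routed to a partition by hashing the distribution key. CRC32 is used only as a stable stand-in hash, not as Aster's actual hash function, and `partition_for` and `distribute` are hypothetical helper names.

```python
import zlib
from collections import defaultdict
from typing import Dict, List


def partition_for(key: str, n_partitions: int) -> int:
    """Map a distribution key to a partition number with a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % n_partitions


def distribute(rows: List[dict], key: str, n_partitions: int) -> Dict[int, List[dict]]:
    """Group rows by the partition that their distribution key hashes to."""
    placed = defaultdict(list)
    for row in rows:
        placed[partition_for(row[key], n_partitions)].append(row)
    return dict(placed)


# Made-up rows standing in for the result of SELECT * FROM hadoop_weblogs.
rows = [
    {"userid": "u1", "page_url": "/home"},
    {"userid": "u2", "page_url": "/home"},
    {"userid": "u1", "page_url": "/checkout"},
    {"userid": "u3", "page_url": "/search"},
]
placed = distribute(rows, "userid", n_partitions=4)
```

All rows with the same `userid` end up in the same partition, which is what allows later joins and aggregations on that key to run locally on each worker.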
- 19. Example SQL Syntax – Partition Pruning
beehive=> extd host=hcatalog1.asterdata.com database=prod table=movieratings
     Table "prod"."movieratings"
  Name   |  Type  | Partitioned Column
---------+--------+--------------------
 userid  | string | f
 movieid | int    | f
 rating  | double | f
 ds      | string | t
(4 rows)

-- Because 'ds' is a partitioned column, the query below
-- will only pull in data from the '2011-06-10' partition
SELECT * FROM hadoop_movieratings
WHERE ds='2011-06-10';
- 20. Example SQL Join Syntax – Complex Queries
-- Join example

select t1.name, t2.page_url, t1.price
from
  aster_product t1,
  hadoop_weblogs t2
where t1.product_id = t2.product_id;
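
The shape of this join can be tried locally with SQLite as a stand-in, under clearly stated assumptions: SQLite is not Aster, the view `hadoop_weblogs` here wraps an ordinary local table (`hadoop_weblogs_src`, a hypothetical name) rather than `load_from_hcatalog`, and all rows are made up. The point is only that the query treats the view exactly like a native table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE aster_product (product_id INTEGER, name TEXT, price REAL);
    -- Stand-in for the remote data; in the slides this would be HCatalog data.
    CREATE TABLE hadoop_weblogs_src (product_id INTEGER, page_url TEXT);
    CREATE VIEW hadoop_weblogs AS SELECT * FROM hadoop_weblogs_src;

    INSERT INTO aster_product VALUES
        (1, 'lamp', 20.0), (2, 'desk', 150.0), (3, 'chair', 45.0);
    INSERT INTO hadoop_weblogs_src VALUES
        (1, '/p/lamp'), (2, '/p/desk'), (2, '/p/desk/reviews'), (9, '/p/unknown');
    """
)

# Same join as on the slide, with an ORDER BY added so the output is stable.
rows = conn.execute(
    """
    SELECT t1.name, t2.page_url, t1.price
    FROM aster_product t1, hadoop_weblogs t2
    WHERE t1.product_id = t2.product_id
    ORDER BY t2.page_url
    """
).fetchall()
conn.close()
```

The log row for product 9 has no matching product and drops out, as expected for an inner join.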
- 21. Example SQL-MapReduce Syntax
-- Find all the sessions with a particular page visit pattern where
-- at least 3 products have been checked out during the session

SELECT * FROM npath(
  ON hadoop_weblogs
  PARTITION BY sessionid ORDER BY clicktime
  MODE(nonoverlapping)
  PATTERN('h.h*.d*.c{3,}.d')
  SYMBOLS(pagetype = 'home' as h, pagetype = 'checkout' as c,
          pagetype <> 'home' and pagetype <> 'checkout' as d)
  RESULT(first(sessionid of c) as sessionid,
         max_choose(productprice, productname of c) as most_expensive,
         max(productprice of c) as max_price,
         min_choose(productprice, productname of c) as least_expensive,
         min(productprice of c) as min_price))
ORDER BY sessionid;
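
One way to read the pattern in this query is as a regular expression over one symbol per click. The sketch below is not how nPath is implemented. It assumes the three symbols are mutually exclusive (true for the predicates above), translates `h.h*.d*.c{3,}.d` to the regex `hh*d*c{3,}d`, and computes only the `max_price` and `min_price` outputs. The function names and sample clicks are hypothetical.

```python
import re
from itertools import groupby
from operator import itemgetter
from typing import List


def symbol(pagetype: str) -> str:
    """Map a click to its symbol, mirroring the SYMBOLS clause."""
    if pagetype == "home":
        return "h"
    if pagetype == "checkout":
        return "c"
    return "d"


# 'h.h*.d*.c{3,}.d' with the '.' sequence operator removed.
PATTERN = re.compile(r"hh*d*c{3,}d")


def matching_sessions(clicks: List[dict]) -> List[dict]:
    """Find non-overlapping pattern matches per session and summarize prices."""
    ordered = sorted(clicks, key=itemgetter("sessionid", "clicktime"))
    results = []
    for sessionid, group in groupby(ordered, key=itemgetter("sessionid")):
        rows = list(group)
        symbols = "".join(symbol(r["pagetype"]) for r in rows)
        # finditer yields non-overlapping matches, like MODE(nonoverlapping).
        for m in PATTERN.finditer(symbols):
            matched = rows[m.start():m.end()]
            prices = [r["productprice"] for r in matched if symbol(r["pagetype"]) == "c"]
            results.append(
                {"sessionid": sessionid, "max_price": max(prices), "min_price": min(prices)}
            )
    return results


def click(sessionid, clicktime, pagetype, price=0):
    return {"sessionid": sessionid, "clicktime": clicktime,
            "pagetype": pagetype, "productprice": price}


clicks = [
    click(1, 1, "home"), click(1, 2, "detail"), click(1, 3, "checkout", 5),
    click(1, 4, "checkout", 9), click(1, 5, "checkout", 7), click(1, 6, "detail"),
    click(2, 1, "home"), click(2, 2, "checkout", 3),
    click(2, 3, "checkout", 4), click(2, 4, "detail"),
]
found = matching_sessions(clicks)
```

Session 1 has three checkouts between the home page and a later page, so it matches. Session 2 has only two and does not.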
- 22. Example BI Tool Usage – Path Analysis on Data Stored in Aster and Hadoop
- 23. Example BI Tool Usage – Path Analysis on Data Stored in Aster and Hadoop