SQL-H a new way to enable SQL analytics

Presentation Transcript

  • SQL-H: A New Way to Enable SQL Analytics on Hadoop
    Sushil Thomas
    June 2012
  • Outline
    • HCatalog primer
    • Aster primer
    • SQL-H definition and features
    • SQL-H example usage
    Confidential and proprietary. Copyright © 2011 Teradata Corporation.
  • HCatalog Primer
    • HCatalog provides table management and storage management for Apache Hadoop
      - Provides a shared schema and data type mechanism
      - Provides a table abstraction so that users need not be concerned with where or how their data is stored
      - Provides interoperability across data processing tools such as Pig, MapReduce, Streaming, and Hive
    • Uses Hive-like DDL commands. Supports tables, views, and partitions.
    • Provides parallel load and store interfaces
    • Agnostic to the file format of stored data
      - Currently supports RCFile, CSV text, JSON text, and SequenceFile
  • HCatalog Primer: Example Syntax
    CREATE EXTERNAL TABLE apachelog (
      host STRING, identity STRING, user STRING,
      time STRING, request STRING, status STRING,
      size STRING, referer STRING, agent STRING)
    ROW FORMAT
    SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    WITH SERDEPROPERTIES ("input.regex" = "([^]*) …")
    STORED AS TEXTFILE
    LOCATION 'hdfs://data/apachelogs';

    Note: This is run via HCatalog interfaces to record the format of data stored in HDFS for later use by Hive, Pig, etc. This is not run on the Aster system.
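A regex-based SerDe of the kind named in the DDL above applies one regular expression to each line and maps the capture groups, in order, to the declared columns. The following is a minimal Python sketch of that idea only. The pattern and the sample log line are hypothetical illustrations; they are not the `input.regex` value from the slide, which is elided there.

```python
import re

# Column names taken from the slide's CREATE EXTERNAL TABLE statement.
COLUMNS = ["host", "identity", "user", "time", "request",
           "status", "size", "referer", "agent"]

# Hypothetical pattern for a combined-format access log line.
# This is NOT the slide's elided input.regex; it only illustrates the idea.
LOG_PATTERN = re.compile(
    r'(\S+) (\S+) (\S+) \[([^\]]*)\] "([^"]*)" (\S+) (\S+) "([^"]*)" "([^"]*)"'
)


def parse_line(line):
    """Map each capture group to the column in the same position."""
    match = LOG_PATTERN.fullmatch(line.strip())
    if match is None:
        return None  # the line does not match the pattern
    return dict(zip(COLUMNS, match.groups()))


# Made-up sample line, for illustration only.
sample = ('10.0.0.1 - alice [10/Jun/2012:12:00:00 -0700] '
          '"GET /index.html HTTP/1.1" 200 512 "-" "Mozilla/5.0"')
row = parse_line(sample)
```

Every column comes back as a string, which matches the all-STRING schema in the DDL; any type conversion would happen later in the query.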
  • HCatalog Primer: Read Flow (Hadoop Job Submission)
    [Diagram: the job controller node sends a table name and partitions to the HCatalog server; the HCatalog server returns splits.]
  • HCatalog Primer: Read Flow (Hadoop Job Execution)
    [Diagram: on the processing nodes (running Hive, Pig, or MR jobs), each map task reads a split of the source data and produces tuples.]
  • Aster Primer
    [Architecture diagram. Labels: Queen Node; SQL-MapReduce; Parser; Optimizer; Executor; SQL Engine; Worker Nodes; ARC; Data Engine; Partition; Inter Cluster Express.]
  • Aster SQL-H
    • Direct access to HCatalog data within AsterDB
      - HCatalog tables available without duplicating DDL commands on the Aster side
    • HCatalog tables are first-class objects within AsterDB
      - Full support for all SQL operators
    • We use the HCatalog interfaces to read tuples in parallel on all data nodes
  • Aster Reads From HCatalog (Planning)
    [Diagram, query planning phase: the Aster optimizer node sends a table name and partitions to the HCatalog server; the HCatalog server returns splits.]
  • Aster Reads From HCatalog (Execution)
    [Diagram, execution phase on a single worker node: splits are read from the HDFS data nodes through ARC, and the resulting tuples flow to the data engine partitions.]
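The two phases above (resolve a table name and partitions to splits, then read the splits in parallel on the workers) can be sketched in a few lines of Python. This is a toy model under stated assumptions: the in-memory `CATALOG`, the function names, and the round-robin assignment are all hypothetical, and the slides do not say how Aster actually assigns splits to workers.

```python
from collections import defaultdict

# Hypothetical catalog: table -> partition value -> list of splits.
# A split is modelled here as a plain list of tuples.
CATALOG = {
    "movieratings": {
        "2011-06-09": [[("u1", 10, 4.0)], [("u2", 11, 3.5)]],
        "2011-06-10": [[("u3", 10, 5.0)], [("u1", 12, 2.0)], [("u4", 11, 4.5)]],
    }
}


def plan_splits(table, partitions=None):
    """Planning phase: resolve a table name (and optional partitions) to splits."""
    parts = CATALOG[table]
    wanted = list(parts) if partitions is None else partitions
    return [split for p in wanted for split in parts[p]]


def assign_splits(splits, num_workers):
    """Deal splits out round-robin (an assumption) so workers read in parallel."""
    assignment = defaultdict(list)
    for i, split in enumerate(splits):
        assignment[i % num_workers].append(split)
    return assignment


def read_worker(splits):
    """Execution phase on one worker: turn its splits into tuples."""
    return [row for split in splits for row in split]


splits = plan_splits("movieratings")
assignment = assign_splits(splits, num_workers=2)
tuples_by_worker = {w: read_worker(s) for w, s in assignment.items()}
```

The point of the sketch is the division of labour: only the planning step talks to the catalog, and each worker then reads its own splits independently.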
  • Features – Simple and Comprehensive Support
    • Interactions with HCatalog master server and HDFS only
      - No MapReduce slots used
      - Hadoop system can be used for other activity simultaneously
    • Aster runs native HCatalog InputReader code for translating HCatalog table names into input splits, and then getting data from input splits
      - No impedance mismatch between the two systems
      - Everything supported by HCatalog interfaces is supported in Aster
    • Changes made on HCatalog are reflected immediately on the Aster side
      - New tables, modified schemas, new partitions, etc. are available immediately. No extra steps required.
  • Features - Usability
    • Full integration with BI tools
      - Tableau, MSTR, etc. now work with data in Hadoop seamlessly
    • Data in Hadoop can now be joined with relational data in your Aster system
      - Previously, using data from multiple systems involved complex ETL tasks
    • Full SQL support
      - HCatalog table data can be inserted into a SQL flow just like native table data
    • If desired, provides a load pipeline into Aster from Hadoop
  • Features – Teradata Aster Analytical Foundation
    • Full suite of Aster Analytical Foundation functions available for data in Hadoop
      - Time-Series/Path Analysis
      - Statistical Analysis
      - Relational Analysis
      - Text Analysis
      - Clustering Analysis
      - Data Transformations
    • Makes users productive faster
    • Spend time analyzing data, not building functionality and tools
  • Features - Performance
    • Partition pruning is transparently supported
      - select * from hadoop_weblogs where ds='2012-06-10'
        • If "hadoop_weblogs" is partitioned on 'ds', then this command will only scan data in this particular partition
    • Performance Notes
      - Data transfer is required, but the network may not be your bottleneck. Time taken for the initial data read may be a small part of overall query performance
      - Aster's native SQL execution engine is a lot faster than Hive's MR-based execution engine
      - As queries get complex, the performance advantage increases
      - If required, impact on the Hadoop system and network bandwidth usage can be tuned down
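Partition pruning, as described above, means that an equality predicate on the partition column selects which partitions to read before any rows are scanned. A minimal Python sketch of the idea follows; the `PARTITIONS` layout, the row contents, and the `scan` helper are hypothetical and only mirror the `ds` example from the slide.

```python
# Hypothetical layout: one entry (a list of rows) per value of the 'ds' column.
PARTITIONS = {
    "2012-06-09": [{"userid": "u1", "url": "/a"}],
    "2012-06-10": [{"userid": "u2", "url": "/b"}, {"userid": "u3", "url": "/c"}],
    "2012-06-11": [{"userid": "u4", "url": "/d"}],
}


def scan(partitions, ds=None):
    """Return (rows, partitions_read).

    With a predicate on the partition column, partitions are selected
    first, so rows in the other partitions are never touched.
    """
    selected = list(partitions) if ds is None else [p for p in partitions if p == ds]
    rows = []
    for p in selected:
        for row in partitions[p]:
            rows.append({**row, "ds": p})
    return rows, selected


# Equivalent in spirit to: select * from hadoop_weblogs where ds='2012-06-10'
rows, partitions_read = scan(PARTITIONS, ds="2012-06-10")
```

Without the `ds` argument the same call reads every partition, which is the full-scan case the slide contrasts against.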
  • Example SQL Syntax – Remote Catalog
    beehive=> extl host=hcatalog1.asterdata.com
    List of databases
       Name
    ----------
     prod
     testdb
    (2 rows)

    beehive=> extd host=hcatalog1.asterdata.com database=prod
    List of tables
         Name
    --------------
     apachelogs
     movieratings
    (2 rows)
  • Example SQL Syntax – Remote Catalog
    beehive=> extd host=hcatalog1.asterdata.com database=prod table=movieratings
    Table "prod"."movieratings"
      Name   |  Type  | Partitioned Column
    ---------+--------+--------------------
     userid  | string | f
     movieid | int    | f
     rating  | double | f
     ds      | string | t
    (4 rows)
  • Example SQL Syntax – HCatalog Data Access
    SELECT * FROM load_from_hcatalog(
        ON mr_driver
        server('hcatalog1.asterdata.com')
        dbname('prod')
        tablename('student')
        columns('userid', 'movieid', 'rating'));

    CREATE VIEW hadoop_weblogs AS
      SELECT * FROM load_from_hcatalog(
        ON mr_driver
        . . .);
  • Example SQL Syntax – Data Load From HCatalog
    CREATE TABLE aster_weblogs DISTRIBUTE BY HASH(userid) AS
      SELECT * FROM hadoop_weblogs;
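`DISTRIBUTE BY HASH(userid)` in the statement above places each loaded row on a partition chosen by hashing its `userid`, so all rows for one user end up together. The Python sketch below illustrates that placement rule only. The use of CRC32, the partition count, and the sample rows are assumptions; the slides do not state which hash function Aster uses.

```python
import zlib


def partition_for(key, num_partitions):
    """Pick a partition from a stable hash of the distribution key.

    CRC32 is an assumption chosen for repeatability, not Aster's actual hash.
    """
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def distribute(rows, key, num_partitions):
    """Place every row on the partition selected by hashing row[key]."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[partition_for(row[key], num_partitions)].append(row)
    return partitions


# Made-up rows standing in for the contents of hadoop_weblogs.
rows = [
    {"userid": "u1", "page_url": "/home"},
    {"userid": "u2", "page_url": "/cart"},
    {"userid": "u1", "page_url": "/checkout"},
]
parts = distribute(rows, "userid", num_partitions=4)
```

Because equal keys always hash to the same partition, later per-user operations (grouping, or joins on `userid`) can run locally on each partition without moving rows.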
  • Example SQL Syntax – Partition Pruning
    beehive=> extd host=hcatalog1.asterdata.com database=prod table=movieratings
    Table "prod"."movieratings"
      Name   |  Type  | Partitioned Column
    ---------+--------+--------------------
     userid  | string | f
     movieid | int    | f
     rating  | double | f
     ds      | string | t
    (4 rows)

    -- Because 'ds' is a partitioned column, the query below
    -- will only pull in data from the '2011-06-10' partition
    SELECT * FROM hadoop_movieratings
      WHERE ds='2011-06-10';
  • Example SQL Join Syntax – Complex Queries
    -- Join example
    select t1.name, t2.page_url, t1.price
    from
      aster_product t1,
      hadoop_weblogs t2
    where t1.product_id=t2.product_id;
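The join above pairs rows from a native Aster table with rows read from Hadoop on `product_id`. The Python sketch below shows what that query computes, using a simple hash join. The sample rows are made up, and the choice of a hash join is an assumption for illustration; the slides do not say which join algorithm Aster picks.

```python
# Made-up rows standing in for the native table and the HCatalog-backed view.
aster_product = [
    {"product_id": 1, "name": "widget", "price": 9.99},
    {"product_id": 2, "name": "gadget", "price": 24.50},
]
hadoop_weblogs = [
    {"product_id": 2, "page_url": "/p/gadget"},
    {"product_id": 1, "page_url": "/p/widget"},
    {"product_id": 3, "page_url": "/p/unknown"},
]


def hash_join(left, right, key):
    """Inner equi-join: build a hash table on the left input, probe with the right."""
    index = {}
    for row in left:
        index.setdefault(row[key], []).append(row)
    for r in right:
        for l in index.get(r[key], []):
            # Same projection as the slide: t1.name, t2.page_url, t1.price
            yield {"name": l["name"], "page_url": r["page_url"], "price": l["price"]}


result = list(hash_join(aster_product, hadoop_weblogs, "product_id"))
```

The weblog row for `product_id` 3 has no matching product and is dropped, as in any inner join.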
  • Example SQL-MapReduce Syntax
    -- Find all the sessions with a particular page visit pattern where
    -- at least 3 products have been checked out during the session
    SELECT * FROM npath(
      ON hadoop_weblogs
      PARTITION BY sessionid ORDER BY clicktime
      MODE(nonoverlapping)
      PATTERN('h.h*.d*.c{3,}.d')
      SYMBOLS(pagetype = 'home' as h, pagetype='checkout' as c,
              pagetype<>'home' and pagetype<>'checkout' as d)
      RESULT(first(sessionid of c) as sessionid,
             max_choose(productprice, productname of c) as most_expensive,
             max(productprice of c) as max_price,
             min_choose(productprice, productname of c) as least_expensive,
             min(productprice of c) as min_price))
    ORDER BY sessionid;
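The nPath query above can be approximated in Python by encoding each click in a session as a one-letter symbol and running an ordinary regular expression over the resulting string. This is a rough sketch of the idea, not Aster's implementation: it assumes the `.` in the nPath pattern means "followed by" (so the pattern becomes the regex `hh*d*c{3,}d`), it relies on `re.finditer` to yield non-overlapping matches, and the sample clicks, the `product` page type, and the helper names are hypothetical.

```python
import re
from itertools import groupby
from operator import itemgetter

# 'h.h*.d*.c{3,}.d' with the sequence operator '.' removed (an assumption).
PATTERN = re.compile(r"hh*d*c{3,}d")


def symbol(row):
    """SYMBOLS clause: home -> h, checkout -> c, anything else -> d."""
    if row["pagetype"] == "home":
        return "h"
    if row["pagetype"] == "checkout":
        return "c"
    return "d"


def npath_like(rows):
    """PARTITION BY sessionid ORDER BY clicktime, then match per session."""
    results = []
    ordered = sorted(rows, key=itemgetter("sessionid", "clicktime"))
    for sessionid, group in groupby(ordered, key=itemgetter("sessionid")):
        session = list(group)
        encoded = "".join(symbol(r) for r in session)
        for m in PATTERN.finditer(encoded):  # non-overlapping matches
            checkouts = [r for r in session[m.start():m.end()] if symbol(r) == "c"]
            top = max(checkouts, key=itemgetter("productprice"))
            low = min(checkouts, key=itemgetter("productprice"))
            results.append({
                "sessionid": sessionid,
                "most_expensive": top["productname"],
                "max_price": top["productprice"],
                "least_expensive": low["productname"],
                "min_price": low["productprice"],
            })
    return results


def click(sessionid, clicktime, pagetype, name=None, price=0.0):
    """Build one made-up weblog row."""
    return {"sessionid": sessionid, "clicktime": clicktime, "pagetype": pagetype,
            "productname": name, "productprice": price}


clicks = [
    click("s1", 1, "home"),
    click("s1", 2, "product"),
    click("s1", 3, "checkout", "book", 12.0),
    click("s1", 4, "checkout", "lamp", 40.0),
    click("s1", 5, "checkout", "pen", 2.5),
    click("s1", 6, "product"),
    click("s2", 1, "home"),
    click("s2", 2, "checkout", "mug", 8.0),
    click("s2", 3, "checkout", "cap", 15.0),
    click("s2", 4, "product"),
]
matches = npath_like(clicks)
```

Session `s1` encodes to `hdcccd` and matches; session `s2` encodes to `hccd` and does not, because it has only two checkouts.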
  • Example BI Tool Usage – Path Analysis on Data Stored in Aster and Hadoop
  • Example BI Tool Usage – Path Analysis on Data Stored in Aster and Hadoop