SQL-H: A New Way to Enable SQL Analytics on Hadoop
Sushil Thomas
June 2012
Outline


•    HCatalog primer
•    Aster primer
•    SQL-H definition and features
•    SQL-H example usage




2      Confidential and proprietary. Copyright © 2011 Teradata Corporation.
HCatalog Primer
•  HCatalog provides table management and storage
   management for Apache Hadoop
    -  Provides a shared schema and data type mechanism
    -  Provides a table abstraction so that users need not be concerned
       with where or how their data is stored
    -  Provides interoperability across data processing tools such as Pig,
       MapReduce, Streaming, and Hive


•  Uses Hive-like DDL commands and supports tables, views, and
   partitions.

•  Provides parallel load and store interfaces

•  Agnostic to file format of stored data
    -  Currently supports RCFile, CSV text, JSON text, and SequenceFile

HCatalog Primer: Example Syntax

CREATE EXTERNAL TABLE apachelog (
       host STRING, identity STRING, user STRING,
       time STRING, request STRING, status STRING,
       size STRING, referer STRING, agent STRING)
ROW FORMAT
SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "([^ ]*) …")
STORED AS TEXTFILE
LOCATION 'hdfs://data/apachelogs';

Note: This is run via HCatalog interfaces to record the format of data
stored in HDFS for later use by Hive, Pig, etc. This is not run on the Aster
system.
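The RegexSerDe idea in the DDL above can be sketched in Python: each capture group of the regular expression fills one column, in declaration order. The pattern below is a hypothetical one for an Apache combined-format log line, not the elided "input.regex" from the slide, and parse_line is an illustrative helper, not part of Hive or HCatalog.

```python
import re

# Column names taken from the CREATE EXTERNAL TABLE statement on the slide.
COLUMNS = ["host", "identity", "user", "time", "request",
           "status", "size", "referer", "agent"]

# Hypothetical pattern (an assumption, not the slide's elided regex):
# one capture group per column, in the same order as COLUMNS.
LOG_PATTERN = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) \[([^\]]*)\] "([^"]*)" '
    r'([^ ]*) ([^ ]*) "([^"]*)" "([^"]*)"'
)

def parse_line(line):
    """Map one raw log line to a {column: string} row, or None if it does not match."""
    match = LOG_PATTERN.fullmatch(line)
    if match is None:
        return None
    return dict(zip(COLUMNS, match.groups()))

SAMPLE = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /index.html HTTP/1.0" 200 2326 '
          '"http://example.com/start.html" "Mozilla/4.08"')

row = parse_line(SAMPLE)
print(row["host"], row["status"], row["request"])
```

Every column comes back as a string, which matches the all-STRING column list in the table definition.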
HCatalog Primer: Read Flow (Hadoop Job Submission)

[Figure: the Job Controller sends the table name and partitions to the HCatalog server on the HCatalog server node, which returns the splits.]

HCatalog Primer: Read Flow (Hadoop Job Execution)

[Figure: on the processing nodes (running Hive, Pig, or MR jobs), each map task reads one split of the source data and emits tuples.]

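The two read-flow slides above can be sketched as a small Python simulation: job submission turns a table name and a list of partitions into splits, and job execution lets each task turn one split into tuples. The CATALOG dictionary, get_splits, and read_split are hypothetical stand-ins, and treating one file as one split is a simplifying assumption; none of this is the HCatalog API.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for catalog metadata plus HDFS contents:
# table -> partition value -> list of (file name, rows) pairs.
CATALOG = {
    "apachelog": {
        "2012-06-09": [("part-0", [("10.0.0.1", "200")]),
                       ("part-1", [("10.0.0.2", "404")])],
        "2012-06-10": [("part-0", [("10.0.0.3", "200")])],
    }
}

def get_splits(table, partitions):
    """Job submission: table name and partitions in, list of splits out."""
    return [(table, partition, name)
            for partition in partitions
            for name, _rows in CATALOG[table][partition]]

def read_split(split):
    """Job execution: one task turns one split into tuples."""
    table, partition, name = split
    rows = dict(CATALOG[table][partition])[name]
    return list(rows)

splits = get_splits("apachelog", ["2012-06-09", "2012-06-10"])

# Threads stand in for map tasks running in parallel on processing nodes.
with ThreadPoolExecutor(max_workers=3) as pool:
    tuples_per_task = list(pool.map(read_split, splits))

tuples = [t for task in tuples_per_task for t in task]
print(len(splits), "splits ->", len(tuples), "tuples")
```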
Aster Primer

[Figure: the queen node hosts SQL-MapReduce and the SQL engine (parser, optimizer, executor); each worker node hosts several ARC engines, each paired with a data partition; the diagram also labels Inter Cluster Express alongside the worker nodes.]

Aster SQL-H

•  Direct access to HCatalog data within AsterDB
    -  HCatalog tables available without duplicating DDL commands on
       the Aster side


•  HCatalog tables are first-class objects within AsterDB
    -  Full support for all SQL operators


•  We use the HCatalog interfaces to read tuples in parallel on all
   data nodes




Aster Reads From HCatalog (Planning)

[Figure: in the query planning phase, the Aster optimizer sends the table name and partitions to the HCatalog server on the HCatalog server node, which returns the splits.]

Aster Reads From HCatalog (Execution)

[Figure: in the execution phase on a single worker node, each ARC engine with its data partition pulls splits from the HDFS data nodes and receives them as tuples.]

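One way to picture the hand-off between the planning and execution slides is to spread the splits returned by the HCatalog server over the data partitions of a worker. The slides do not say how splits are assigned, so the round-robin policy below is an assumption, and assign_splits and the partition names are hypothetical.

```python
from collections import defaultdict

def assign_splits(splits, data_partitions):
    """Spread splits over data partitions round-robin (an assumed policy)."""
    plan = defaultdict(list)
    for index, split in enumerate(splits):
        target = data_partitions[index % len(data_partitions)]
        plan[target].append(split)
    return dict(plan)

# Six splits and three data partitions, mirroring the two-splits-per-partition figure.
splits = [f"split-{n}" for n in range(6)]
data_partitions = ["worker1/p0", "worker1/p1", "worker1/p2"]

plan = assign_splits(splits, data_partitions)
for partition, assigned in plan.items():
    print(partition, "reads", assigned)
```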
Features – Simple and Comprehensive Support

•  Aster interacts only with the HCatalog master server and HDFS
     -  No MapReduce slots used
     -  Hadoop system can be used for other activity simultaneously


•  Aster runs native HCatalog InputReader code for translating
   HCatalog table names into input splits, and then getting data
   from input splits
     -  No impedance mismatch between the two systems
     -  Everything supported by HCatalog interfaces is supported in Aster


•  Changes made on HCatalog are reflected immediately on the
   Aster side
     -  New tables, modified schemas, new partitions etc. are available
        immediately. No extra steps required.


Features - Usability

•  Full integration with BI tools
     -  Tableau, MicroStrategy (MSTR), etc. now work seamlessly with data in Hadoop


•  Data in Hadoop can now be joined with relational data in your
   Aster system
     -  Previously, using data from multiple systems involved complex ETL
        tasks


•  Full SQL support
     -  HCatalog table data can be inserted into a SQL flow just like native
        table data


•  If desired, provides a load pipeline into Aster from Hadoop


Features – Teradata Aster Analytical Foundation

•  Full suite of Aster Analytical Foundation functions available for
   data in Hadoop
     -  Time-Series/Path Analysis
     -  Statistical Analysis
     -  Relational Analysis
     -  Text Analysis
     -  Clustering Analysis
     -  Data Transformations


•  Makes users productive faster

•  Spend time analyzing data, not building functionality and tools



Features - Performance

•  Partition pruning is transparently supported
     -  select * from hadoop_weblogs where ds='2012-06-10'
       •  If "hadoop_weblogs" is partitioned on 'ds', then this query scans
          only the data in that partition


•  Performance Notes
     -  Data transfer is required, but the network may not be your
        bottleneck. Time taken for the initial data read may be a small part
        of overall query performance
     -  Aster’s native SQL execution engine is a lot faster than Hive’s
        MapReduce-based execution engine
     -  As queries get more complex, the performance advantage increases
     -  If required, the impact on the Hadoop system and network bandwidth
        usage can be tuned down



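Partition pruning as described on the slide above can be sketched in a few lines of Python: when the filter is on the partition column, only the matching partition is read at all. The PARTITIONS layout and the scan helper are hypothetical illustrations, not Aster or HCatalog code.

```python
# Hypothetical layout: one group of rows per value of the partition column ds.
PARTITIONS = {
    "2012-06-09": [{"userid": "u1", "url": "/home"}],
    "2012-06-10": [{"userid": "u2", "url": "/cart"},
                   {"userid": "u3", "url": "/home"}],
    "2012-06-11": [{"userid": "u4", "url": "/checkout"}],
}

def scan(ds=None):
    """Return (rows, partitions_scanned); a filter on ds limits which partitions are read."""
    selected = [ds] if ds is not None else sorted(PARTITIONS)
    rows = []
    scanned = []
    for partition in selected:
        if partition not in PARTITIONS:
            continue  # no such partition, nothing to read
        scanned.append(partition)
        rows.extend({**row, "ds": partition} for row in PARTITIONS[partition])
    return rows, scanned

# Equivalent in spirit to: select * from hadoop_weblogs where ds='2012-06-10'
rows, scanned = scan(ds="2012-06-10")
print("partitions scanned:", scanned, "rows returned:", len(rows))
```

Without the ds filter, the same helper reads all three partitions.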
Example SQL Syntax – Remote Catalog
beehive=> \extl host=hcatalog1.asterdata.com
List of databases
 Name
----------
 prod
 testdb
(2 rows)

beehive=> \extd host=hcatalog1.asterdata.com database=prod
List of tables
 Name
---------
 apachelogs
 movieratings
(2 rows)

Example SQL Syntax – Remote Catalog
beehive=> \extd host=hcatalog1.asterdata.com database=prod table=movieratings
Table "prod"."movieratings"
 Name    | Type   | Partitioned Column
---------+--------+--------------------
 userid  | string | f
 movieid | int    | f
 rating  | double | f
 ds      | string | t
(4 rows)




Example SQL Syntax – HCatalog Data Access

SELECT * FROM load_from_hcatalog(
          ON mr_driver
          server('hcatalog1.asterdata.com')
          dbname('prod')
          tablename('student')
          columns('userid', 'movieid', 'rating'));


CREATE VIEW hadoop_weblogs AS
            SELECT * FROM load_from_hcatalog(
                     ON mr_driver
                     . . .);




Example SQL Syntax – Data Load From HCatalog


CREATE TABLE aster_weblogs DISTRIBUTE BY HASH(userid) AS
             SELECT * FROM hadoop_weblogs;




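The DISTRIBUTE BY HASH(userid) clause in the load statement above places each row according to a hash of its distribution key, so all rows for one user land in the same place. The sketch below illustrates that idea only: the CRC32 hash, the partition count, and the helper names are assumptions, since the slides do not describe Aster's actual hash function.

```python
import zlib

def partition_for(userid, num_partitions):
    """Pick a partition from a stable hash of the distribution key (CRC32 is an assumption)."""
    return zlib.crc32(userid.encode("utf-8")) % num_partitions

def distribute(rows, num_partitions):
    """Place every row in the partition chosen by hashing its userid."""
    placed = {p: [] for p in range(num_partitions)}
    for row in rows:
        placed[partition_for(row["userid"], num_partitions)].append(row)
    return placed

rows = [
    {"userid": "u1", "url": "/home"},
    {"userid": "u2", "url": "/cart"},
    {"userid": "u1", "url": "/checkout"},
]

placed = distribute(rows, 4)
for partition, members in placed.items():
    print(partition, [r["userid"] for r in members])
```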
Example SQL Syntax – Partition Pruning
beehive=> \extd host=hcatalog1.asterdata.com database=prod table=movieratings
Table "prod"."movieratings"
 Name    | Type   | Partitioned Column
---------+--------+--------------------
 userid  | string | f
 movieid | int    | f
 rating  | double | f
 ds      | string | t
(4 rows)


-- Because 'ds' is a partitioned column, the query below
-- will only pull in data from the '2011-06-10' partition
SELECT * FROM hadoop_movieratings
          WHERE ds='2011-06-10';
Example SQL Join Syntax – Complex Queries


-- Join example

select t1.name, t2.page_url, t1.price
from
   aster_product t1,
   hadoop_weblogs t2
where t1.product_id=t2.product_id;




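What the join above returns can be illustrated with a small Python hash join: build a lookup on the relational side, then probe it with each Hadoop weblog row. The sample rows are made up, a hash join is only one possible strategy (the slides do not say which one Aster picks), and the sketch assumes product_id is unique in aster_product.

```python
# Made-up sample rows standing in for the two tables in the query.
aster_product = [
    {"product_id": 1, "name": "lamp", "price": 20.0},
    {"product_id": 2, "name": "desk", "price": 150.0},
]
hadoop_weblogs = [
    {"product_id": 2, "page_url": "/p/desk"},
    {"product_id": 1, "page_url": "/p/lamp"},
    {"product_id": 3, "page_url": "/p/unknown"},
]

def join_products_with_weblogs(products, weblogs):
    """Inner join on product_id, returning (name, page_url, price) tuples."""
    by_id = {p["product_id"]: p for p in products}  # build side
    result = []
    for log in weblogs:  # probe side
        product = by_id.get(log["product_id"])
        if product is not None:
            result.append((product["name"], log["page_url"], product["price"]))
    return result

joined = join_products_with_weblogs(aster_product, hadoop_weblogs)
print(joined)
```

The weblog row with product_id 3 has no matching product, so it is dropped, as in an inner join.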
Example SQL-MapReduce Syntax
-- Find all the sessions with a particular page visit pattern where
-- at least 3 products have been checked out during the session

SELECT * FROM npath(
      ON hadoop_weblogs
      PARTITION BY sessionid ORDER BY clicktime
      MODE(nonoverlapping)
      PATTERN('h.h*.d*.c{3,}.d')
      SYMBOLS(pagetype = 'home' as h, pagetype = 'checkout' as c,
              pagetype <> 'home' and pagetype <> 'checkout' as d)
      RESULT(first(sessionid of c) as sessionid,
             max_choose(productprice, productname of c) as most_expensive,
             max(productprice of c) as max_price,
             min_choose(productprice, productname of c) as least_expensive,
             min(productprice of c) as min_price))
ORDER BY sessionid;


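The npath query above can be approximated in Python to show the moving parts: each row is mapped to a symbol, the rows of one session are ordered by click time, and the pattern is matched against the resulting symbol string. This is a sketch, not Aster's implementation. It assumes that the nPath pattern behaves like the regular expression hh*d*c{3,}d and that max_choose and min_choose return the productname of the row with the highest or lowest productprice; the sample clicks are made up.

```python
import re
from itertools import groupby

# nPath pattern 'h.h*.d*.c{3,}.d' written as a regex over one symbol per row
# (the '.' sequencing operator becomes plain concatenation).
PATTERN = re.compile(r"hh*d*c{3,}d")

def symbol(pagetype):
    """SYMBOLS clause: h for home, c for checkout, d for anything else."""
    if pagetype == "home":
        return "h"
    if pagetype == "checkout":
        return "c"
    return "d"

def find_sessions(clicks):
    """PARTITION BY sessionid ORDER BY clicktime, then match the pattern per session."""
    results = []
    ordered = sorted(clicks, key=lambda r: (r["sessionid"], r["clicktime"]))
    for sessionid, group in groupby(ordered, key=lambda r: r["sessionid"]):
        rows = list(group)
        symbols = "".join(symbol(r["pagetype"]) for r in rows)
        # finditer yields non-overlapping matches, similar to MODE(nonoverlapping).
        for match in PATTERN.finditer(symbols):
            checkouts = [r for r in rows[match.start():match.end()]
                         if r["pagetype"] == "checkout"]
            priciest = max(checkouts, key=lambda r: r["productprice"])
            cheapest = min(checkouts, key=lambda r: r["productprice"])
            results.append({
                "sessionid": sessionid,
                "most_expensive": priciest["productname"],
                "max_price": priciest["productprice"],
                "least_expensive": cheapest["productname"],
                "min_price": cheapest["productprice"],
            })
    return results

def click(sessionid, clicktime, pagetype, productname=None, productprice=None):
    """Build one made-up weblog row."""
    return {"sessionid": sessionid, "clicktime": clicktime, "pagetype": pagetype,
            "productname": productname, "productprice": productprice}

clicks = [
    click("s1", 1, "home"),
    click("s1", 2, "search"),
    click("s1", 3, "checkout", "pen", 2.0),
    click("s1", 4, "checkout", "book", 15.0),
    click("s1", 5, "checkout", "bag", 40.0),
    click("s1", 6, "receipt"),
    click("s2", 1, "home"),
    click("s2", 2, "checkout", "mug", 8.0),
    click("s2", 3, "checkout", "cap", 12.0),
    click("s2", 4, "receipt"),
]

results = find_sessions(clicks)
print(results)
```

Session s1 matches because it has three checkout rows followed by another page; session s2 has only two checkouts and is filtered out.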
Example BI Tool Usage – Path Analysis on Data Stored in Aster and Hadoop




SQL-H a new way to enable SQL analytics

More Related Content

What's hot

Informatica World 2006 - MDM Data Quality
Informatica World 2006 - MDM Data QualityInformatica World 2006 - MDM Data Quality
Informatica World 2006 - MDM Data QualityDatabase Architechs
 
Bi Is Not An Isolated Decision
Bi Is Not An Isolated DecisionBi Is Not An Isolated Decision
Bi Is Not An Isolated DecisionJoseph Lopez
 
Sap sap so h 2013
Sap sap so h 2013Sap sap so h 2013
Sap sap so h 2013deepersnet
 
Innovation Webinar - Using IFS Applications BI to drive business excellence
Innovation Webinar - Using IFS Applications BI to drive business excellenceInnovation Webinar - Using IFS Applications BI to drive business excellence
Innovation Webinar - Using IFS Applications BI to drive business excellenceIFS
 
Agile Business Intelligence
Agile Business IntelligenceAgile Business Intelligence
Agile Business IntelligenceDon Jackson
 
From the Big Data keynote at InCSIghts 2012
From the Big Data keynote at InCSIghts 2012From the Big Data keynote at InCSIghts 2012
From the Big Data keynote at InCSIghts 2012Anand Deshpande
 
Empowering the Business with Agile Analytics
Empowering the Business with Agile AnalyticsEmpowering the Business with Agile Analytics
Empowering the Business with Agile AnalyticsInside Analysis
 
Big Data i CSC's optik, CSC Representative
Big Data i CSC's optik, CSC RepresentativeBig Data i CSC's optik, CSC Representative
Big Data i CSC's optik, CSC RepresentativeIBM Danmark
 
Open Source Solution
Open Source SolutionOpen Source Solution
Open Source Solutionittishait
 
Metadata Use Cases You Can Use
Metadata Use Cases You Can UseMetadata Use Cases You Can Use
Metadata Use Cases You Can Usedmurph4
 
Innovations in SAP BusinessObjects 4.0
Innovations in SAP BusinessObjects 4.0Innovations in SAP BusinessObjects 4.0
Innovations in SAP BusinessObjects 4.0Pierre Leroux
 
Tera stream for datastreams
Tera stream for datastreamsTera stream for datastreams
Tera stream for datastreams치민 최
 
Saleseffectivity and business intelligence
Saleseffectivity and business intelligenceSaleseffectivity and business intelligence
Saleseffectivity and business intelligencemarekdan
 
B13 Driving Business Intelligence John Robson
B13 Driving Business Intelligence John RobsonB13 Driving Business Intelligence John Robson
B13 Driving Business Intelligence John RobsonProvoke Solutions
 
Rationalizing an Enterprise IT Architecture
Rationalizing an Enterprise IT ArchitectureRationalizing an Enterprise IT Architecture
Rationalizing an Enterprise IT ArchitectureBob Rhubart
 
Database Architecture Proposal
Database Architecture ProposalDatabase Architecture Proposal
Database Architecture ProposalDATANYWARE.com
 
Sap Supplier Risk Performance 2011
Sap Supplier Risk  Performance 2011Sap Supplier Risk  Performance 2011
Sap Supplier Risk Performance 2011Henner Schliebs
 

What's hot (20)

Informatica World 2006 - MDM Data Quality
Informatica World 2006 - MDM Data QualityInformatica World 2006 - MDM Data Quality
Informatica World 2006 - MDM Data Quality
 
Bi Is Not An Isolated Decision
Bi Is Not An Isolated DecisionBi Is Not An Isolated Decision
Bi Is Not An Isolated Decision
 
Sap sap so h 2013
Sap sap so h 2013Sap sap so h 2013
Sap sap so h 2013
 
Mobile Analytics
Mobile AnalyticsMobile Analytics
Mobile Analytics
 
Innovation Webinar - Using IFS Applications BI to drive business excellence
Innovation Webinar - Using IFS Applications BI to drive business excellenceInnovation Webinar - Using IFS Applications BI to drive business excellence
Innovation Webinar - Using IFS Applications BI to drive business excellence
 
Agile Business Intelligence
Agile Business IntelligenceAgile Business Intelligence
Agile Business Intelligence
 
Cv D Pietrzak Dpbc En
Cv D Pietrzak Dpbc EnCv D Pietrzak Dpbc En
Cv D Pietrzak Dpbc En
 
From the Big Data keynote at InCSIghts 2012
From the Big Data keynote at InCSIghts 2012From the Big Data keynote at InCSIghts 2012
From the Big Data keynote at InCSIghts 2012
 
Empowering the Business with Agile Analytics
Empowering the Business with Agile AnalyticsEmpowering the Business with Agile Analytics
Empowering the Business with Agile Analytics
 
Big Data i CSC's optik, CSC Representative
Big Data i CSC's optik, CSC RepresentativeBig Data i CSC's optik, CSC Representative
Big Data i CSC's optik, CSC Representative
 
Open Source Solution
Open Source SolutionOpen Source Solution
Open Source Solution
 
Metadata Use Cases You Can Use
Metadata Use Cases You Can UseMetadata Use Cases You Can Use
Metadata Use Cases You Can Use
 
Innovations in SAP BusinessObjects 4.0
Innovations in SAP BusinessObjects 4.0Innovations in SAP BusinessObjects 4.0
Innovations in SAP BusinessObjects 4.0
 
Tera stream for datastreams
Tera stream for datastreamsTera stream for datastreams
Tera stream for datastreams
 
Saleseffectivity and business intelligence
Saleseffectivity and business intelligenceSaleseffectivity and business intelligence
Saleseffectivity and business intelligence
 
B13 Driving Business Intelligence John Robson
B13 Driving Business Intelligence John RobsonB13 Driving Business Intelligence John Robson
B13 Driving Business Intelligence John Robson
 
Kaizentric Presentation
Kaizentric PresentationKaizentric Presentation
Kaizentric Presentation
 
Rationalizing an Enterprise IT Architecture
Rationalizing an Enterprise IT ArchitectureRationalizing an Enterprise IT Architecture
Rationalizing an Enterprise IT Architecture
 
Database Architecture Proposal
Database Architecture ProposalDatabase Architecture Proposal
Database Architecture Proposal
 
Sap Supplier Risk Performance 2011
Sap Supplier Risk  Performance 2011Sap Supplier Risk  Performance 2011
Sap Supplier Risk Performance 2011
 

Similar to SQL-H a new way to enable SQL analytics

Impala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris TsirogiannisImpala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris TsirogiannisFelicia Haggarty
 
Performance evaluation of cloudera impala 0.6 beta with comparison to Hive
Performance evaluation of cloudera impala 0.6 beta with comparison to HivePerformance evaluation of cloudera impala 0.6 beta with comparison to Hive
Performance evaluation of cloudera impala 0.6 beta with comparison to HiveYukinori Suda
 
【旧版】Oracle Exadata Cloud Service:サービス概要のご紹介 [2020年8月版]
【旧版】Oracle Exadata Cloud Service:サービス概要のご紹介 [2020年8月版]【旧版】Oracle Exadata Cloud Service:サービス概要のご紹介 [2020年8月版]
【旧版】Oracle Exadata Cloud Service:サービス概要のご紹介 [2020年8月版]オラクルエンジニア通信
 
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014Modern Data Stack France
 
Architecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric BaldeschwielerArchitecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric Baldeschwielerlucenerevolution
 
Apache Tez -- A modern processing engine
Apache Tez -- A modern processing engineApache Tez -- A modern processing engine
Apache Tez -- A modern processing enginebigdatagurus_meetup
 
Apache Spark: Lightning Fast Cluster Computing
Apache Spark: Lightning Fast Cluster ComputingApache Spark: Lightning Fast Cluster Computing
Apache Spark: Lightning Fast Cluster ComputingAll Things Open
 
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...Cloudera, Inc.
 
Accelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
Accelerate and Scale Big Data Analytics with Disaggregated Compute and StorageAccelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
Accelerate and Scale Big Data Analytics with Disaggregated Compute and StorageAlluxio, Inc.
 
Moving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloudMoving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloudDataWorks Summit/Hadoop Summit
 
Miro Consulting Oracle Exadata Database Machine Offering
Miro Consulting  Oracle Exadata Database Machine OfferingMiro Consulting  Oracle Exadata Database Machine Offering
Miro Consulting Oracle Exadata Database Machine Offeringgarylcoleman
 
2013 feb 20_thug_h_catalog
2013 feb 20_thug_h_catalog2013 feb 20_thug_h_catalog
2013 feb 20_thug_h_catalogAdam Muise
 
Tez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_sahaTez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_sahaData Con LA
 
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBase
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBaseOct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBase
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBaseYahoo Developer Network
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaSwiss Big Data User Group
 
Stsg17 speaker yousunjeong
Stsg17 speaker yousunjeongStsg17 speaker yousunjeong
Stsg17 speaker yousunjeongYousun Jeong
 

Similar to SQL-H a new way to enable SQL analytics (20)

Dancing with the Elephant
Dancing with the ElephantDancing with the Elephant
Dancing with the Elephant
 
Impala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris TsirogiannisImpala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris Tsirogiannis
 
Performance evaluation of cloudera impala 0.6 beta with comparison to Hive
Performance evaluation of cloudera impala 0.6 beta with comparison to HivePerformance evaluation of cloudera impala 0.6 beta with comparison to Hive
Performance evaluation of cloudera impala 0.6 beta with comparison to Hive
 
【旧版】Oracle Exadata Cloud Service:サービス概要のご紹介 [2020年8月版]
【旧版】Oracle Exadata Cloud Service:サービス概要のご紹介 [2020年8月版]【旧版】Oracle Exadata Cloud Service:サービス概要のご紹介 [2020年8月版]
【旧版】Oracle Exadata Cloud Service:サービス概要のご紹介 [2020年8月版]
 
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014
 
Architecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric BaldeschwielerArchitecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric Baldeschwieler
 
Apache Tez -- A modern processing engine
Apache Tez -- A modern processing engineApache Tez -- A modern processing engine
Apache Tez -- A modern processing engine
 
Apache Spark: Lightning Fast Cluster Computing
Apache Spark: Lightning Fast Cluster ComputingApache Spark: Lightning Fast Cluster Computing
Apache Spark: Lightning Fast Cluster Computing
 
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
 
Accelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
Accelerate and Scale Big Data Analytics with Disaggregated Compute and StorageAccelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
Accelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
 
Moving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloudMoving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloud
 
Miro Consulting Oracle Exadata Database Machine Offering
Miro Consulting  Oracle Exadata Database Machine OfferingMiro Consulting  Oracle Exadata Database Machine Offering
Miro Consulting Oracle Exadata Database Machine Offering
 
Druid deep dive
Druid deep diveDruid deep dive
Druid deep dive
 
HTAP Queries
HTAP QueriesHTAP Queries
HTAP Queries
 
2013 feb 20_thug_h_catalog
2013 feb 20_thug_h_catalog2013 feb 20_thug_h_catalog
2013 feb 20_thug_h_catalog
 
Tez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_sahaTez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_saha
 
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBase
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBaseOct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBase
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBase
 
Sybase To Oracle Migration for DBAs
Sybase To Oracle Migration for DBAsSybase To Oracle Migration for DBAs
Sybase To Oracle Migration for DBAs
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Stsg17 speaker yousunjeong
Stsg17 speaker yousunjeongStsg17 speaker yousunjeong
Stsg17 speaker yousunjeong
 

More from DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

SQL-H a new way to enable SQL analytics

  • 1. SQL-H: A New Way to Enable SQL Analytics on Hadoop Sushil Thomas June 2012
  • 2. Outline
      •  HCatalog primer
      •  Aster primer
      •  SQL-H definition and features
      •  SQL-H example usage
    Confidential and proprietary. Copyright © 2011 Teradata Corporation.
  • 3. HCatalog Primer
      •  HCatalog provides table management and storage management for Apache Hadoop
          -  Provides a shared schema and data type mechanism
          -  Provides a table abstraction so that users need not be concerned with where or how their data is stored
          -  Provides interoperability across data processing tools such as Pig, Map Reduce, Streaming, and Hive
      •  Uses Hive-like DDL commands. Supports tables, views, partitions.
      •  Provides parallel load and store interfaces
      •  Agnostic to file format of stored data
          -  Currently supports RCFile, CSV text, JSON text, and SequenceFile
  • 4. HCatalog Primer: Example Syntax
      CREATE EXTERNAL TABLE apachelog (
          host STRING, identity STRING, user STRING,
          time STRING, request STRING, status STRING,
          size STRING, referer STRING, agent STRING)
      ROW FORMAT
      SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
      WITH SERDEPROPERTIES ("input.regex" = "([^]*) …")
      STORED AS TEXTFILE
      LOCATION 'hdfs://data/apachelogs';

      Note: This is run via HCatalog interfaces to record the format of data stored in HDFS for later use by Hive, Pig, etc. This is not run on the Aster system.
  • 5. HCatalog Primer: Read Flow (Hadoop Job Submission)
      [Diagram: Job Controller and HCatalog Server (on the HCatalog Server Node), exchanging "Table Name, Partitions" and "Splits".]
  • 6. HCatalog Primer: Read Flow (Hadoop Job Execution)
      [Diagram: Processing Nodes (running Hive, Pig or MR jobs). Each Map Task reads a Split of Source Data and produces Tuples.]
  • 7. Aster Primer
      [Architecture diagram. Labels: Queen Node, Parser, Optimizer, Executor, Worker Nodes, SQL-MapReduce, SQL Engine, ARC, Data Engine, Partition, Inter Cluster Express.]
  • 8. Aster SQL-H
      •  Direct access to HCatalog data within AsterDB
          -  HCatalog tables available without duplicating DDL commands on the Aster side
      •  HCatalog tables are first class objects within AsterDB
          -  Full support for all SQL operators
      •  We use the HCatalog interfaces to read tuples in parallel on all data nodes
  • 9. Aster Reads From HCatalog (Planning)
      [Diagram, query planning phase: Aster Optimizer and HCatalog Server (on the HCatalog Server Node), exchanging "Table Name, Partitions" and "Splits".]
  • 10. Aster Reads From HCatalog (Execution)
      [Diagram, execution phase on a single worker node: Splits from HDFS Data Nodes arrive as Tuples at each ARC / Data Engine / Partition.]
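The planning and execution phases on slides 9 and 10 can be pictured with a small simulation. This is only a sketch under stated assumptions: the `SPLITS` contents, the function names and the round-robin assignment are all hypothetical, and the slides do not say how Aster actually maps splits to workers.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the split list returned during query planning.
# Split names and row contents are illustrative only.
SPLITS = {
    "split-0": [("u1", 10, 4.0)],
    "split-1": [("u2", 11, 3.5), ("u3", 12, 5.0)],
    "split-2": [("u4", 10, 2.0)],
    "split-3": [("u5", 13, 4.5)],
}


def plan(split_ids, num_workers):
    """Planning phase: assign splits to workers (round-robin is an assumption)."""
    assignment = {worker: [] for worker in range(num_workers)}
    for i, split_id in enumerate(sorted(split_ids)):
        assignment[i % num_workers].append(split_id)
    return assignment


def read_worker(split_ids):
    """Execution phase on one worker: turn its assigned splits into tuples."""
    rows = []
    for split_id in split_ids:
        rows.extend(SPLITS[split_id])
    return rows


def run_query(num_workers=2):
    """Plan once, then let every worker read its own splits in parallel."""
    assignment = plan(SPLITS.keys(), num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        per_worker = list(pool.map(read_worker, assignment.values()))
    return assignment, per_worker
```

Calling `run_query(2)` returns the split assignment together with the tuples each simulated worker read. In this sketch the metadata lookup happens once in `plan`, and only the data reads are spread across workers.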
  • 11. Features – Simple and Comprehensive Support
      •  Interactions with HCatalog master server and HDFS only
          -  No MapReduce slots used
          -  Hadoop system can be used for other activity simultaneously
      •  Aster runs native HCatalog InputReader code for translating HCatalog table names into input splits, and then getting data from input splits
          -  No impedance mismatch between the two systems
          -  Everything supported by HCatalog interfaces is supported in Aster
      •  Changes made on HCatalog are reflected immediately on the Aster side
          -  New tables, modified schemas, new partitions etc. are available immediately. No extra steps required.
  • 12. Features - Usability
      •  Full integration with BI tools
          -  Tableau, MSTR etc. now work with data in Hadoop seamlessly
      •  Data in Hadoop can now be joined with relational data in your Aster system
          -  Previously, using data from multiple systems involved complex ETL tasks
      •  Full SQL support
          -  HCatalog table data can be inserted into a SQL flow just like native table data
      •  If desired, provides a load pipeline into Aster from Hadoop
  • 13. Features – Teradata Aster Analytical Foundation
      •  Full suite of Aster Analytical Foundation functions available for data in Hadoop
          -  Time-Series/Path Analysis
          -  Statistical Analysis
          -  Relational Analysis
          -  Text Analysis
          -  Clustering Analysis
          -  Data Transformations
      •  Makes users productive faster
      •  Spend time analyzing data, not building functionality and tools
  • 14. Features - Performance
      •  Partition pruning is transparently supported
          -  select * from hadoop_weblogs where ds='2012-06-10'
      •  If "hadoop_weblogs" is partitioned on 'ds', then this command will only scan data in this particular partition
      •  Performance Notes
          -  Data transfer is required, but the network may not be your bottleneck. Time taken for the initial data read may be a small part of overall query performance
          -  Aster's native SQL execution engine is a lot faster than Hive's MR-based execution engine
          -  As queries get more complex, the performance advantage increases
          -  If required, the impact on the Hadoop system and network bandwidth usage can be tuned down
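The partition pruning described on slide 14 can be illustrated with a toy model. This is a minimal sketch, assuming a hypothetical in-memory layout (`PARTITIONS`) keyed by the value of the partition column `ds`. It is not how Aster or HCatalog store data.

```python
# Hypothetical partition layout: one entry per value of the partition column 'ds'.
PARTITIONS = {
    "2012-06-09": [("u1", "/home"), ("u2", "/cart")],
    "2012-06-10": [("u1", "/checkout"), ("u3", "/home")],
    "2012-06-11": [("u4", "/home")],
}


def scan(ds=None):
    """Return (rows, partitions_scanned) for an optional equality filter on ds."""
    if ds is None:
        # No predicate on the partition column: every partition is read.
        selected = list(PARTITIONS)
    else:
        # Predicate on the partition column: only the matching partition is read.
        selected = [p for p in PARTITIONS if p == ds]
    rows = [row + (p,) for p in selected for row in PARTITIONS[p]]
    return rows, selected
```

Here `scan("2012-06-10")` touches one partition, while `scan()` touches all three. That difference in data read is the saving the slide refers to.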
  • 15. Example SQL Syntax – Remote Catalog
      beehive=> extl host=hcatalog1.asterdata.com
      List of databases
         Name
      ----------
       prod
       testdb
      (2 rows)

      beehive=> extd host=hcatalog1.asterdata.com database=prod
      List of tables
         Name
      ---------
       apachelogs
       movieratings
      (2 rows)
  • 16. Example SQL Syntax – Remote Catalog
      beehive=> extd host=hcatalog1.asterdata.com database=prod table=movieratings
      Table "prod"."movieratings"
        Name   |  Type   | Partitioned Column
      ---------+---------+--------------------
       userid  | string  | f
       movieid | int     | f
       rating  | double  | f
       ds      | string  | t
      (4 rows)
  • 17. Example SQL Syntax – HCatalog Data Access
      SELECT * FROM load_from_hcatalog(
          ON mr_driver
          server('hcatalog1.asterdata.com')
          dbname('prod')
          tablename('student')
          columns('userid', 'movieid', 'rating'));

      CREATE VIEW hadoop_weblogs AS
      SELECT * FROM load_from_hcatalog(
          ON mr_driver
          . . .);
  • 18. Example SQL Syntax – Data Load From HCatalog
      CREATE TABLE aster_weblogs DISTRIBUTE BY HASH(userid) AS
      SELECT * FROM hadoop_weblogs;
  • 19. Example SQL Syntax – Partition Pruning
      beehive=> extd host=hcatalog1.asterdata.com database=prod table=movieratings
      Table "prod"."movieratings"
        Name   |  Type   | Partitioned Column
      ---------+---------+--------------------
       userid  | string  | f
       movieid | int     | f
       rating  | double  | f
       ds      | string  | t
      (4 rows)

      -- Because 'ds' is a partitioned column, the query below
      -- will only pull in data from the '2011-06-10' partition
      SELECT * FROM hadoop_movieratings
      WHERE ds='2011-06-10';
  • 20. Example SQL Join Syntax – Complex Queries
      -- Join example
      select t1.name, t2.page_url, t1.price
      from
          aster_product t1,
          hadoop_weblogs t2
      where t1.product_id=t2.product_id;
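To see what the join on slide 20 returns, the same query can be run against stand-in tables. This sketch uses Python's built-in `sqlite3` purely as a substitute engine. The column types and sample rows are invented for illustration, and in the real setup `hadoop_weblogs` would be a view over HCatalog data rather than a local table. An `ORDER BY` is added only to make the output order deterministic.

```python
import sqlite3

# In-memory stand-ins for the two tables; schema details and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE aster_product (product_id INTEGER, name TEXT, price REAL);
    CREATE TABLE hadoop_weblogs (product_id INTEGER, page_url TEXT);
    INSERT INTO aster_product VALUES (1, 'lamp', 19.5), (2, 'desk', 120.0);
    INSERT INTO hadoop_weblogs VALUES
        (1, '/p/lamp'), (1, '/p/lamp?ref=ad'), (3, '/p/chair');
""")


def join_rows(connection):
    """Run the slide's join; rows without a partner on the other side drop out."""
    query = """
        SELECT t1.name, t2.page_url, t1.price
        FROM aster_product t1, hadoop_weblogs t2
        WHERE t1.product_id = t2.product_id
        ORDER BY t2.page_url
    """
    return connection.execute(query).fetchall()
```

With this sample data, `join_rows(conn)` returns the two page views of the lamp. The desk has no log entries and the chair has no product row, so neither appears.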
  • 21. Example SQL-MapReduce Syntax
      -- Find all the sessions with a particular page visit pattern where
      -- at least 3 products have been checked out during the session
      SELECT * FROM npath(
          ON hadoop_weblogs
          PARTITION BY sessionid ORDER BY clicktime
          MODE(nonoverlapping)
          PATTERN('h.h*.d*.c{3,}.d')
          SYMBOLS(pagetype = 'home' as h, pagetype = 'checkout' as c,
                  pagetype <> 'home' and pagetype <> 'checkout' as d)
          RESULT(first(sessionid of c) as sessionid,
                 max_choose(productprice, productname of c) as most_expensive,
                 max(productprice of c) as max_price,
                 min_choose(productprice, productname of c) as least_expensive,
                 min(productprice of c) as min_price))
      ORDER BY sessionid;
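The path pattern in the nPath call above can be approximated with an ordinary regular expression over one symbol per click. This is a rough sketch, not the nPath implementation. It assumes that `.` in the pattern means "followed by" and that `max_choose` / `min_choose` return the product name belonging to the highest / lowest price. The function names and the click tuple layout are hypothetical.

```python
import re


def symbol(pagetype):
    """Map a page type to the one-letter symbol used in the slide's SYMBOLS clause."""
    if pagetype == "home":
        return "h"
    if pagetype == "checkout":
        return "c"
    return "d"


# 'h.h*.d*.c{3,}.d' with '.' read as concatenation (an interpretation).
PATTERN = re.compile(r"hh*d*c{3,}d")


def matching_sessions(clicks):
    """clicks: (sessionid, clicktime, pagetype, productname, productprice) tuples."""
    sessions = {}
    # Equivalent of PARTITION BY sessionid ORDER BY clicktime.
    for click in sorted(clicks, key=lambda c: (c[0], c[1])):
        sessions.setdefault(click[0], []).append(click)

    results = []
    for sessionid, rows in sessions.items():
        encoded = "".join(symbol(r[2]) for r in rows)
        # finditer yields non-overlapping matches, mirroring MODE(nonoverlapping).
        for match in PATTERN.finditer(encoded):
            window = rows[match.start():match.end()]
            checkouts = [r for r in window if r[2] == "checkout"]
            priciest = max(checkouts, key=lambda r: r[4])
            cheapest = min(checkouts, key=lambda r: r[4])
            results.append(
                (sessionid, priciest[3], priciest[4], cheapest[3], cheapest[4])
            )
    return sorted(results)
```

Each result tuple corresponds to the slide's `sessionid`, `most_expensive`, `max_price`, `least_expensive` and `min_price` columns. A session matches only if it visits the home page, then optionally other pages, then has at least three consecutive checkout clicks followed by one non-home, non-checkout page.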
  • 22. Example BI Tool Usage – Path Analysis on Data Stored in Aster and Hadoop
  • 23. Example BI Tool Usage – Path Analysis on Data Stored in Aster and Hadoop