Page 1 © Hortonworks Inc. 2014
Discover HDP 2.1
Interactive SQL Query in Hadoop with Apache Hive
Hortonworks. We do Hadoop.
Page 2 © Hortonworks Inc. 2014
Speakers
Justin Sears
Hortonworks Product Marketing Manager
Carter Shanklin
Hortonworks Director of Product Management & PM for Apache Hive in Hortonworks Data Platform
Owen O’Malley
Hortonworks Co-Founder, Engineer & Committer for Apache Hive project
Page 3 © Hortonworks Inc. 2014
A Modern Data Architecture
APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
DATA SYSTEM: REPOSITORIES (RDBMS, EDW, MPP); ENTERPRISE HADOOP (Governance & Integration, Security, Operations, Data Access, Data Management)
SOURCES: OLTP, ERP, CRM Systems; Documents, Emails; Web Logs, Click Streams; Social Networks; Machine Generated; Sensor Data; Geolocation Data
OPERATIONS TOOLS: Provision, Manage & Monitor
DEV & DATA TOOLS: Build & Test
Page 4 © Hortonworks Inc. 2014
HDP 2.1: Enterprise Hadoop
HDP 2.1 Hortonworks Data Platform
GOVERNANCE & INTEGRATION: Data Workflow, Lifecycle & Governance (Falcon, Sqoop, Flume, NFS, WebHDFS)
DATA ACCESS: Batch (MapReduce), Script (Pig), SQL (Hive/Tez, HCatalog), NoSQL (HBase, Accumulo), Stream (Storm), Search (Solr), Others (In-Memory Analytics, ISV engines)
DATA MANAGEMENT: YARN: Data Operating System; HDFS (Hadoop Distributed File System) across nodes 1 to N
SECURITY: Authentication, Authorization, Accounting, Data Protection (Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox)
OPERATIONS: Provision, Manage & Monitor (Ambari, Zookeeper); Scheduling (Oozie)
Page 5 © Hortonworks Inc. 2014
HDP 2.1: Enterprise Hadoop
HDP 2.1 Hortonworks Data Platform
(Platform diagram repeated from Page 4; the components listed separately on this slide are:)
YARN: Data Operating System
DATA ACCESS: SQL (Hive/Tez, HCatalog)
Page 6 © Hortonworks Inc. 2014
Apache Hive After the Stinger Initiative:
Speed, Scale & SQL Compliance
Page 7 © Hortonworks Inc. 2014
Hive: SQL Analytics For Any Data Size
Data sources: Sensor, Mobile, Weblog, Operational / MPP
Store and Query all Data in Hive
Use Existing SQL Tools and Existing SQL Processes
SQL Queries
Page 8 © Hortonworks Inc. 2014
The Stinger Initiative: Complete
• Community initiative around Hive
• Enables Hive to support interactive workloads
• Enhances Hive’s standard SQL interface for Hadoop
• Improves existing tools & preserves investments
Query Processing (Vectorized Query) + Execution Engine (Tez) + File Format (ORCFile) = 100X+
Page 9 © Hortonworks Inc. 2014
New in Hive HDP 2.1: Speed
New Features for Speed
Interactive query using Hive on Tez
Vectorized query execution
Cost-based optimizer
Page 10 © Hortonworks Inc. 2014
New in HDP 2.1: More Than 10 New SQL Features
New SQL Features
Subquery for IN / NOT IN
Support for EXISTS and NOT EXISTS
Common table expressions (CTEs)
Support for CHAR datatype
Scale and precision support for DECIMAL datatype
JOIN conditions in the WHERE clause
Cancel jobs via ODBC / JDBC
Support for Unicode column names
Permanent functions
Stream data into Hive from Flume (Experimental feature)
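
Several of the SQL constructs listed above (a subquery in IN, NOT EXISTS, a common table expression, and a join condition written in the WHERE clause) can be sketched in a few lines. This is a minimal illustration only: it runs against an in-memory SQLite database as a stand-in, not against Hive, and the tables `customer` and `sales` and their columns are hypothetical. Hive's own syntax for these features may differ in detail.

```python
import sqlite3


def run_examples():
    """Run three small queries illustrating the SQL constructs named on the slide.

    SQLite is used here as a stand-in engine (assumption); the table and
    column names are hypothetical and not taken from the presentation.
    """
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.executescript(
        """
        CREATE TABLE customer (customer_id INTEGER, name TEXT);
        CREATE TABLE sales (customer_id INTEGER, amount REAL);
        INSERT INTO customer VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
        INSERT INTO sales VALUES (1, 10.0), (1, 5.0), (2, 7.5);
        """
    )
    results = {}

    # Subquery for IN: customers that have at least one sale.
    results["in_subquery"] = cur.execute(
        """
        SELECT name FROM customer
        WHERE customer_id IN (SELECT customer_id FROM sales)
        ORDER BY name
        """
    ).fetchall()

    # NOT EXISTS with a correlated subquery: customers with no sales.
    results["not_exists"] = cur.execute(
        """
        SELECT name FROM customer c
        WHERE NOT EXISTS (
            SELECT 1 FROM sales s WHERE s.customer_id = c.customer_id
        )
        """
    ).fetchall()

    # Common table expression, joined through a condition in the WHERE clause.
    results["cte"] = cur.execute(
        """
        WITH totals AS (
            SELECT customer_id, SUM(amount) AS total
            FROM sales
            GROUP BY customer_id
        )
        SELECT c.name, t.total
        FROM customer c, totals t
        WHERE c.customer_id = t.customer_id
        ORDER BY c.name
        """
    ).fetchall()

    conn.close()
    return results


if __name__ == "__main__":
    for label, rows in run_examples().items():
        print(label, rows)
```

The same query text could be tried from a Hive client, with adjustments for Hive's dialect where needed.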
Page 11 © Hortonworks Inc. 2014
Hive’s Journey to SQL Compliance
Evolution of SQL Compliance in Hive

SQL Datatypes                  SQL Semantics
INT/TINYINT/SMALLINT/BIGINT    SELECT, INSERT
FLOAT/DOUBLE                   GROUP BY, ORDER BY, HAVING
BOOLEAN                        JOIN on explicit join key
ARRAY, MAP, STRUCT, UNION      Inner, outer, cross and semi joins
STRING                         Sub-queries in the FROM clause
BINARY                         ROLLUP and CUBE
TIMESTAMP                      UNION
DECIMAL                        Standard aggregations (sum, avg, etc.)
DATE                           Custom Java UDFs
VARCHAR                        Windowing functions (OVER, RANK, etc.)
CHAR                           Advanced UDFs (ngram, XPath, URL)
Interval Types                 Sub-queries for IN/NOT IN, HAVING
                               JOINs in WHERE Clause
                               Common Table Expressions (WITH Clause)
                               INSERT / UPDATE / DELETE

Legend: Available, Roadmap; Hive 11, Hive 12, Hive 13
Page 12 © Hortonworks Inc. 2014
New in HDP 2.1: Other Improvements
Other New Hive Features
SQL standard authorization
Hive job visualizer in Ambari
PAM authentication support
SSL encryption support in HiveServer2
Dynamic partition scalability
Page 13 © Hortonworks Inc. 2014
Demo
Page 14 © Hortonworks Inc. 2014
FoodMart Dataset
• FoodMart Dataset, replicated 275 times (~ 10GB data)
• Queries run locally on an HDP 2.1 Sandbox.
• Queries to do some customer analytics.
Tables: sales_fact_1997, customer, time_by_day, other dimension tables
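
A customer-analytics query over the tables named on this slide might look like the sketch below. The table names `sales_fact_1997`, `customer` and `time_by_day` come from the slide; every column name (`customer_id`, `time_id`, `store_sales`, `fullname`, `quarter`) and all sample rows are assumptions made for illustration. SQLite stands in for Hive so the example is self-contained, and this is not the query shown in the demo.

```python
import sqlite3


def build_sample_db():
    """Create a tiny in-memory database shaped like the slide's tables.

    Column names and rows are hypothetical (assumption), chosen only so
    the query below has something to run against.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customer (customer_id INTEGER, fullname TEXT);
        CREATE TABLE time_by_day (time_id INTEGER, quarter TEXT);
        CREATE TABLE sales_fact_1997 (
            customer_id INTEGER, time_id INTEGER, store_sales REAL
        );
        INSERT INTO customer VALUES (1, 'Ann Lee'), (2, 'Bob Ray');
        INSERT INTO time_by_day VALUES (1, 'Q1'), (2, 'Q2');
        INSERT INTO sales_fact_1997 VALUES
            (1, 1, 4.0), (1, 1, 6.0), (2, 1, 3.0), (1, 2, 9.0);
        """
    )
    return conn


def top_customers_by_quarter(conn, quarter):
    """Return (customer name, total sales) for one quarter, largest first."""
    query = """
        SELECT c.fullname, SUM(s.store_sales) AS total_sales
        FROM sales_fact_1997 s
        JOIN customer c ON s.customer_id = c.customer_id
        JOIN time_by_day t ON s.time_id = t.time_id
        WHERE t.quarter = ?
        GROUP BY c.customer_id, c.fullname
        ORDER BY total_sales DESC
    """
    return conn.execute(query, (quarter,)).fetchall()


if __name__ == "__main__":
    db = build_sample_db()
    print(top_customers_by_quarter(db, "Q1"))
    db.close()
```

The fact table is joined to two dimension tables and aggregated per customer, which is the general shape of star-schema analytics that a dataset like FoodMart is typically used to demonstrate.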
Page 15 © Hortonworks Inc. 2014
Learn More About Hive & The Stinger Initiative
Hortonworks.com/labs/stinger/
Register for the remaining 5 Discover HDP 2.1 Webinars: Hortonworks.com/webinars
Next Webinar: Apache Falcon for Data Governance in Hadoop
Wednesday, May 21, 10am Pacific
