Discover HDP 2.1: Interactive SQL Query in Hadoop with Apache Hive

In February 2013, the open source community launched the Stinger Initiative to improve speed, scale and SQL semantics in Apache Hive. After thirteen months of constant, concerted collaboration (and more than 390,000 new lines of Java code) Stinger is complete with Hive 0.13.

In this presentation, Carter Shanklin, Hortonworks director of product management, and Owen O'Malley, Hortonworks co-founder and committer to Apache Hive, discuss how Hive enables interactive query using familiar SQL semantics.

Published in: Software, Technology

  1. © Hortonworks Inc. 2014. Discover HDP 2.1: Interactive SQL Query in Hadoop with Apache Hive. Hortonworks. We do Hadoop.
  2. Speakers: Justin Sears, Hortonworks Product Marketing Manager; Carter Shanklin, Hortonworks Director of Product Management and PM for Apache Hive in Hortonworks Data Platform; Owen O’Malley, Hortonworks Co-Founder, Engineer and Committer for the Apache Hive project.
  3. A Modern Data Architecture (diagram). Applications: Business Analytics, Custom Applications, Packaged Applications. Data system: repositories (RDBMS, EDW, MPP) alongside Enterprise Hadoop (Governance & Integration, Security, Operations, Data Access, Data Management). Sources: OLTP, ERP, CRM systems; documents, emails; web logs, click streams; social networks; machine generated; sensor data; geolocation data. Operations tools: provision, manage & monitor. Dev & data tools: build & test.
  4. HDP 2.1: Enterprise Hadoop (Hortonworks Data Platform diagram). Governance & Integration: data workflow, lifecycle & governance (Falcon, Sqoop, Flume, NFS, WebHDFS). Data Access: Batch (MapReduce), Script (Pig), SQL (Hive/Tez, HCatalog), NoSQL (HBase, Accumulo), Stream (Storm), Search (Solr), Others (in-memory analytics, ISV engines). Data Management: YARN (Data Operating System) and HDFS (Hadoop Distributed File System). Security: authentication, authorization, accounting, data protection, covering Storage (HDFS), Resources (YARN), Access (Hive, …), Pipeline (Falcon), Cluster (Knox). Operations: provision, manage & monitor (Ambari, ZooKeeper); scheduling (Oozie).
  5. HDP 2.1: Enterprise Hadoop (the same platform diagram as slide 4, highlighting YARN, the Data Operating System, and SQL data access via Hive/Tez and HCatalog).
  6. Apache Hive After the Stinger Initiative: Speed, Scale & SQL Compliance
  7. Hive: SQL Analytics For Any Data Size. Data sources: sensor, mobile, weblog, operational / MPP. Store and query all data in Hive. Use existing SQL tools and existing SQL processes (SQL queries).
  8. The Stinger Initiative: Complete. • Community initiative around Hive • Enables Hive to support interactive workloads • Enhances Hive’s standard SQL interface for Hadoop • Improves existing tools & preserves investments. Query processing (vectorized query execution) + engine (Tez) + file format (ORCFile) = 100X+.
  9. New in Hive HDP 2.1: Speed. New features for speed: interactive query using Hive on Tez; vectorized query execution; cost-based optimizer.
  10. New in HDP 2.1: More Than 10 New SQL Features. Subquery for IN / NOT IN; support for EXISTS and NOT EXISTS; common table expressions (CTEs); support for CHAR datatype; scale and precision support for DECIMAL datatype; JOIN conditions in the WHERE clause; cancel jobs via ODBC / JDBC; support for Unicode column names; permanent functions; stream data into Hive from Flume (experimental feature).
  11. Hive’s Journey to SQL Compliance: Evolution of SQL Compliance in Hive. SQL datatypes: INT/TINYINT/SMALLINT/BIGINT; FLOAT/DOUBLE; BOOLEAN; ARRAY, MAP, STRUCT, UNION; STRING; BINARY; TIMESTAMP; DECIMAL; DATE; VARCHAR; CHAR; interval types. SQL semantics: SELECT, INSERT; GROUP BY, ORDER BY, HAVING; JOIN on explicit join key; inner, outer, cross and semi joins; sub-queries in the FROM clause; ROLLUP and CUBE; UNION; standard aggregations (sum, avg, etc.); custom Java UDFs; windowing functions (OVER, RANK, etc.); advanced UDFs (ngram, XPath, URL); sub-queries for IN/NOT IN, HAVING; JOINs in WHERE clause; common table expressions (WITH clause); INSERT / UPDATE / DELETE. Legend: Available, Roadmap; Hive 11, Hive 12, Hive 13.
  12. New in HDP 2.1: Other Improvements. Other new Hive features: SQL standard authorization; Hive job visualizer in Ambari; PAM authentication support; SSL encryption support in HiveServer2; dynamic partition scalability.
  13. Demo
  14. FoodMart Dataset. • FoodMart dataset, replicated 275 times (~10 GB of data) • Queries run locally on an HDP 2.1 Sandbox • Queries to do some customer analytics. Tables: sales_fact_1997, customer, time_by_day, other dimension tables.
  15. Learn More About Hive & The Stinger Initiative: Hortonworks.com/labs/stinger/. Register for the remaining 5 Discover HDP 2.1 webinars: Hortonworks.com/webinars. Next webinar: Apache Falcon for Data Governance in Hadoop, Wednesday, May 21, 10am Pacific.
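
Several of the SQL features listed on slide 10 (common table expressions, IN subqueries, NOT EXISTS, and JOIN conditions in the WHERE clause) can be sketched against a toy version of the FoodMart tables named on slide 14. This is a minimal illustration, not the webinar's demo. It runs on Python's built-in sqlite3 module as a stand-in for Hive, so it shows only the shape of the queries and says nothing about Hive's behavior or performance. The table names come from the slides; the columns (`customer_id`, `fullname`, `store_sales`) and all rows are assumptions made up for the example.

```python
import sqlite3

# SQLite stands in for Hive so the sketch runs anywhere. Table names come
# from the FoodMart slide; the columns and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER, fullname TEXT);
    CREATE TABLE sales_fact_1997 (customer_id INTEGER, store_sales REAL);
    INSERT INTO customer VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO sales_fact_1997 VALUES (1, 10.0), (1, 5.0), (2, 2.5);
""")

# Common table expression (WITH clause) feeding an IN subquery:
# customers whose total sales exceed 10.
big_spenders = conn.execute("""
    WITH totals AS (
        SELECT customer_id, SUM(store_sales) AS total
        FROM sales_fact_1997
        GROUP BY customer_id
    )
    SELECT fullname FROM customer
    WHERE customer_id IN (SELECT customer_id FROM totals WHERE total > 10)
    ORDER BY fullname
""").fetchall()

# Correlated NOT EXISTS: customers with no sales rows at all.
no_sales = conn.execute("""
    SELECT fullname FROM customer c
    WHERE NOT EXISTS (
        SELECT 1 FROM sales_fact_1997 s WHERE s.customer_id = c.customer_id
    )
""").fetchall()

# Join condition written in the WHERE clause instead of an ON clause.
joined = conn.execute("""
    SELECT c.fullname, s.store_sales
    FROM customer c, sales_fact_1997 s
    WHERE c.customer_id = s.customer_id
    ORDER BY c.fullname, s.store_sales
""").fetchall()

print(big_spenders, no_sales, joined)
```

The same SELECT statements use only constructs that the slides list as supported in Hive 0.13, but they have not been run against Hive here, and the real FoodMart schema has many more columns.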
