Discover HDP 2.2: Even Faster SQL Queries with Apache Hive and Stinger.next

Earlier this year, the Apache open source community delivered the Stinger Initiative to improve speed, scale and SQL semantics in Apache Hive. Now Stinger.next is underway to build on those initial successes.

In this presentation, from a webinar hosted by Hortonworks co-founder Alan Gates and Hortonworks Hive product manager Raj Bains, you can learn more about Stinger.next and innovation in Apache Hive.

Alan and Raj cover new Hive functionality for more speed, scale and SQL in HDP 2.2. Specific topics include transactions with ACID semantics, the cost based optimizer and dynamic query optimizations.

The presentation also shows future plans for the Stinger.next initiative.

Published in: Software

  1. 1. Discover HDP 2.2: Even Faster SQL Queries with Apache Hive & Stinger.next Page 1 © Hortonworks Inc. 2014 Hortonworks. We do Hadoop.
  2. 2. Speakers
      • Justin Sears, Hortonworks Product Marketing Manager
      • Alan Gates, Hortonworks Co-Founder and Apache Hive Committer & PMC Member
      • Raj Bains, Hortonworks Senior Manager of Product Management for Apache Hive
  3. 3. Agenda
      • Introduction to Stinger.next
      • New Innovation in Apache Hive 0.14
        § SQL: Transactions with ACID semantics
        § Speed: Cost based optimizer for star and bushy joins
        § Scale: Dynamic query optimizations
      • The Road Ahead for Stinger.next
      • Q & A
      We’ll move quickly:
      • Attendee phone lines are muted
      • Text any questions to Raj Bains using Webex chat
      • Questions answered at the end
      • Unanswered questions and answers in upcoming blog post
  4. 4. Big Data, Hadoop & Data Center Re-platforming
      Business Drivers
      • From reactive analytics to proactive interactions
      • Insights that drive competitive advantage & optimal returns
      Financial Drivers
      • Cost of data systems, as % of IT spend, continues to grow
      • Cost advantages of commodity hardware & open source software
      Technical Drivers
      • Data is growing exponentially & existing systems overwhelmed
      • Predominantly driven by NEW types of data that can inform analytics
      There is an inequitable balance between vendor and customer in the market
  5. 5. Hadoop Value: New Types of Data
      • Clickstream: Capture and analyze website visitors’ data trails and optimize your website
      • Sensors: Discover patterns in data streaming automatically from remote sensors and machines
      • Server Logs: Research logs to diagnose process failures and prevent security breaches
      • Sentiment: Understand how your customers feel about your brand and products – right now
      • Geographic: Analyze location-based data to manage operations where they occur
      • Unstructured: Understand patterns in files across millions of web pages, emails, and documents
  6. 6. A Shift from Reactive to Proactive Interactions
      HDP and Hadoop allow organizations to use data to shift interactions from Reactive (Post Transaction) to Proactive (Pre Decision)
      • A shift in Advertising: from mass branding …to 1x1 Targeting
      • A shift in Financial Services: from Educated Investing …to Automated Algorithms
      • A shift in Healthcare: from mass treatment …to Designer Medicine
      • A shift in Retail: from static branding …to Real-time Personalization
      • A shift in Telco: from break then fix …to repair before break
  7. 7. Enterprise Goals for the Modern Data Architecture
      • Consolidate siloed data sets, structured and unstructured
      • Central data set on a single cluster
      • Multiple workloads across batch, interactive and real time
      • Central services for security, governance and operation
      • Preserve existing investment in current tools and platforms
      • Single view of the customer, product, supply chain
      [Diagram: SOURCES (existing systems such as CRM, ERP and other; Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured) feed the DATA SYSTEM (RDBMS, EDW, MPP alongside YARN: Data Operating System on HDFS (Hadoop Distributed File System)), which serves APPLICATIONS (Business Analytics, Custom Applications, Packaged Applications) across Batch, Interactive and Real-Time workloads]
  8. 8. YARN Transformed Hadoop & Opened a New Era
      YARN: The Architectural Center of Hadoop
      • Common data platform, many applications
      • Support multi-tenant access & processing
      • Batch, interactive & real-time use cases
      [Diagram: BATCH, INTERACTIVE & REAL-TIME DATA ACCESS engines: Script (Pig), SQL (Hive), Java/Scala (Cascading), Stream (Storm), Search (Solr), NoSQL (HBase, Accumulo), In-Memory (Spark), Others (ISV Engines), with Tez and Slider as intermediate layers, on YARN: Data Operating System (Cluster Resource Management) and HDFS (Hadoop Distributed File System)]
  9. 9. YARN Extends Hadoop to Other Data Center Leaders
      YARN: The Architectural Center of Hadoop
      • Common data platform, many applications
      • Support multi-tenant access & processing
      • Batch, interactive & real-time use cases
      • Supports 3rd-party ISV tools (ex. SAS, Syncsort, Actian, etc.)
      YARN Ready Applications: Facilitates ongoing innovation and enterprise adoption via ecosystem of new and existing “YARN Ready” solutions
      [Diagram: BATCH, INTERACTIVE & REAL-TIME DATA ACCESS engines: Script (Pig), SQL (Hive), Java/Scala (Cascading), NoSQL (HBase, Accumulo), Stream (Storm), In-Memory (Spark), Search (Solr), Others (ISV Engines), with Tez and Slider as intermediate layers, on YARN: Data Operating System (Cluster Resource Management) and HDFS (Hadoop Distributed File System)]
  10. 10. Enterprise Hadoop: Central Set of Services
      Enables Apache Hadoop to be an Enterprise Data Platform with centralized services for:
      • Governance: Load data and manage according to policy
      • Operations: Deploy and effectively manage the platform (Provision, Manage & Monitor: Ambari, Zookeeper; Scheduling: Oozie)
      • Security: Provide layered approach to security through Authentication, Authorization, Accounting, and Data Protection
      Everything that plugs into Hadoop inherits these services
      [Diagram: GOVERNANCE, SECURITY and OPERATIONS surrounding BATCH, INTERACTIVE & REAL-TIME DATA ACCESS engines: Script (Pig), SQL (Hive), Java/Scala (Cascading), Stream (Storm), Search (Solr), NoSQL (HBase, Accumulo), In-Memory (Spark), Others (ISV Engines), with Tez and Slider, on YARN: Data Operating System (Cluster Resource Management) and HDFS (Hadoop Distributed File System)]
  11. 11. Hortonworks Development Investment for the Enterprise: Vertical Integration with YARN and HDFS
      • Ensure engines can run reliably and respectfully in a YARN based cluster
      • Implement features throughout the stack to accommodate
      Central services shown:
      • Governance: Load data and manage according to policy
      • Operations: Deploy and effectively manage the platform (Provision, Manage & Monitor: Ambari, Zookeeper; Scheduling: Oozie)
      • Security: Provide layered approach to security through Authentication, Authorization, Accounting, and Data Protection
      [Diagram: BATCH, INTERACTIVE & REAL-TIME DATA ACCESS engines: Script (Pig), SQL (Hive), Java/Scala (Cascading), Stream (Storm), Search (Solr), NoSQL (HBase, Accumulo), In-Memory (Spark), Others (ISV Engines), with Tez and Slider, on YARN: Data Operating System (Cluster Resource Management) and HDFS (Hadoop Distributed File System)]
  12. 12. Hortonworks Development Investment for the Enterprise: Horizontal Integration for Enterprise Services
      • Ensure consistent enterprise services are applied across the entire Hadoop stack
      • Integrate with and extend existing data center solutions for these key requirements
      Central services shown:
      • Governance: Load data and manage according to policy
      • Operations: Deploy and effectively manage the platform (Provision, Manage & Monitor: Ambari, Zookeeper; Scheduling: Oozie)
      • Security: Provide layered approach to security through Authentication, Authorization, Accounting, and Data Protection
      [Diagram: BATCH, INTERACTIVE & REAL-TIME DATA ACCESS engines: Script (Pig), SQL (Hive), Java/Scala (Cascading), Stream (Storm), Search (Solr), NoSQL (HBase, Accumulo), In-Memory (Spark), Others (ISV Engines), with Tez and Slider, on YARN: Data Operating System (Cluster Resource Management) and HDFS (Hadoop Distributed File System)]
  13. 13. HDP Delivers Enterprise Hadoop: Hortonworks Data Platform 2.2
      YARN is the architectural center of HDP
      • Common data set across all applications
      • Batch, interactive & real-time workloads
      • Multi-tenant access & processing
      Provides comprehensive enterprise capabilities
      • Governance
      • Security
      • Operations
      Enables broad ecosystem adoption
      • ISVs can plug directly into Hadoop
      The widest range of deployment options
      • Linux & Windows
      • On premises & cloud
      [Diagram: GOVERNANCE (Data Workflow, Lifecycle & Governance: Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS); BATCH, INTERACTIVE & REAL-TIME DATA ACCESS (Script: Pig; SQL: Hive; Java/Scala: Cascading; Stream: Storm; Search: Solr; NoSQL: HBase, Accumulo; In-Memory: Spark; Others: ISV Engines; with Tez and Slider) on YARN: Data Operating System (Cluster Resource Management) and HDFS (Hadoop Distributed File System); SECURITY (Authentication, Authorization, Audit, Data Protection; Storage: HDFS, Resources: YARN, Access: Hive, Pipeline: Falcon, Cluster: Ranger, Cluster: Knox); OPERATIONS (Provision, Manage & Monitor: Ambari, Zookeeper; Scheduling: Oozie); Deployment Choice (Linux, Windows, On-Premises, Cloud)]
  14. 14. HDP Delivers Enterprise Hadoop: Hortonworks Data Platform 2.2
      YARN is the architectural center of HDP
      • Common data set across all applications
      • Batch, interactive & real-time workloads
      • Multi-tenant access & processing
      Provides comprehensive enterprise capabilities
      • Governance
      • Security
      • Operations
      Enables broad ecosystem adoption
      • ISVs can plug directly into Hadoop
      The widest range of deployment options
      • Linux & Windows
      • On premises & cloud
      [Diagram: GOVERNANCE (Data Workflow, Lifecycle & Governance: Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS); OPERATIONS (Provision, Manage & Monitor: Ambari, Zookeeper; Scheduling: Oozie); BATCH, INTERACTIVE & REAL-TIME DATA ACCESS (Script: Pig; SQL: Hive; Java/Scala: Cascading; Stream: Storm; Search: Solr; NoSQL: HBase, Accumulo; In-Memory: Spark; Others: ISV Engines; with Tez and Slider) on YARN: Data Operating System (Cluster Resource Management) and HDFS (Hadoop Distributed File System); SECURITY (Authentication, Authorization, Audit, Data Protection; Storage: HDFS, Resources: YARN, Access: Hive, Pipeline: Falcon, Cluster: Ranger, Cluster: Knox); Deployment Choice (Linux, Windows, On-Premises, Cloud)]
  15. 15. Introduction to Stinger.next
  16. 16. Stinger.next – Enterprise SQL at Hadoop Scale
      Stinger (Hive 0.13, Tez, ORC File)
      • Scale to Petabytes
      • Batch to Interactive Queries
      • Read-Only Data
      • Substantial SQL Support
      • Single Tool for Multiple SQL workloads – Interactive, Reporting and ETL
      • MapReduce, Tez Engines
      Stinger.next
      • Scale to Petabytes
      • Sub-Second Queries
      • Modify Data with Transactions
      • Comprehensive SQL:2011 Analytics
      • Single Tool for Multiple SQL workloads – Interactive, Reporting, ETL, ML
      • MapReduce, Tez, Spark Engines
  17. 17. SQL in Hive 0.14: Transactions with ACID Semantics
  18. 18. Transaction Use Cases
      Reporting with Analytics (YES)
      • Reporting on data with occasional updates
      • Corrections to the fact tables, evolving dimension tables
      • Low concurrency updates, low TPS
      Operational Reporting (YES, next)
      • High throughput ingest from operational (OLTP) database
      • Periodic inserts every 5-30 minutes
      • Requires tool support and changes in our Transactions
      Operational (OLTP) Database (NO)
      • Small Transactions, each doing single line inserts
      • High Concurrency - Hundreds to thousands of connections
      [Diagram labels: Analytics Modifications, Hive; OLTP Replication, Hive; Hive, High Concurrency OLTP]
  19. 19. Deep Dive: Transactions
      Transaction Support in Hive with ACID semantics
      • Hive native support for INSERT, UPDATE, DELETE.
      • Split Into Phases:
        • Phase 1: Hive Streaming Ingest (append) [Done]
        • Phase 2: INSERT / UPDATE / DELETE Support [HDP 2.2]
        • Phase 3: BEGIN / COMMIT / ROLLBACK Txn [Next]
      [Diagram: Read-Optimized ORCFile, Delta File, Merged Read-Optimized ORCFile]
      1. Original File: Task reads the latest ORCFile
      2. Edits Made: Task reads the ORCFile and merges the delta file with the edits
      3. Edits Merged: Task reads the updated ORCFile
      Hive ACID Compactor periodically merges the delta files in the background.
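As a rough illustration of Phase 2 on slide 19, the HiveQL below sketches how a transactional table might be created and modified in Hive 0.14. The table `customer_dim` and its columns are hypothetical and do not appear in the slides. The session settings are the ones the Hive 0.14 documentation lists for enabling ACID; compaction is assumed to be switched on separately in the metastore configuration.

```sql
-- Session settings documented for ACID support in Hive 0.14
SET hive.support.concurrency=true;
SET hive.enforce.bucketing=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- Hypothetical dimension table; in Hive 0.14 an ACID table must be
-- bucketed, stored as ORC, and flagged as transactional
CREATE TABLE customer_dim (
  id    INT,
  name  STRING,
  state STRING
)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Each statement runs as its own transaction and writes a delta file
INSERT INTO TABLE customer_dim VALUES (1, 'Ann', 'CA'), (2, 'Bob', 'NY');
UPDATE customer_dim SET state = 'WA' WHERE id = 2;
DELETE FROM customer_dim WHERE id = 1;
```

Readers merge the base ORC file with these deltas until the compactor folds them together, which matches the three-step read path on the slide.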
  20. 20. Speed in Hive 0.14: Cost Based Optimizer
  21. 21. TPC-DS Query 17
      SELECT i_item_id,
             i_item_desc,
             s_state,
             Count(ss_quantity) AS store_sales_quantitycount,
             Avg(ss_quantity) AS store_sales_quantityave,
             Stddev_samp(ss_quantity) AS store_sales_quantitystdev,
             Stddev_samp(ss_quantity) / Avg(ss_quantity) AS store_sales_quantitycov,
             Count(sr_return_quantity) as_store_returns_quantitycount,
             Avg(sr_return_quantity) as_store_returns_quantityave,
             Stddev_samp(sr_return_quantity) as_store_returns_quantitystdev,
             Stddev_samp(sr_return_quantity) / Avg(sr_return_quantity) AS store_returns_quantitycov,
             Count(cs_quantity) AS catalog_sales_quantitycount,
             Avg(cs_quantity) AS catalog_sales_quantityave,
             Stddev_samp(cs_quantity) / Avg(cs_quantity) AS catalog_sales_quantitystdev,
             Stddev_samp(cs_quantity) / Avg(cs_quantity) AS catalog_sales_quantitycov
      FROM store_sales, store_returns, catalog_sales,
           date_dim d1, date_dim d2, date_dim d3, store, item
      WHERE d1.d_quarter_name = '2000Q1'
        AND d1.d_date_sk = store_sales.ss_sold_date_sk
        AND ss_sold_date BETWEEN '2000-01-01' AND '2000-03-31'
        AND item.i_item_sk = store_sales.ss_item_sk
        AND store.s_store_sk = store_sales.ss_store_sk
        AND store_sales.ss_customer_sk = store_returns.sr_customer_sk
        AND store_sales.ss_item_sk = store_returns.sr_item_sk
        AND store_sales.ss_ticket_number = store_returns.sr_ticket_number
        AND store_returns.sr_returned_date_sk = d2.d_date_sk
        AND d2.d_quarter_name IN ( '2000Q1', '2000Q2', '2000Q3' )
        AND sr_returned_date BETWEEN '2000-01-01' AND '2000-09-01'
        AND store_returns.sr_customer_sk = catalog_sales.cs_bill_customer_sk
        AND store_returns.sr_item_sk = catalog_sales.cs_item_sk
        AND catalog_sales.cs_sold_date_sk = d3.d_date_sk
        AND d3.d_quarter_name IN ( '2000Q1', '2000Q2', '2000Q3' )
        AND cs_sold_date BETWEEN '2000-01-01' AND '2000-09-31'
      GROUP BY i_item_id, i_item_desc, s_state
      ORDER BY i_item_id, i_item_desc, s_state
      LIMIT 100;
  22. 22. CBO on Selected Queries – 17
      [Diagram: join graph for Query 17. Fact tables store_sales, store_returns and catalog_sales each carry a date filter. Dimension tables date_dim d1, date_dim d2 and date_dim d3 each carry a quarter filter. The item and store tables are also joined. Join keys shown: customer_sk, ticket_number, item_sk, date_sk, store_sk]
  23. 23. OLD: Left Deep Plan
      Large Fact tables joined together without filters
      • Map 1: Table_scan d1, filter
      • Map 2: Table_scan catalog_sales
      • Map 6: Table_scan d2, filter
      • Map 7: Table_scan d3, filter
      • Map 8: Table_scan store
      • Map 9: Table_scan store_sales
      • Map 11: Table_scan item
      • Map 12: Table_scan Store_returns
      • Reducer 10: Merge join 12, 9
      • Reducer 3: Merge join 2 & 10; Map join 1; Map join 6; Map Join 7; Map Join 8 store; Map Join 11 item; Filter; Group By; Reduce
      • Reducer 4: Group_By, Reduce
      • Reducer 5: Limit
  24. 24. NEW: Complex Bushy Plan
      All 3 Large Fact tables joined with date dimension limiting data to few quarters
      • Map 1: Table_scan d1, filter
      • Map 2: Table_scan d1, filter
      • Map 9: Table_scan d1, filter
      • Map 3: Store_sales, Map join
      • Map 8: Store_returns, Map join
      • Map 11: catalog_sales, Map Join
      • Map 10: table_scan store
      • Map 12: Table_scan item
      • Reducer 4: Merge join 3 & 8; Map join store; Map join item; Reduce
      • Reducer 5: Merge_Join, Group_By, Reduce
      • Reducer 6: Group by, Reduce
      • Reducer 7: Limit
  25. 25. Performance Improvement – Query 17
      Scale = 30TB, Input records ~186mil
      CBO | Elapsed Time (sec) | Elapsed Time | Intermediate data (GB) | Output and Intermediate Records
      OFF | 10,683             | ~3 hrs       | 5,017                  | 135,647,792,123
      ON  | 1,284              | ~20 mins     | 275                    | 8,543,232,360
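The slides do not show the settings used for the CBO ON and OFF runs on slide 25. As a hedged sketch, the HiveQL below shows how the cost based optimizer is typically enabled in Hive 0.14 and how statistics are gathered for it, using the `item` and `store_sales` tables from Query 17. The shortened query under `EXPLAIN` is illustrative only and is not part of the benchmark.

```sql
-- Turn on the cost based optimizer and let it read statistics
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;
SET hive.stats.fetch.partition.stats=true;

-- Gather table-level and column-level statistics for a dimension table;
-- without statistics the optimizer has little to base its costs on
ANALYZE TABLE item COMPUTE STATISTICS;
ANALYZE TABLE item COMPUTE STATISTICS FOR COLUMNS i_item_sk, i_item_id, i_item_desc;

-- Inspect the plan that the optimizer chose (illustrative two-table join)
EXPLAIN
SELECT i_item_id,
       Count(ss_quantity) AS store_sales_quantitycount
FROM store_sales, item
WHERE item.i_item_sk = store_sales.ss_item_sk
GROUP BY i_item_id;
```

Running the same `EXPLAIN` with `hive.cbo.enable=false` is one way to compare join orders, in the spirit of the old and new plans on slides 23 and 24.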
  26. 26. Scale in Hive 0.14: Dynamic Query Optimization
  27. 27. Auto Reducer Parallelism
      Use dynamic data volume during execution, rather than estimates from query compilation, to determine the number of reducers.
      Leads to faster query execution and better resource utilization.
      [Diagram: App Master containing the Vertex Manager and Vertex State Machine, with tasks for a single map vertex and tasks for a single reduce vertex shown over time. Steps: 1. Data size statistics; 2. Set parallelism; 3. Re-route; 4. Cancel task; 5. Tasks Completed; 6. Start Tasks; 7. Start]
  28. 28. Auto Reducer Parallelism
      use tpcds_bin_partitioned_orc_30000;
      set hive.tez.auto.reducer.parallelism=true;
      set hive.tez.min.partition.factor=0.125;
      SELECT ss_promo_sk,
             Sum(ss_sales_price),
             Count(*)
      FROM store_sales
      WHERE ss_sold_date < '1998-03-01'
      GROUP BY ss_promo_sk
      ORDER BY 2 DESC
      LIMIT 10;
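To make the two settings on slide 28 easier to read, the sketch below places them next to the related Hive properties that bound the reducer count. Only `hive.tez.auto.reducer.parallelism` and `hive.tez.min.partition.factor` come from the slide. The other properties are standard Hive settings, and the values given for them are the documented Hive 0.14 defaults, not values taken from the webinar.

```sql
-- Auto reducer parallelism only applies when Hive runs on Tez
SET hive.execution.engine=tez;
SET hive.tez.auto.reducer.parallelism=true;

-- Compile-time estimate: roughly one reducer per this many input bytes
SET hive.exec.reducers.bytes.per.reducer=268435456;

-- Upper bound: Hive starts with up to 2x the estimated reducer count
SET hive.tez.max.partition.factor=2.0;

-- Lower bound: at runtime Tez may shrink to 0.125x the estimate
SET hive.tez.min.partition.factor=0.125;
```

Under these assumptions, a compile-time estimate of 100 reducers would start with up to 200 partitions and could be reduced to roughly 12 once the map-side data size statistics arrive.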
  29. 29. Dynamic Partition Pruning
      Table Definition
      create table store_sales (...) partitioned by (ss_sold_date_sk int) stored as orc;
      Example: Join of
      • a large Fact table with multiple partitions
      • with a dimension table that has a filter
      The ss_sold_date_sk partitions that can be pruned away at join time are not known until the filter is applied at runtime.
      Compile Time Design
      • Insert synthetic conditions for each join representing "x in (keys of other side in join)". Optimizer will push it as far down as possible
      • If the condition hits a table scan and the column involved is a partition column:
        • Setup Operator to send key events to AM
      • else:
        • Remove synthetic predicate
      [Diagram: store_sales partitions d1, d2, d3, d4, … joined on ss_sold_date_sk = date_sk with a filtered date_dim d1. In the App Master, the Vertex Manager and Vertex State Machine sit between the tasks for two map vertices. Step 1: Send events for partition pruning]
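The design on slide 29 can be pictured with a small query. The sketch below is an assumption-laden illustration rather than anything shown in the webinar: `hive.tez.dynamic.partition.pruning` is the standard Hive property for this feature, and the query reuses the `store_sales` and `date_dim` tables and columns from the earlier slides. The commented-out predicate only mimics the synthetic condition. Hive inserts it internally and the user does not write it.

```sql
SET hive.execution.engine=tez;
SET hive.tez.dynamic.partition.pruning=true;

-- Query as the user writes it: the filter is on the dimension table only
SELECT Sum(ss_sales_price)
FROM store_sales
JOIN date_dim d1
  ON store_sales.ss_sold_date_sk = d1.d_date_sk
WHERE d1.d_quarter_name = '2000Q1';

-- Conceptual effect of the synthetic "x in (keys of other side in join)"
-- condition once it reaches the scan of the partitioned fact table:
--   AND store_sales.ss_sold_date_sk IN
--       (SELECT d_date_sk FROM date_dim WHERE d_quarter_name = '2000Q1')
```

Because `ss_sold_date_sk` is the partition column, the keys produced by the filtered `date_dim` scan can be sent to the application master at runtime, so only matching partitions of `store_sales` are read.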
  30. 30. Dynamic Pruning: TPC-DS Query 3
      SELECT dt.d_year,
             item.i_brand_id brand_id,
             item.i_brand brand,
             Sum(ss_ext_sales_price) sum_agg
      FROM date_dim dt, store_sales, item
      WHERE dt.d_date_sk = store_sales.ss_sold_date_sk
        AND store_sales.ss_item_sk = item.i_item_sk
        AND item.i_manufact_id = 436
        AND dt.d_moy = 12
      GROUP BY dt.d_year, item.i_brand, item.i_brand_id
      ORDER BY dt.d_year, sum_agg DESC, brand_id
      LIMIT 100;
  31. 31. Stinger.next: The Road Ahead
  32. 32. Stinger.next - Delivery Themes
      Beyond Read-Only (2nd Half 2014)
      • Transactions with ACID allowing insert, update and delete
      • Temporary Tables
      • Cost Based Optimizer optimizes star and bushy join queries
      Sub-Second (1st Half 2015)
      • Sub-Second queries with LLAP
      • Hive-Spark Machine Learning integration
      • Operational reporting with Hive Streaming Ingest and Transactions
      • Replication and SQL/CBO improvements
      Richer Analytics (2nd Half 2015)
      • Toward SQL:2011 Analytics
      • Materialized Views
      • Cross-Geo Queries
      • Workload Management via YARN and LLAP integration
  33. 33. Q & A
  34. 34. Thank you!
      Learn more at: hortonworks.com/hadoop/hive/
      Register for the remaining 6 Discover HDP 2.2 Webinars: Hortonworks.com/webinars
