An Apache Hive Based Data Warehouse

Using Apache Hadoop and related technologies as a data warehouse has been an area of interest since the early days of Hadoop. In recent years Hive has made great strides towards enabling data warehousing by expanding its SQL coverage, adding transactions, and enabling sub-second queries with LLAP. But data warehousing requires more than a full-powered SQL engine. Security, governance, data movement, workload management, monitoring, and user tools are required as well. These functions are being addressed by other Apache projects such as Ranger, Atlas, Falcon, Ambari, and Zeppelin. This talk will examine how these projects can be assembled to build a data warehousing solution. It will also discuss features and performance work going on in Hive and the other projects that will enable more data warehousing use cases. These include data ingestion using merge, support for OLAP cubing queries via Hive’s integration with Druid, expanded SQL coverage, replication of data between data warehouses, advanced access control options, data discovery, and user tools to manage, monitor, and query the warehouse.

Speaker
Alan Gates, Co-founder, Hortonworks

Published in: Technology

  1. Scalable Data Warehousing on Hadoop. Alan F. Gates, Co-founder, Hortonworks
  2. What Do You Expect in a Hadoop Data Warehouse? Benchmarks focus on two questions: – How much of the TPC-DS query set can it run? – How fast can it run it? (© Hortonworks Inc. 2011 – 2016. All Rights Reserved)
  3. What Do You Expect in a Data Warehouse? • High Performance SQL 2011 • High Storage Capacity • Security • Support for BI, Cubes, Data Science • Monitoring & Management • Governance • Data Lifecycle Management • Replication & D/R • Workload Management • Data Ingestion
  4. So, back to TPC-DS... (High Performance SQL 2011)
  5. Apache Hive Overview. Apache Hive is a SQL data warehouse engine that delivers fast, scalable SQL processing on Hadoop and in the cloud. Features: • Extensive SQL:2011 Support • ACID Transactions • In-Memory Caching • Cost-Based Optimizer • User-Based Dynamic Security • JDBC and ODBC Support • Compatible with every major BI Tool • Proven at 300+ PB Scale
  6. Apache Hive: Fast Facts
     • Most Queries Per Hour: 100,000 queries per hour (Yahoo Japan)
     • Analytics Performance: 100 million rows/s per node (with Hive LLAP)
     • Largest Hive Warehouse: 300+ PB raw storage (Facebook)
     • Largest Cluster: 4,500+ nodes (Yahoo)
  7. Apache Hive: Journey to SQL:2011 Analytics
     • Data Types: Numeric; FLOAT, DOUBLE; DECIMAL; INT, TINYINT, SMALLINT, BIGINT; BOOLEAN; String; CHAR, VARCHAR; BLOB (BINARY), CLOB (String); Date, Time; DATE, TIMESTAMP, Interval Types; Complex Types; ARRAY / MAP / STRUCT / UNION; Nested Data Analytics; Nested Data Traversal; Lateral Views; Procedural Extensions; HPL/SQL
     • SQL Features: Core SQL Features; Date, Time and Arithmetical Functions; INNER, OUTER, CROSS and SEMI Joins; Derived Table Subqueries; Correlated + Uncorrelated Subqueries; UNION ALL; UDFs, UDAFs, UDTFs; Common Table Expressions; UNION DISTINCT; Advanced Analytics; OLAP and Windowing Functions; OLAP: Partition, Order by UDAF; CUBE and Grouping Sets; ACID Transactions; INSERT / UPDATE / DELETE; Constraints; Primary / Foreign Key (Non Validated)
     • File Formats: Columnar; ORCFile; Parquet; Text; CSV; Logfile; Nested / Complex; Avro; JSON; XML; Custom Formats; Other Features; XPath Analytics
     • Hive 2 (legend: New, Future work): ACID MERGE; Multi Subquery; Scalar Subqueries; Non-Equijoins; INTERSECT / EXCEPT; Recursive CTEs; NOT NULL Constraints; Default Values; Multi-statement Transactions
     • Track Hive SQL:2011 Complete: HIVE-13554
  8. Hive 2 with LLAP: Architecture Overview. [Diagram labels: SQL queries over ODBC / JDBC; HiveServer2 (query endpoint); query coordinators; a YARN cluster of LLAP daemons, each with query executors; an in-memory cache shared across all users; deep storage on HDFS and compatible stores (S3, WASB, Isilon).]
  9. Hive 2 with LLAP: 25+x Performance Boost: Interactive / 1TB Scale. [Chart: query time in seconds (lower is better) for Hive 1 / Tez and Hive 2 / LLAP, with the speedup (x factor) for each query.] Hive 2 with LLAP averages 26x faster than Hive 1.
  10. Apache Hive vs. Apache Impala at 10TB. Highlights: • 10TB scale on 10 identical AWS nodes. • Hive and Impala showed similar times on most smaller queries. • Hive scaled better, with many queries completing in <2m where Impala ran to timeout (3000s).
  11. Apache Hive vs. Presto on a partitioned 1TB dataset. Highlights: • Presto lacks basic performance optimizations like dynamic partition pruning. • On a real dataset / workload Presto performs poorly without full rewrites. • Example: Query 55 without rewrites = 185.17s, with rewrites = 16s. LLAP = 1.37s.
  12. Hive LLAP: Stable Performance under High Concurrency
      Concurrent Queries | Average Runtime
      5 | 7.76s
      25 | 36.24s
      100 | 102.89s
      • 5x Queries (5 to 25): 4.6x Runtime Difference
      • 4x Queries (25 to 100): 2.8x Runtime Difference
  13. How Much Can It Hold, and Where? (High Storage Capacity)
  14. Storage • Of course HDFS, default in the Hadoop world • More and more cloud • Move is copy in S3, but current implementation assumes move is atomic and nearly free – modifying Hadoop (HADOOP-11694) and Hive (HIVE-14535) • ACID in the cloud – Compactor moves a lot of files around, need to optimize – Need to figure out how streaming ingest works in the cloud • LLAP, caching much more valuable in the cloud – Looking at flushing cache to SSD so misses are less costly
  15. Is My Data Safe? (Security)
  16. Security today in Hadoop: Centralized Security Administration w/ Ranger & Knox
      • Authentication (Who am I/prove it?): Kerberos; API security with Apache Knox
      • Authorization (What can I do?): Fine grain access control with Apache Ranger
      • Audit (What did I do?): Centralized audit reporting w/ Apache Ranger
      • Data Protection (Can data be encrypted at rest and over the wire?): Wire encryption; HDFS encryption + Ranger KMS
  17. Authentication: API Security with Knox. Apache Knox extends the reach of Hadoop REST API without Kerberos complexities. Integrated with existing IdM systems. Single, simple point of access for a cluster. Centralized and consistent secure API across one or more clusters. • Eliminates SSH “edge node” • Central API management • Central audit control • Service level authorization • SSO - SAMLv2, Siteminder and OAM • LDAP and AD integration • SSO for Hadoop UIs (Ranger, Ambari..) • Kerberos Encapsulation • Single Hadoop access point • REST API hierarchy • Consolidated API calls • Multi-cluster support
  18. Apache Ranger: Per-User Row Filtering by Region in Hive (LLAP Data Access). Query rewrites are based on dynamic Ranger policies.
      User ID | Region | Total Spend
      1 | East | 5,131
      2 | East | 27,828
      3 | West | 55,493
      4 | West | 7,193
      5 | East | 18,193
      • Original query: SELECT * FROM customers WHERE total_spend > 10000
      • Dynamic rewrite for User 2 (East Region): SELECT * FROM customers WHERE total_spend > 10000 AND region = 'East'
      • Dynamic rewrite for User 1 (West Region): SELECT * FROM customers WHERE total_spend > 10000 AND region = 'West'
  19. Apache Ranger: Dynamic Data Masking of Hive Columns. Protect sensitive data in real time with dynamic data masking/obfuscation. Goal: mask or anonymize sensitive columns of data (e.g. PII, PCI, PHI) from Hive query output. • Benefits – Sensitive information never leaves database – No changes are required at the application or Hive layer – No need to produce additional protected duplicate versions of datasets – Simple & easy to set up masking policies • Core Technologies: Ranger, Hive [Diagram components: Ranger, Atlas, Hive.]
  20. Dynamic Tag-based Access Policies with Apache Atlas • Basic Tag policy – PII example. Access and entitlements must be tag-based ABAC and scalable in implementation. • Geo-based policy – Policy based on IP address; proxy IP substitution may be required. The rule enforcement must be geo-aware. • Time-based policy – Timer for data access, decoupled from deletion of data. • Prohibitions – Prevention of combination of Hive tables that may pose a risk together. Key Benefits: • New scalable metadata-based security paradigm • Dynamic, real-time policy • Active protection – fast updates to changes • Centralized and simple to manage policy
  21. What’s There and Where Did It Come From? (Governance)
  22. Apache Atlas: Cross-Component Dataset Lineage. [Diagram labels: Sqoop, Teradata Connector, Apache Kafka, Custom Activity Reporter, RDBMS, Metadata Repository.] • Any process using Sqoop is covered • No other tool tracks IoT out of the box
  23. Apache Atlas Enables Business Catalog for Ease of Use • Organize data assets along business terms – Authoritative: Hierarchical Taxonomy Creation – Agile modeling: Model Conceptual, Logical, Physical assets – Definition and assignment of tags like PII (Personally Identifiable Information) • Comprehensive features for compliance – Multiple user profiles including Data Steward and Business Analysts – Object auditing to track “who did it” – Metadata versioning to track “what did they do” • Faster Insight – Data Quality tab for profiling and sampling – User Comments. Key Benefits: • Organize data assets along business terms • Compliance Features • Faster Insight
  24. How Will My Users Interact With It? (Support for BI, Cubes, Data Science)
  25. Druid: Deep Multidimensional Analytics. • Real-Time Analytics • Ultra-Fast Analytics • Slice-and-Dice • Deep, Fast Drilldown Across Any Dimension • Scalably Ingest Historical Data from Transactional and Web Systems [Diagram labels: events, logs, transactions, sensors; streaming sources (Storm, Kafka, Spark); historical sources (HDFS, S3); Druid data cubes; Hive / Spark, BI tools, REST API, Superset UI; a legend marks some items as future.]
  26. Druid’s Role in Scalable Data Warehousing. [Diagram labels: UI layer with BI tools (SQL), MDX tools (MDX), and Superset UI (fast exploration); unified SQL and MDX layer; HiveServer2 with Hive SQL and fast SQL; Thrift Server with SparkSQL; Druid OLAP indexes; realtime feeds (Kafka, Storm, etc.); core platform on S3 or HDFS; management with Ranger, Atlas, and Ambari.]
  27. Analytics at Scale with No Data Movement • Source Data Systems • Syncsort: High-Performance Data Movement • Hadoop: Scalable Storage and Compute • Hive LLAP: High Performance SQL • AtScale Intelligence Platform: OLAP Cubes for Higher Performance. Highlights: • Fast, scalable SQL analytics • Intelligent in-memory caching • Define OLAP cubes for 10x faster queries • Unified semantic layer for all BI tools • High performance data import from all major EDW platforms • Pre-aggregated data ... or full-fidelity data
  28. Spark Column Security with LLAP • Fine-grained column-level access control for SparkSQL. • Fully dynamic policies per user. Doesn’t require views. • Use standard Ranger policies and tools to control access and masking policies. Flow: 1. SparkSQL gets data locations known as “splits” from HiveServer2 and plans query. 2. HiveServer2 authorizes access using Ranger. Per-user policies like row filtering are applied. 3. Spark gets a modified query plan based on dynamic security policy. 4. Spark reads data from LLAP. Filtering / masking guaranteed by LLAP server. [Diagram labels: Spark Client; HiveServer2 (authorization); Hive Metastore (data locations, view definitions); LLAP (data read, filter pushdown); Ranger Server (dynamic policies).]
  29. Apache Zeppelin, Attaches to Hive and Spark
  30. But Wait, There’s More • Monitoring & Management • Data Lifecycle Management • Replication & D/R • Data Ingestion
  31. Scalable Data Warehousing on Hadoop
      Capabilities and applications:
      • Batch SQL: ETL; Reporting; Data Mining; Deep Analytics
      • OLAP / Cube: Multidimensional Analytics; MDX Tools; Excel
      • Interactive SQL: Reporting; BI Tools: Tableau, Microstrategy, Cognos
      • Sub-Second SQL: Ad-Hoc; Drill-Down; BI Tools: Tableau, Excel
      • ACID / MERGE: Continuous Ingestion from Operational DBMS; Slowly Changing Dimensions
      Platform layers:
      • Core Platform: Scale-Out Storage; Petabyte Scale Processing
      • Core SQL Engine: Apache Tez: Scalable Distributed Processing; Advanced Cost-Based Optimizer
      • Connectivity: JDBC / ODBC; MDX
      • Advanced Security
      • Comprehensive SQL:2011 Coverage
      Legend: Existing; Development; Emerging
  32. For More Details
      • Today: Running Zeppelin in Enterprise (3:10); Dancing Elephants: Efficiently Working with Object Stores from Apache Spark and Hive (4:20); Open Metadata and Governance with Apache Atlas (5:10); LLAP: Building Cloud First BI (5:50pm)
      • Tomorrow: Interactive Analytics At Scale in Apache Hive Using Druid (9:00); Disaster Recovery and Cloud Migration for Your Apache Hive Warehouse (11:00); LLAP: Building Cloud-First BI (11:50); Treat Your Enterprise Data Lake Indigestion: Enterprise Ready Security and Governance (3:10); Birds of a Feather Session for Hive and HBase (6:00)
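
The per-user row filtering on slide 18 can be illustrated with a minimal Python sketch. This is not how Ranger or Hive implement the rewrite; it only mimics the visible effect of appending a policy predicate to a simple single-table query. The `ROW_FILTERS` dictionary, the user names, and `apply_row_filter` are hypothetical names introduced for illustration.

```python
import re

# Hypothetical per-user row-filter policies. On the slide these come from
# Ranger; here they are a plain dictionary keyed by user name.
ROW_FILTERS = {
    "user1": "region = 'West'",
    "user2": "region = 'East'",
}


def apply_row_filter(query: str, user: str) -> str:
    """Append the user's row-filter predicate to a simple single-table SELECT.

    Naive string handling: assumes no subqueries, GROUP BY, ORDER BY or LIMIT.
    """
    predicate = ROW_FILTERS.get(user)
    if predicate is None:
        return query  # no policy for this user, so the query is unchanged
    query = query.rstrip().rstrip(";")
    if re.search(r"\bwhere\b", query, flags=re.IGNORECASE):
        return f"{query} AND {predicate}"
    return f"{query} WHERE {predicate}"


original = "SELECT * FROM customers WHERE total_spend > 10000"
print(apply_row_filter(original, "user2"))
print(apply_row_filter(original, "user1"))
```

A real engine would apply the filter to the parsed query plan rather than to the SQL text, which is why the sketch is limited to the simple query shape shown on the slide.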

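The concurrency figures on slide 12 can be checked with a few lines of arithmetic. The sketch below uses only the three runtimes listed on that slide; the `scaling` helper is a hypothetical name introduced here.

```python
# Average runtimes in seconds by number of concurrent queries (slide 12).
runtimes = {5: 7.76, 25: 36.24, 100: 102.89}


def scaling(low: int, high: int) -> tuple[float, float]:
    """Return (query-count ratio, runtime ratio) between two concurrency levels."""
    return high / low, runtimes[high] / runtimes[low]


for low, high in [(5, 25), (25, 100)]:
    q_ratio, t_ratio = scaling(low, high)
    print(f"{low} -> {high}: {q_ratio:.0f}x queries, {t_ratio:.2f}x runtime")
```

In both steps the runtime ratio stays below the query-count ratio (about 4.67x for 5x the queries, and about 2.84x for 4x the queries), in line with the 4.6x and 2.8x quoted on the slide.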