In Search of Database Nirvana: Challenges of Delivering HTAP

Rohit Jain (Esgyn)

Customers are looking for one database engine to address all their varied needs, from transactional to analytical workloads, against structured, semi-structured, and unstructured data. Gartner’s term Hybrid Transactional/Analytical Processing (HTAP) perhaps comes closest to describing this nirvana. But can it be achieved? The motivation of this talk is to establish a framework for assessing the maturity and capabilities of query engines on Apache Hadoop ecosystem storage engines, such as HBase, in meeting these diverse needs.

Published in: Software

  1. 1. In search of database nirvana: The challenges of delivering Hybrid Transactional and Analytical Processing. Rohit Jain, CTO, rohit.jain@esgyn.com. (C) Copyright 2015 Esgyn Corporation, Esgyn Confidential
  2. 2. Agenda
     • The swinging database pendulum
     • Hybrid Transaction and Analytical Processing (HTAP) Workloads
     • Query versus storage engines
     • The challenges of HTAP
       ◦ Single query engine for all workloads
       ◦ Supporting multiple storage engines
       ◦ Same data model for all workloads
       ◦ Enterprise-caliber capabilities
     • Conclusion
  3. 3. The swinging database pendulum
     RDBMS, NoSQL
     • TCO
     • Elastic scalability
     • High performance
     • Semi-structured & unstructured data
     • Parallelization of user code
     • Schema flexibility
     • Modest needs
     Polyglot programming & persistence
     • graph database
     • document stores
     • text search
     • column stores
     • key value stores
     • wide column stores
     • Too many languages, interfaces, APIs, & data structures
     • Too much gluing of technologies together
     • Compatibility between different versions
     • No end-to-end view of workload performance
     • Support contracts with multiple vendors
     • Too many skills required to develop and manage
     • Too much data movement
     • No single solution for varied interfaces & use cases
     SQL
     • Skills prevalent
     • Existing tools & applications
     • Transaction support useful
     • More efficient when joins needed
     • Easier than coding M/R
     • Merit in rigor of pre-defining columns
     • Uniform metadata across applications
  4. 4. Hybrid Transaction and Analytical Processing (HTAP) Workloads
     OLTP
     • Mostly transactional
     • Sub-second response
     • Customer experience
     • Large update volume
     • High concurrency
     • Scales linearly
     • Normalized data model
     • Custom applications or 3rd party solutions
     • Mostly SMP; MPP for web-scale
     • Keyed updates/queries
     ODS
     • Can be transactional
     • Sub-second to seconds
     • Customer experience or Business internal
     • Batch to streaming feeds from OLTP
     • Low update volume
     • Low concurrency if internal, high otherwise
     • Near linear scale
     • Historical data
     • Normalized data model
     • Custom apps / 3rd party
     • Keyed queries
     BI
     • Non-transactional
     • Seconds to minutes
     • Business internal
     • Batch to streaming feeds from OLTP/ODS
     • No direct updates
     • Low to high concurrency
     • Less linear in scale
     • Historical data
     • Dimension data model
     • BI tools – reporting & dashboards
     • Ad hoc & scheduled queries and large extracts
     Analytics
     • Non-transactional
     • Minutes to hours
     • Business internal
     • Batch/aggregates from BI
     • No direct updates
     • Low concurrency
     • Complex queries, non-linear scale
     • Historical & big data
     • Columnar store
     • Analytics in database
     • Analytical tools
     • Ad hoc queries
     Slide captions: Essential to operate the business; To improve performance of the company
  5. 5. Query versus storage engines
     [Diagram: Hadoop cluster with two switches; Operational, Business Intelligence, and Analytics workloads; Single Query Engine; In-memory]
     Query Engine
     • Allow clients to connect & submit queries
     • Distribute connections across cluster
     • Compile query
     • Execute query
     • Return results of query to client
     Storage Engine
     • Storage structure
     • Partitioning
     • Automatic data repartitioning
     • Select columns
     • Select rows based on predicates
     • Caching writes and reads
     • Clustering by key
     • Fast access paths or filtering
     • Transactional support
     • Replication
     • Compression & Encryption
     • Mixed workload support
     • Bulk data ingest/extract
     • Indexing
     • Colocation or node locality
     • Data Governance
     • Security
     • Disaster recovery
     • Backup, Archive, Restore
     • Multi-temperature data support
  6. 6. The challenges of HTAP: Single query engine for all workloads
     • Data structure – key support, clustering, partitioning
     • Statistics
     • Predicates on non-leading or non-key columns
     • Indexes and materialized views
     • Degree of parallelism
     • Reducing the search space
     • Join type
     • Data flow and access
     • Mixed Workload
     • Feature support
     Slide labels: 80 minutes; 2 minutes; Equal-height histograms
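To make the "Equal-height histograms" label concrete, here is a minimal Python sketch of how an optimizer might build one and use it to estimate the selectivity of a range predicate. The function names, the bucket count, and the linear interpolation inside a bucket are assumptions for illustration; the slides do not describe the actual implementation.

```python
def build_equal_height(values, buckets):
    """Return (minimum, upper_bounds): each bucket covers about the same row count."""
    data = sorted(values)
    n = len(data)
    if n == 0 or buckets <= 0:
        raise ValueError("need at least one value and one bucket")
    # Upper boundary of bucket i is the value at the end of its share of rows.
    bounds = [data[max(0, min(n - 1, (i + 1) * n // buckets - 1))]
              for i in range(buckets)]
    return data[0], bounds


def estimate_le(hist, x):
    """Estimate the fraction of rows with value <= x."""
    low, bounds = hist
    if x < low:
        return 0.0
    share = 1.0 / len(bounds)  # every bucket holds the same fraction of rows
    total = 0.0
    prev = low
    for upper in bounds:
        if x >= upper:
            total += share  # the whole bucket qualifies
        else:
            if upper > prev:
                # Assume values are spread evenly inside the partial bucket.
                total += share * (x - prev) / (upper - prev)
            break
        prev = upper
    return total


hist = build_equal_height(range(1, 101), buckets=4)
print(estimate_le(hist, 50))
```

Because every bucket represents the same number of rows, skewed data produces narrow buckets where values are dense, which is what lets the estimate stay reasonable without storing per-value counts.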
  7. 7. The challenges of HTAP: Single query engine for all workloads
     Week        Item  Store  …
     01/07/2016  1     1      …
     01/07/2016  1     3      …
     01/07/2016  1     5      …
     01/07/2016  2     34     …
     01/07/2016  3     13     …
     01/07/2016  3     3      …
     01/07/2016  4     2      …
     01/07/2016  4     4      …
     01/14/2016  1     2      …
     01/14/2016  1     4      …
     01/14/2016  1     5      …
     01/14/2016  1     35     …
     01/14/2016  3     1      …
     01/14/2016  3     20     …
     Where is item = 1, Stores 2 through 5?
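The question on this slide has no predicate on the leading key column (Week). One way an engine can still avoid a full scan is to probe once per distinct leading value and jump between them. The Python sketch below illustrates that idea on the slide's rows; the function name and the use of ISO-formatted dates are assumptions, and the slides do not say which technique the engine actually uses.

```python
from bisect import bisect_left, bisect_right

# Rows from the slide as (week, item, store), sorted by the composite key.
# Dates are written in ISO form so that string order matches date order.
ROWS = sorted([
    ("2016-01-07", 1, 1), ("2016-01-07", 1, 3), ("2016-01-07", 1, 5),
    ("2016-01-07", 2, 34), ("2016-01-07", 3, 13), ("2016-01-07", 3, 3),
    ("2016-01-07", 4, 2), ("2016-01-07", 4, 4),
    ("2016-01-14", 1, 2), ("2016-01-14", 1, 4), ("2016-01-14", 1, 5),
    ("2016-01-14", 1, 35), ("2016-01-14", 3, 1), ("2016-01-14", 3, 20),
])


def skip_scan(rows, item, store_lo, store_hi):
    """Find rows with the given item and store range without reading every row."""
    matches = []
    pos = 0
    while pos < len(rows):
        week = rows[pos][0]  # next distinct value of the leading key column
        # Binary-search the (item, store) range inside this week.
        lo = bisect_left(rows, (week, item, store_lo), pos)
        hi = bisect_right(rows, (week, item, store_hi), lo)
        matches.extend(rows[lo:hi])
        # Jump past the remaining rows of this week.
        pos = bisect_right(rows, (week, float("inf")), hi)
    return matches


for row in skip_scan(ROWS, item=1, store_lo=2, store_hi=5):
    print(row)
```

The cost grows with the number of distinct weeks rather than the number of rows, so this pays off when the leading column has few distinct values.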
  8. 8. The challenges of HTAP: Single query engine for all workloads
     Serial vs parallel plans
     [Diagram: a client application connected to Master processes on Node 1, Node 2, through Node n; a multi-fragment plan with ESPs; HBase Regions 1 through 5 with filters and coprocessors, each on HDFS; Ethernet interconnect]
  9. 9. The challenges of HTAP: Single query engine for all workloads
     [Figure: queries Qry1 through Qry7]
  10. 10. The challenges of HTAP: Single query engine for all workloads
     Adaptive and parallel joins
     • Nested join
     • Probe cache for nested join
     • Merge join
     • Matching partition join
     • Repartitioned hash join
     • Replication by broadcast hash join
     • Inner / outer child broadcast
     • Dimensional schema star join
     • Inner join
     • Left Join
     • Right Join
     • Full Outer Join
     • Self join
     Cost Premiums for nested joins or serial plans
  11. 11. The challenges of HTAP: Single query engine for all workloads
     [Diagram labels: Compute Cost; Execution Environment; Physical Properties; Estimates; Confidence; Cardinality, Distribution, Correlation; Sensitivity To Estimates; Evaluate Risk; Risk Adjustment; Benefit; Risk]
     Risk Premiums
     • Nested join 20%
     • Merge join 10%
     • Serial plan 5%
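As a rough illustration of how the risk premiums listed on this slide could influence plan choice, the Python sketch below inflates a plan's estimated cost by the premium of each risky trait it uses. Only the three percentages come from the slide; applying them multiplicatively, the trait names, and the example plan costs are assumptions.

```python
# Premiums from the slide; the dictionary keys are hypothetical trait names.
RISK_PREMIUM = {
    "nested_join": 0.20,
    "merge_join": 0.10,
    "serial_plan": 0.05,
}


def risk_adjusted_cost(base_cost, traits):
    """Inflate the estimated cost by the premium of each risky trait (assumed multiplicative)."""
    factor = 1.0
    for trait in traits:
        factor *= 1.0 + RISK_PREMIUM.get(trait, 0.0)
    return base_cost * factor


# Two made-up candidate plans: (estimated cost, traits that carry a premium).
PLANS = {
    "nested join, serial": (100.0, ["nested_join", "serial_plan"]),
    "hash join, parallel": (120.0, []),
}

best = min(PLANS, key=lambda name: risk_adjusted_cost(*PLANS[name]))
print(best)
```

In this made-up case the nested, serial plan is cheaper on raw estimates (100 versus 120) but loses once its premiums raise it to about 126, which reflects the idea of preferring a plan that degrades less if the estimates turn out to be wrong.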
  12. 12. The challenges of HTAP: Single query engine for all workloads
     Mixed Workload
     • Priority / SLA based execution
     • Allocation of resources by service level
     • Decrease priority with usage increase
     • Anti-starvation / switch between queries based on priority
     [Diagram: Low, Medium, and High priority queries feeding queues in front of the memstores of HBase Region 1, HBase Region 3, and HBase Region 5]
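The anti-starvation bullet on this slide can be sketched with a priority queue that ages waiting work. In the Python sketch below, each dispatch improves the effective priority of every query still waiting, so a low-priority query eventually runs even under a steady stream of high-priority ones. The class name, the numeric levels, and the aging step are assumptions; the slide does not describe the actual scheduler, and the "decrease priority with usage increase" policy is not modeled here.

```python
import heapq

# Lower number runs first; the numeric levels are illustrative.
LEVELS = {"high": 0.0, "medium": 1.0, "low": 2.0}


class AgingScheduler:
    """Priority queue in which waiting queries gain priority on every dispatch."""

    def __init__(self, aging_step=0.5):
        self.aging_step = aging_step
        self._heap = []
        self._seq = 0  # arrival order breaks ties between equal priorities

    def submit(self, name, level):
        heapq.heappush(self._heap, (LEVELS[level], self._seq, name))
        self._seq += 1

    def next_query(self):
        _, _, name = heapq.heappop(self._heap)
        # Age everything still waiting so that nothing starves.
        self._heap = [(max(0.0, pri - self.aging_step), seq, queued)
                      for pri, seq, queued in self._heap]
        heapq.heapify(self._heap)
        return name

    def __len__(self):
        return len(self._heap)


sched = AgingScheduler()
sched.submit("L1", "low")
for i in range(1, 6):
    sched.submit(f"H{i}", "high")

order = []
while len(sched):
    order.append(sched.next_query())
print(order)
```

With strict priorities L1 would run last; with aging it overtakes the final high-priority query because it has waited longest.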
  13. 13. The challenges of HTAP: Supporting multiple storage engines
     • Statistics
     • Key structure
     • Partitioning
     • Data type support
     • Projection and selection
     • Extensibility
     • Security enforcement
     • Transaction Management
     • Metadata support
     • Performance, scale, and concurrency considerations
     • Error handling
     • Other operational aspects
     [Diagram: Single-Master; Multiple-Masters]
  14. 14. The challenges of HTAP: Same data model for all workloads
     Normal form
     • 1NF
     • 2NF
     • 3NF
     • BCNF
     • 4NF
     • 5NF
     • 6NF
     [Diagrams: Star Schema; Snowflake Schema; Normal Form]
     Query engine integration with storage engine(s) to support all these data models
  15. 15. The challenges of HTAP: Same data model for all workloads
     NoSQL Data Models, from “NoSQL Data Modeling Techniques” by Ilya Katsov, Highly Scalable Blog
     … and these!
  16. 16. The challenges of HTAP: Enterprise-caliber capabilities (High Availability, Security, Manageability)
     • Percentage of uptime: from 99.99% (52.56 minutes of downtime per year) to 99.999% (5.26 minutes per year)
     • Online operations (data available for reads and writes)
       ◦ Upgrading the OS
       ◦ Upgrading the file system
       ◦ Upgrading the storage engine
       ◦ Upgrading the query engine
       ◦ Redistributing data to accommodate node and/or disk expansions and contractions
       ◦ Changing table definitions, e.g. data type changes, and adding, dropping, renaming columns
       ◦ Creating/dropping secondary indexes
       ◦ Full and incremental backups
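The downtime figures on this slide follow from simple arithmetic on the number of minutes in a year. A small Python sketch, assuming a 365-day year (the function name is illustrative):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_pct):
    """Allowed downtime per year for a given availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


for pct in (99.99, 99.999):
    print(f"{pct}% uptime allows {downtime_minutes_per_year(pct):.2f} minutes of downtime per year")
```

Each additional "nine" cuts the allowed downtime by a factor of ten, which is why planned maintenance such as upgrades and schema changes has to happen online at these levels.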
  17. 17. The challenges of HTAP: Enterprise-caliber capabilities (High Availability, Security, Manageability)
     [Table of management capabilities, flattened in extraction:] Schema Management Performance Management Monitoring Security Management BAR Management Object Management Performance Monitoring Database Monitor User Management Backup Analysis Graphical Object Editor Live Performance Monitoring Event Monitoring Role Management Recovery Cross-Platform Schema Knowledge Data Repository Live Event Monitoring Account Migration Log Backup Bottleneck Analysis Threshold Alerts Audit Report Backup Reports SQL Management Job/Workload Analysis Health Index Alarm Archival Query Builder Job/Workload Wizard Live Health Monitoring Visual Difference Tool Job/Workload Management Response Times Maintenance Configuration Management Data Management Live Job/Workload Monitoring Alert Center Repository Aging OS Provisioning Data Migration OS Analysis Remote Monitoring Automated Maintenance Cluster Provisioning SQL Profiler Capacity Capture Central Monitoring Instance Provisioning Automated Import Capacity Trending Hardware Inventory Change Management Cloud Provisioning Visual Explain Plans Capacity Forecast Hardware Monitoring Schema Capture Configuration Editor Session Management Space Management Schema Compare and Synch Lock Management Reorganization Management Troubleshooting Notifications Process Management Query Cost Simulation Health Analysis Schema Rotation Consistency Checks Historical Reports Problem Correlation Collaboration Online Schema Evolution Bottleneck Tuning Automated Actions Virtual Changes Built-In Automation Access Path Analysis
  18. 18. The challenges of HTAP: Enterprise-caliber capabilities (High Availability, Security, Manageability)
     • Operational performance by transactions per second
     • Analytical performance by query
     • Overhead of gathering metrics on operational and analytical workloads
     • Configurable statistics collection
     • Workload management by Service Level Objectives
       ◦ Based on priority and/or resource allocation
       ◦ High priority operational workloads vs analytical workloads
     • End-to-end visibility of transaction and query metrics
     • Metric breakdown down to the query operation
     • Metrics for table access across workloads down to the partition level
     • Skew or bottlenecks
     • Integration with YARN
  19. 19. Conclusion
     • Pre-register for full O’Reilly report: http://www.oreilly.com/go/dbnirvana
     • It ain’t easy!!
     • Very few products can even come close
     • Any guesses?
