
Apache Phoenix

Jul. 1, 2021



  1. Apache Phoenix: Put the SQL back in NoSQL. Osama Hussein, March 2021
  2. Agenda ● History ● Overview ● Architecture ● Capabilities ● Code ● Scenarios
  3. 1. History: From open-source repo to top-level Apache project
  4. Overview (Apache Phoenix) ● Began as an internal project at Salesforce.com. ● JAN 2014: Originally open-sourced on GitHub. ● MAY 2014: Became a top-level Apache project.
  5. 2. Overview: UDFs, Transactions and Schema
  6. Overview (Apache Phoenix) ● Support for late-bound, schema-on-read data. ● SQL and JDBC API support. ● Access to data stored and produced in other components such as Apache Spark and Apache Hive. ● HBase was developed as part of Apache Hadoop and runs on top of the Hadoop Distributed File System (HDFS). ● HBase scales linearly and shards automatically.
  7. Overview (Apache Phoenix) ● Apache Phoenix is an add-on for Apache HBase that provides a programmatic ANSI SQL interface. ● Implements best-practice optimizations to enable software engineers to develop next-generation data-driven applications based on HBase. ● Create and interact with tables in the form of typical DDL/DML statements using the standard JDBC API.
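A minimal JDBC sketch of that workflow (the table name, columns, and the ZooKeeper quorum in the connection URL are illustrative assumptions, not taken from the deck):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class PhoenixJdbcExample {
        public static void main(String[] args) throws Exception {
            // Hypothetical quorum address; replace with your ZooKeeper hosts.
            try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost:2181")) {
                Statement stmt = conn.createStatement();
                // DDL and DML are plain SQL strings sent through standard JDBC.
                stmt.execute("CREATE TABLE IF NOT EXISTS metrics"
                        + " (host VARCHAR NOT NULL PRIMARY KEY, response_time BIGINT)");
                stmt.executeUpdate("UPSERT INTO metrics VALUES ('sf1.example.com', 42)");
                conn.commit(); // Phoenix buffers mutations until commit by default.
                ResultSet rs = stmt.executeQuery("SELECT host, response_time FROM metrics");
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
                }
            }
        }
    }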
  8. Overview (Apache Phoenix) ● Written in Java and SQL. ● Provides Atomicity, Consistency, Isolation and Durability (ACID) guarantees. ● Fully integrated with other Hadoop products such as Spark, Hive, Pig, Flume and MapReduce.
  9. Overview (Apache Phoenix) ● Included in: ○ Cloudera Data Platform 7.0 and above. ○ Hortonworks distribution for HDP 2.1 and above. ○ Available as part of Cloudera Labs. ○ Part of the Hadoop ecosystem.
  10. Overview (SQL Support) ● Compiles SQL into native HBase scans and orchestrates their execution. ● Produces a standard JDBC result set. ● All standard SQL query constructs are supported.
  11. Overview (SQL Support) ● Direct use of the HBase API, along with coprocessors and custom filters. ● Performance: ○ Milliseconds for small queries. ○ Seconds for tens of millions of rows.
  12. Overview (Bulk Loading) ● MapReduce-based: ○ CSV and JSON formats. ○ Via the Phoenix MapReduce library. ● Single-threaded: ○ CSV only. ○ Via the psql.py command-line utility (a PostgreSQL-style loader). ○ Suited to HBase on a local machine.
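As a sketch, the MapReduce CSV loader can also be driven programmatically through Hadoop's ToolRunner; the table name and HDFS path below are hypothetical, and the tool is more commonly launched with hadoop jar on the cluster:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.util.ToolRunner;
    import org.apache.phoenix.mapreduce.CsvBulkLoadTool;

    public class BulkLoadExample {
        public static void main(String[] args) throws Exception {
            // Hypothetical target table and HDFS input path.
            int exitCode = ToolRunner.run(new Configuration(),
                    new CsvBulkLoadTool(),
                    new String[] { "--table", "METRICS", "--input", "/data/metrics.csv" });
            System.exit(exitCode);
        }
    }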
  13. Overview (User-Defined Functions) ● Temporary UDFs for the current session only. ● Permanent UDFs stored in the SYSTEM.FUNCTION table. ● UDFs can be used in SQL statements and indexes. ● Tenant-specific UDF usage and support. ● Updating a UDF jar requires a cluster bounce.
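A sketch of registering a permanent UDF over JDBC; the function, class name, and jar location are hypothetical placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;

    public class RegisterUdfExample {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost:2181")) {
                // Hypothetical UDF class packaged in a jar already uploaded to HDFS.
                conn.createStatement().execute(
                        "CREATE FUNCTION my_reverse(varchar) RETURNS varchar"
                        + " AS 'com.example.MyReverseFunction'"
                        + " USING JAR 'hdfs://namenode:8020/hbase/lib/my-udfs.jar'");
            }
        }
    }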
  14. Overview (Transactions) ● Cross-row/cross-table ACID support via Apache Tephra. ● Create tables with the flag 'TRANSACTIONAL=true'. ● Enable transactions and the snapshot directory, and set the timeout value, in 'hbase-site.xml'. ● A transaction starts with the first statement against a transactional table. ● A transaction ends with a commit or rollback.
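A minimal sketch, assuming the transaction manager is already enabled in hbase-site.xml; the table and column names are illustrative:

    import java.sql.Connection;
    import java.sql.DriverManager;

    public class TransactionExample {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost:2181")) {
                conn.createStatement().execute(
                        "CREATE TABLE IF NOT EXISTS accounts"
                        + " (id VARCHAR NOT NULL PRIMARY KEY, balance BIGINT)"
                        + " TRANSACTIONAL=true");
                // The transaction starts implicitly with the first statement...
                conn.createStatement().executeUpdate("UPSERT INTO accounts VALUES ('a', 100)");
                conn.createStatement().executeUpdate("UPSERT INTO accounts VALUES ('b', 200)");
                conn.commit(); // ...and ends with commit() or rollback().
            }
        }
    }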
  15. Overview (Transactions) ● Applications typically let HBase manage timestamps. ● In case the application needs to control the timestamp, the 'CurrentSCN' property must be specified at connection time. ● 'CurrentSCN' controls the timestamp for any DDL, DML, or query.
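A sketch of pinning the timestamp at connection time; the timestamp value and table are arbitrary, for illustration only:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.util.Properties;

    public class CurrentScnExample {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // All DDL, DML, and queries on this connection run at this HBase timestamp.
            props.setProperty("CurrentSCN", Long.toString(1000L));
            try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost:2181", props)) {
                // Reads on this connection see data as of the given timestamp.
                conn.createStatement().executeQuery("SELECT count(*) FROM metrics");
            }
        }
    }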
  16. Overview (Schema) ● Table metadata is stored in a versioned HBase table (up to 1,000 versions). ● 'UPDATE_CACHE_FREQUENCY' lets the user declare how often the server is checked for metadata updates. Values: ○ ALWAYS ○ NEVER ○ A millisecond value
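For example, a table can be declared to re-check metadata at most every 15 minutes; this is a sketch and the table definition is hypothetical:

    import java.sql.Connection;
    import java.sql.DriverManager;

    public class CacheFrequencyExample {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost:2181")) {
                // Clients cache this table's metadata for 900,000 ms (15 minutes)
                // before checking back with the server for updates.
                conn.createStatement().execute(
                        "CREATE TABLE IF NOT EXISTS metrics"
                        + " (host VARCHAR NOT NULL PRIMARY KEY, v BIGINT)"
                        + " UPDATE_CACHE_FREQUENCY=900000");
            }
        }
    }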
  17. Overview (Schema) ● A Phoenix table can be: ○ Built from scratch. ○ Mapped to an existing HBase table: ■ Read-Write Table ■ Read-Only View
  18. Overview (Schema) ● Read-Write Table: ○ Column families are created automatically if they don't already exist. ○ An empty key value is added to the first column family of each existing row to minimize the size of the projection for queries.
  19. Overview (Schema) ● Read-Only View: ○ All column families must already exist. ○ The only change made to the HBase table is the addition of the Phoenix coprocessors used for query processing.
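A sketch of mapping Phoenix onto a pre-existing HBase table named "t1" with column family "cf1"; the names are hypothetical, and the binary encoding of existing cells must match the declared Phoenix types:

    import java.sql.Connection;
    import java.sql.DriverManager;

    public class MapExistingTableExample {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost:2181")) {
                // Read-only view over the existing HBase table "t1";
                // quoted identifiers preserve HBase's case-sensitive names.
                conn.createStatement().execute(
                        "CREATE VIEW \"t1\""
                        + " (pk VARCHAR PRIMARY KEY, \"cf1\".\"val\" VARCHAR)");
            }
        }
    }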
  20. 3. Architecture: Architecture, Phoenix Data Model, Query Execution and Environment
  21. Architecture
  22. Architecture
  23. Architecture (Phoenix Data Model)
  24. Architecture (Server Metrics Example)
  25. Architecture (Server Metrics Example)
  26. Architecture (Query Execution) ● Identify row key ranges from the query. ● Overlay row key ranges with regions. ● Execute parallel scans. ● Filter using skip scan. ● Intercept the scan in a coprocessor. ● Perform a final merge sort.
  27. Architecture (Environment) ● Data Warehouse ● Extract, Transform, Load (ETL) ● BI and Visualization
  28. 4. Code: Commands and Sample Code
  29. Code (Commands) ● DML Commands: ○ UPSERT VALUES ○ UPSERT SELECT ○ DELETE ● DDL Commands: ○ CREATE TABLE ○ CREATE VIEW ○ DROP TABLE ○ DROP VIEW
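A sketch exercising the three DML commands; the tables "metrics" and "metrics_archive" are hypothetical:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class DmlExample {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost:2181")) {
                Statement stmt = conn.createStatement();
                // UPSERT VALUES inserts or updates a single row.
                stmt.executeUpdate("UPSERT INTO metrics VALUES ('sf1.example.com', 42)");
                // UPSERT SELECT copies rows from one table into another.
                stmt.executeUpdate("UPSERT INTO metrics_archive SELECT * FROM metrics");
                // DELETE removes matching rows.
                stmt.executeUpdate("DELETE FROM metrics WHERE host = 'sf1.example.com'");
                conn.commit();
            }
        }
    }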
  30. Connection: ● Long running ● Short running

    Connection conn = DriverManager.getConnection("jdbc:phoenix:my_server:longRunning", longRunningProps);
    Connection conn = DriverManager.getConnection("jdbc:phoenix:my_server:shortRunning", shortRunningProps);
  31. Transactions: Create Table

    @Test
    public void createTable() throws Exception {
        String tableName = generateUniqueName();
        long numSaltBuckets = 6;
        String ddl = "CREATE TABLE " + tableName
                + " (K VARCHAR NOT NULL PRIMARY KEY, V VARCHAR)"
                + " SALT_BUCKETS = " + numSaltBuckets;
        Connection conn = DriverManager.getConnection(getUrl());
        conn.createStatement().execute(ddl);
    }
  32. Transactions: Read Table

    @Test
    public void readTable() throws Exception {
        String tableName = generateUniqueName();
        long numSaltBuckets = 6;
        long numRows = 1000;
        long numExpectedTasks = numSaltBuckets;
        insertRowsInTable(tableName, numRows);
        String query = "SELECT * FROM " + tableName;
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery(query);
        PhoenixResultSet resultSetBeingTested = rs.unwrap(PhoenixResultSet.class);
        changeInternalStateForTesting(resultSetBeingTested);
        while (resultSetBeingTested.next()) {}
        resultSetBeingTested.close();
        Set<String> expectedTableNames = Sets.newHashSet(tableName);
        assertReadMetricValuesForSelectSql(Lists.newArrayList(numRows),
                Lists.newArrayList(numExpectedTasks),
                resultSetBeingTested, expectedTableNames);
    }
  33. Transactions: Row Count

    @Override
    public void getRowCount(ResultSet resultSet) throws SQLException {
        Tuple row = resultSet.unwrap(PhoenixResultSet.class).getCurrentRow();
        Cell kv = row.getValue(0);
        ImmutableBytesWritable tmpPtr = new ImmutableBytesWritable(
                kv.getValueArray(), kv.getValueOffset(), kv.getValueLength());
        // A single Cell will be returned with the count(*) - we decode that here
        rowCount = PLong.INSTANCE.getCodec().decodeLong(tmpPtr, SortOrder.getDefault());
    }
  34. Transactions: Internal State

    private void changeInternalStateForTesting(PhoenixResultSet rs) {
        // get and set the internal state for testing purposes.
        ReadMetricQueue testMetricsQueue = new TestReadMetricsQueue(LogLevel.OFF, true);
        StatementContext ctx = (StatementContext) Whitebox.getInternalState(rs, "context");
        Whitebox.setInternalState(ctx, "readMetricsQueue", testMetricsQueue);
        Whitebox.setInternalState(rs, "readMetricsQueue", testMetricsQueue);
    }
  35. 5. Capabilities: Features and Capabilities
  36. Capabilities ● Overlays on top of the HBase data model. ● Keeps a versioned schema repository. ● Query processor.
  37. Capabilities ● Cost-based query optimizer. ● Enhance existing statistics collection. ● Generate histograms to drive query optimization decisions and join ordering.
  38. Capabilities ● Secondary indexes: ○ Boost the speed of queries without relying on specific row-key designs. ○ Enable users to use star schemas. ○ Leverage SQL tools and online analytics.
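A sketch of the three index flavors described in the notes below (global, local, and functional); the table and column names are hypothetical:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class IndexExample {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost:2181")) {
                Statement stmt = conn.createStatement();
                // Global index: optimized for read-heavy use cases.
                stmt.execute("CREATE INDEX idx_response ON metrics (response_time)");
                // Local index: optimized for write-heavy, space-constrained use cases.
                stmt.execute("CREATE LOCAL INDEX idx_gc ON metrics (gc_time)");
                // Functional index: built on an arbitrary expression.
                stmt.execute("CREATE INDEX idx_host_prefix ON metrics (SUBSTR(host, 1, 3))");
            }
        }
    }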
  39. Capabilities ● Row timestamp column. ● Sets a minimum and maximum time range for scans. ● Improves performance, especially when querying the tail end of the data.
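A sketch of declaring a row-timestamp column at table creation; the schema is hypothetical, and the column must be a date/time (or long) primary-key column declared once, when the table is created:

    import java.sql.Connection;
    import java.sql.DriverManager;

    public class RowTimestampExample {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost:2181")) {
                // "created" maps to the native HBase cell timestamp, letting scans
                // with time-range predicates skip data outside the range.
                conn.createStatement().execute(
                        "CREATE TABLE IF NOT EXISTS server_metrics ("
                        + " created DATE NOT NULL,"
                        + " host VARCHAR NOT NULL"
                        + " CONSTRAINT pk PRIMARY KEY (created ROW_TIMESTAMP, host))");
            }
        }
    }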
  40. 6. Scenarios: Use Cases
  41. Scenarios (Server Metrics Example)
  42. Scenarios (Chart Response Time Per Cluster)

    SELECT substr(host, 1, 3), trunc(date, 'DAY'), avg(response_time)
    FROM server_metrics
    WHERE date > CURRENT_DATE() - 7
      AND substr(host, 1, 3) IN ('sf1', 'sf3', 'sf7')
    GROUP BY substr(host, 1, 3), trunc(date, 'DAY')
  43. Scenarios (Find 5 Longest GC Times)

    SELECT host, date, gc_time
    FROM server_metrics
    WHERE date > CURRENT_DATE() - 7
      AND substr(host, 1, 3) IN ('sf1', 'sf3', 'sf7')
    ORDER BY gc_time DESC
    LIMIT 5
  44. Thanks! Any questions? You can find me at: GitHub: @sxaxmz LinkedIn: linkedin.com/in/husseinosama

Editor's Notes

  1. Apache Phoenix -> A scale-out RDBMS with evolutionary schema built on Apache HBase
  2. Started as an internal project out of a need to support a higher-level, well-understood SQL language.
  3. Apache HBase -> An open-source, non-relational, distributed database modeled after Google's Bigtable and written in Java. Used for random, real-time read/write access to Big Data. A column-oriented NoSQL database built on top of Hadoop.
  4. Apache Phoenix -> An open-source, massively parallel relational database engine supporting Online Transactional Processing (OLTP) and operational analytics in Hadoop. Provides a JDBC driver enabling users to create, delete, and alter SQL tables, views, and indexes, and to query data through SQL. Apache Phoenix is a relational layer over HBase: an SQL skin for HBase, with a JDBC driver that hides the intricacies of the NoSQL store.
  5. ACID is a set of properties of database transactions intended to guarantee data validity despite errors, power failures, and other mishaps. All changes to data are performed as if they are a single operation. 1. Atomicity preserves the "completeness" of the business process (all-or-nothing behavior). 2. Consistency refers to the state of the data both before and after the transaction is executed (a transaction maintains the consistency of the state of the data). 3. Isolation means that transactions can run at the same time as if there were no concurrency (a locking mechanism is required). 4. Durability refers to the impact of an outage or a failure on a running transaction (data survives any failures). To summarize, a transaction will either complete, producing correct results, or terminate, with no effect.
  6. Bulk loading for tables created in Phoenix is easier compared to tables created in the HBase shell.
  7. (Server Bounce) An administrator/technician removes power to the device in a "non-controlled shutdown": the "down" part of the bounce. Once the server is completely off and all activity has ceased, the administrator restarts the server.
  8. Set the phoenix.transactions.enabled property to true, along with running the transaction manager (included in the distribution), to enable full ACID transactions. Tables may optionally be declared as transactional. A concurrency model is used to detect row-level conflicts with first-commit-wins semantics; the later commit produces an exception indicating that a conflict was detected. A transaction is started implicitly when a transactional table is referenced in a statement, at which point no updates can be seen from other connections until either a commit or rollback occurs. Non-transactional tables will not see their updates until after a commit has occurred.
  9. Phoenix uses the value of this connection property as the max timestamp of scans. Timestamps may not be controlled for transactional tables; instead, the transaction manager assigns timestamps, which become the HBase cell timestamps after a commit. Timestamps are multiplied by 1,000,000 to ensure enough granularity for uniqueness across the cluster.
  10. Snapshot queries over older data will pick up and use the correct schema based on the time of connection (based on CurrentSCN). Data updates include the addition or removal of a table column and updates of table statistics. 1. The ALWAYS value causes the client to check with the server each time a statement referencing the table is executed (or once per commit for an UPSERT VALUES statement). 2. A millisecond value indicates how long the client will hold on to its cached version of the metadata before checking back with the server for updates.
  11. From scratch -> The HBase table and column families will be created automatically. Mapped to existing -> The binary representation of the row key and key values must match that of the Phoenix data types.
  12. 1. The primary use case for a VIEW is to transfer existing data into a Phoenix table. A table can also be declared as salted to prevent HBase region hot-spotting. The table catalog argument in the metadata APIs is used to filter based on the tenant ID for multi-tenant tables. 2. Data modifications are not allowed on a VIEW, and query performance will likely be lower than with a TABLE. Phoenix supports updatable views on top of tables, leveraging the schemaless capabilities of HBase to add columns to them. All views share the same underlying physical HBase table and may even be indexed independently. A multi-tenant view may add columns which are defined solely for that user.
  14. Phoenix chunks up the query using guideposts, which means more threads working on a single region. Phoenix runs queries in parallel on the client using a configurable number of threads. Aggregation is done in a coprocessor on the server side, reducing the amount of data returned to the client.
  17. ETL is a type of data integration that refers to the three steps used to blend data from multiple sources. It's often used to build a data warehouse.
  18. Data Manipulation Language (DML). Data Definition Language (DDL). For CREATE TABLE: 1. Any HBase metadata (table, column families) that doesn't already exist will be created. 2. The KEEP_DELETED_CELLS option is enabled to allow flashback queries to work correctly. 3. An empty key value will also be added for each row so that queries behave as expected (without requiring all columns to be projected during scans). For CREATE VIEW: the existing HBase metadata must match the metadata specified in the DDL statement (otherwise a table-read-only error is raised). For UPSERT VALUES: use it multiple times before committing to batch mutations. For UPSERT SELECT: configure phoenix.mutate.batchSize based on row size, and write directly to HBase on the server while running UPSERT SELECT on the same table by setting auto-commit to true.
  19. Enhance existing statistics collection by enabling further query optimizations based on the size and cardinality of the data. Generate histograms to drive query optimization decisions, such as secondary index usage and join ordering, based on cardinalities to produce the most efficient query plan.
  20. Secondary index types: global index (optimized for read-heavy use cases), local index (optimized for write-heavy, space-constrained use cases) and functional index (an index on an arbitrary expression). HBase tables are sorted maps. The star schema is the simplest style of data mart schema (it separates business process data into facts); the approach is widely used to develop data warehouses and dimensional data marts. The star schema consists of one or more fact tables referencing any number of dimension tables. A fact table contains measurements, metrics, and facts about a business process, while a dimension table is a companion to the fact table containing descriptive attributes used to constrain queries. Types of dimension table: slowly changing dimension, conformed dimension, junk dimension, degenerate dimension, role-playing dimension.
  21. Maps the HBase native timestamp to a Phoenix column, to take advantage of the various optimizations that HBase provides for time ranges. ROW_TIMESTAMP needs to be a primary key column in a date or time format (see the documentation for details). Only one primary key column can be designated as ROW_TIMESTAMP, declared at table creation (no null or negative values allowed).
  22. Cache content on the server through two main parts (SQL Read, SQL Write), serving end users while collecting content from content providers.