The Evolution of a Relational Database Layer over HBase

DataWorks Summit
DataWorks SummitDataWorks Summit
The Evolution of a Relational Database Layer over HBase
@ApachePhoenix
http://phoenix.apache.org/
James Taylor (@JamesPlusPlus)
V5
About James
• Architect at Salesforce.com
– Part of the Big Data group
• Lead of Apache Phoenix project
• PMC member of Apache Calcite
• Engineer and Product Manager at BEA Systems
– XQuery-based federated query engine
– SQL-based complex event processing engine
• Various startups prior to that
Agenda
• What is Apache Phoenix?
• State of the Union
• A Deeper Look
– Joins and Subquery Support
• What’s New?
• What’s Next?
• Q&A
What is Apache Phoenix?
• A relational database layer for Apache HBase
– Query engine
• Transforms SQL queries into native HBase API calls
• Pushes as much work as possible onto the cluster for parallel
execution
– Metadata repository
• Typed access to data stored in HBase tables
– A JDBC driver
• A top level Apache Software Foundation project
– Originally developed at Salesforce
– Now a top-level project at the ASF (Happy Birthday!)
– A growing community with momentum
Where Does Phoenix Fit In?
Sqoop
RDBDataCollector
Flume
LogDataCollector
Zookeeper
Coordination
YARN (MRv2)
Cluster Resource
Manager /
MapReduce
HDFS 2.0
Hadoop Distributed File System
GraphX
Graph analysis
framework
Phoenix
Query execution engine
HBase
Distributed Database
The Java Virtual Machine
Hadoop
Common JNI
Spark
Iterative In-Memory
Computation
MLLib
Data mining
Pig
Data Manipulation
Hive
Structured Query
Phoenix
JDBC client
State of the Union
• Broad enough SQL support to run TPC queries
– Joins, Sub-queries, Derived tables, etc.
• Three different secondary indexing strategies
– Immutable for write-once/append only data
– Global for read-heavy mutable data
– Local for write-heavy mutable or immutable data
• Statistics driven parallel execution
• Tracing and metrics for Monitoring & Management
Join and Subquery Support
• Grammar: inner join; left/right/full outer join; cross join
• Additional: semi join; anti join
• Algorithms: hash-join; sort-merge-join
• Optimizations:
– Predicate push-down
– FK-to-PK join optimization
– Global index with missing data columns
– Correlated subquery rewrite
TPC Example 1
Small-Quantity-Order Revenue Query (Q17)
select sum(l_extendedprice) / 7.0 as avg_yearly
from lineitem, part
where p_partkey = l_partkey
and p_brand = '[B]'
and p_container = '[C]'
and l_quantity < (
select 0.2 * avg(l_quantity)
from lineitem
where l_partkey = p_partkey
);
CLIENT 4-WAY FULL SCAN OVER lineitem
PARALLEL INNER JOIN TABLE 0
CLIENT 1-WAY FULL SCAN OVER part
SERVER FILTER BY p_partkey = ‘[B]’ AND p_container = ‘[C]’
PARALLEL INNER JOIN TABLE 1
CLIENT 4-WAY FULL SCAN OVER lineitem
SERVER AGGREGATE INTO DISTINCT ROWS BY l_partkey
AFTER-JOIN SERVER FILTER BY l_quantity < $0
TPC Example 2
Order Priority Checking Query (Q4)
select o_orderpriority, count(*) as order_count
from orders
where o_orderdate >= date '[D]'
and o_orderdate < date '[D]' + interval '3' month
and exists (
select * from lineitem
where l_orderkey = o_orderkey and l_commitdate < l_receiptdate
)
group by o_orderpriority
order by o_orderpriority;
CLIENT 4-WAY FULL SCAN OVER orders
SERVER FILTER o_orderdate >= ‘[D]’ AND o_orderdate < ‘[D]’ + 3(d)
SERVER AGGREGATE INTO ORDERED DISTINCT ROWS BY o_orderpriority
CLIENT MERGE SORT
SKIP-SCAN JOIN TABLE 0
CLIENT 4-WAY FULL SCAN OVER lineitem
SERVER FILTER BY l_commitdate < l_receiptdate
DYNAMIC SERVER FILTER BY o_orderkey IN l_orderkey
Join support - what can't we do?
• Nested Loop Join
• Statistics Guided Join Algorithm
– Smartly choose the smaller table for the build side
– Smartly switch between hash-join and sort-merge-join
– Smartly turn on/off FK-to-PK join optimization
What’s New?
• HBase 1.0 Support
• Functional Indexes
Functional Indexes
• Creating an index on an expression as opposed to just a
column value. For example, the following will be a full
table scan:
SELECT AVG(response_time) FROM SERVER_METRICS
WHERE DAYOFMONTH(create_time) = 1
• Adding the following functional index will turn it
into a range scan:
CREATE INDEX day_of_month_idx
ON SERVER_METRICS (DAYOFMONTH(create_time))
INCLUDE (response_time)
What’s New?
• HBase 1.0 Support
• Functional Indexes
• User Defined Functions
User Defined Functions
• Extension points to Phoenix for domain-specific
functions. For example, a geo-location application might
load a set of UDFs like this:
CREATE FUNCTION WOEID_DISTANCE(INTEGER,INTEGER)
RETURNS INTEGER AS ‘org.apache.geo.woeidDistance’
USING JAR ‘/lib/geo/geoloc.jar’
• Querying, functional indexing, etc. then possible:
SELECT * FROM woeid a JOIN woeid b ON a.country = b.country
WHERE woeid_distance(a.ID,b.ID) < 5
What’s New?
• HBase 1.0 Support
• Functional Indexes
• User Defined Functions
• Query Server with Thin Driver
Query Server + Thin Driver
• Offloads query planning and execution to different server(s)
• Minimizes client dependencies
– Enabler for ODBC driver (not available yet, though)
• Connect like this instead:
Connection conn = DriverManager.getConnection(
“jdbc:phoenix:thin:url=http://localhost:8765”);
• Still evolving, so no backward compatibility guarantees yet
• For more information, see
http://phoenix.apache.org/server.html
What’s New?
• HBase 1.0 Support
• Functional Indexes
• User Defined Functions
• Query Server with Thin Driver
• Union All support
• Testing at scale with Pherf
• MR index build
• Spark integration
• Date built-in functions – WEEK, DAYOFMONTH, etc.
• Transactions (WIP - will be in next release)
Transactions
• Snapshot isolation model
– Using Tephra (http://tephra.io/)
– Supports REPEABLE_READ isolation level
– Allows reading your own uncommitted data
• Optional
– Enabled on a table by table basis
– No performance penalty when not used
• Work in progress, but close to release
– Try our txn branch
– Will be available in next release
Optimistic Concurrency Control
• Avoids cost of locking rows and tables
• No deadlocks or lock escalations
• Cost of conflict detection and possible rollback is higher
• Good if conflicts are rare: short transaction, disjoint
partitioning of work
• Conflict detection not always necessary: write-once/append-
only data
Tephra Architecture
ZooKeeper
Tx Manager
(standby)
HBase
Master 1
Master 2
RS 1
RS 2 RS 4
RS 3
Client 1
Client 2
Client N
Tx Manager
(active)
time out
try abort
failed
roll back
in HBase
write
to
HBase
do work
Client Tx Manager
none
complete V
abortsucceeded
in progress
start tx
start start tx
commit
try commit check conflicts
invalid X
invalidate
failed
Transaction Lifecycle
Tephra Architecture
• TransactionAware client
• Coordinates transaction lifecycle with manager
• Communicates directly with HBase for reads and writes
• Transaction Manager
• Assigns transaction IDs
• Maintains state on in-progress, committed and invalid transactions
• Transaction Processor coprocessor
• Applies server-side filtering for reads
• Cleans up data from failed transactions, and no longer visible
versions
What’s New?
• HBase 1.0 Support
• Functional Indexes
• User Defined Functions
• Query Server with Thin Driver
• Union All support
• Testing at scale with Pherf
• MR index build
• Spark integration
• Date built-in functions – WEEK, DAYOFMONTH, etc.
• Transactions (WIP - will be in next release)
What’s Next?
• Is Phoenix done?
• What about the Big Picture?
– How can Phoenix be leveraged in the larger ecosystem?
– Hive, Pig, Spark, MR integration with Phoenix exists
today, but not a great story
What’s Next?
You are here
Introducing Apache Calcite
• Query parser, compiler, and planner framework
– SQL-92 compliant (ever argue SQL with Julian? :-) )
– Enables Phoenix to get missing SQL support
• Pluggable cost-based optimizer framework
– Sane way to model push down through rules
• Interop with other Calcite adaptors
– Not for free, but it becomes feasible
– Already used by Drill, Hive, Kylin, Samza
– Supports any JDBC source (i.e. RDBMS - remember them :-) )
– One cost-model to rule them all
How does Phoenix plug in?
Calcite Parser & Validator
Calcite Query Optimizer
Phoenix Query Plan Generator
Phoenix Runtime
Phoenix Tables over HBase
JDBC Client
SQL + Phoenix
specific
grammar Built-in rules
+ Phoenix
specific rules
Optimization Rules
• AggregateRemoveRule
• FilterAggregateTransposeRule
• FilterJoinRule
• FilterMergeRule
• JoinCommuteRule
• PhoenixFilterScanMergeRule
• PhoenixJoinSingleAggregateMergeRule
• …
Query Example
(filter push-down and smart join algorithm)
LogicalFilter
filter: $0 = ‘x’
LogicalJoin
type: inner
cond: $3 = $7
LogicalProject
projects: $0, $5
LogicalTableScan
table: A
LogicalTableScan
table: B
PhoenixTableScan
table: ‘a’
filter: $0 = ‘x’
PhoenixServerJoin
type: inner
cond: $3 = $1
PhoenixServerProject
projects: $2, $0
Optimizer
(with
RelOptRules &
ConvertRules)
PhoenixTableScan
table: ‘b’
PhoenixServerProject
projects: $0, $2
PhoenixServerProject
projects: $0, $3
Query Example
(filter push-down and smart join algorithm)
ScanPlan
table: ‘a’
skip-scan: pk0 = ‘x’
projects: pk0, c3
HashJoinPlan
types {inner}
join-keys: {$1}
projects: $2, $0
Build
hash-key: $1
Phoenix
Implementor
PhoenixTableScan
table: ‘a’
filter: $0 = ‘x’
PhoenixServerJoin
type: inner
cond: $3 = $1
PhoenixServerProject
projects: $2, $0
PhoenixTableScan
table: ‘b’
PhoenixServerProject
projects: $0, $2
PhoenixServerProject
projects: $0, $3
ScanPlan
table: ‘b’
projects: col0, col2
Probe
Interoperibility Example
• Joining data from Phoenix and mySQL
EnumerableJoin
PhoenixTableScan JdbcTableScan
Phoenix Tables over HBase mySQL Database
PhoenixToEnumerable
Converter
JdbcToEnumerable
Converter
Query Example 1
WITH m AS
(SELECT *
FROM dept_manager dm
WHERE from_date =
(SELECT max(from_date)
FROM dept_manager dm2
WHERE dm.dept_no = dm2.dept_no))
SELECT m.dept_no, d.dept_name, e.first_name, e.last_name
FROM employees e
JOIN m ON e.emp_no = m.emp_no
JOIN departments d ON d.dept_no = m.dept_no
ORDER BY d.dept_no;
Query Example 2
SELECT dept_no, title, count(*)
FROM titles t
JOIN dept_emp de ON t.emp_no = de.emp_no
WHERE dept_no <= 'd006'
GROUP BY rollup(dept_no, title)
ORDER BY dept_no, title;
Thank you!
Questions?
* who uses Phoenix
1 of 34

Recommended

Apache phoenix: Past, Present and Future of SQL over HBAse by
Apache phoenix: Past, Present and Future of SQL over HBAseApache phoenix: Past, Present and Future of SQL over HBAse
Apache phoenix: Past, Present and Future of SQL over HBAseenissoz
6.3K views41 slides
Apache Phoenix: Transforming HBase into a SQL Database by
Apache Phoenix: Transforming HBase into a SQL DatabaseApache Phoenix: Transforming HBase into a SQL Database
Apache Phoenix: Transforming HBase into a SQL DatabaseDataWorks Summit
16.7K views77 slides
Apache Big Data EU 2015 - Phoenix by
Apache Big Data EU 2015 - PhoenixApache Big Data EU 2015 - Phoenix
Apache Big Data EU 2015 - PhoenixNick Dimiduk
4.7K views40 slides
Apache Phoenix and HBase: Past, Present and Future of SQL over HBase by
Apache Phoenix and HBase: Past, Present and Future of SQL over HBaseApache Phoenix and HBase: Past, Present and Future of SQL over HBase
Apache Phoenix and HBase: Past, Present and Future of SQL over HBaseDataWorks Summit/Hadoop Summit
3.3K views41 slides
April 2014 HUG : Apache Phoenix by
April 2014 HUG : Apache PhoenixApril 2014 HUG : Apache Phoenix
April 2014 HUG : Apache PhoenixYahoo Developer Network
6.2K views21 slides
Apache phoenix by
Apache phoenixApache phoenix
Apache phoenixUniversity of Moratuwa
1.1K views19 slides

More Related Content

What's hot

Five major tips to maximize performance on a 200+ SQL HBase/Phoenix cluster by
Five major tips to maximize performance on a 200+ SQL HBase/Phoenix clusterFive major tips to maximize performance on a 200+ SQL HBase/Phoenix cluster
Five major tips to maximize performance on a 200+ SQL HBase/Phoenix clustermas4share
1.3K views32 slides
Apache Phoenix + Apache HBase by
Apache Phoenix + Apache HBaseApache Phoenix + Apache HBase
Apache Phoenix + Apache HBaseDataWorks Summit/Hadoop Summit
7.3K views43 slides
Apache Big Data EU 2015 - HBase by
Apache Big Data EU 2015 - HBaseApache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBaseNick Dimiduk
3.3K views40 slides
Local Secondary Indexes in Apache Phoenix by
Local Secondary Indexes in Apache PhoenixLocal Secondary Indexes in Apache Phoenix
Local Secondary Indexes in Apache PhoenixRajeshbabu Chintaguntla
2.6K views18 slides
Apache Phoenix: Use Cases and New Features by
Apache Phoenix: Use Cases and New FeaturesApache Phoenix: Use Cases and New Features
Apache Phoenix: Use Cases and New FeaturesHBaseCon
4.7K views39 slides
Dancing with the elephant h base1_final by
Dancing with the elephant   h base1_finalDancing with the elephant   h base1_final
Dancing with the elephant h base1_finalasterix_smartplatf
1.9K views24 slides

What's hot(20)

Five major tips to maximize performance on a 200+ SQL HBase/Phoenix cluster by mas4share
Five major tips to maximize performance on a 200+ SQL HBase/Phoenix clusterFive major tips to maximize performance on a 200+ SQL HBase/Phoenix cluster
Five major tips to maximize performance on a 200+ SQL HBase/Phoenix cluster
mas4share1.3K views
Apache Big Data EU 2015 - HBase by Nick Dimiduk
Apache Big Data EU 2015 - HBaseApache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBase
Nick Dimiduk3.3K views
Apache Phoenix: Use Cases and New Features by HBaseCon
Apache Phoenix: Use Cases and New FeaturesApache Phoenix: Use Cases and New Features
Apache Phoenix: Use Cases and New Features
HBaseCon4.7K views
Hortonworks Technical Workshop: HBase and Apache Phoenix by Hortonworks
Hortonworks Technical Workshop: HBase and Apache Phoenix Hortonworks Technical Workshop: HBase and Apache Phoenix
Hortonworks Technical Workshop: HBase and Apache Phoenix
Hortonworks15K views
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ... by HBaseCon
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
HBaseCon4.3K views
Apache Phoenix Query Server by Josh Elser
Apache Phoenix Query ServerApache Phoenix Query Server
Apache Phoenix Query Server
Josh Elser2.8K views
Apache Phoenix Query Server PhoenixCon2016 by Josh Elser
Apache Phoenix Query Server PhoenixCon2016Apache Phoenix Query Server PhoenixCon2016
Apache Phoenix Query Server PhoenixCon2016
Josh Elser2.2K views
HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory... by Cloudera, Inc.
HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory...HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory...
HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory...
Cloudera, Inc.4.1K views
HBase state of the union by enissoz
HBase   state of the unionHBase   state of the union
HBase state of the union
enissoz552 views
Mapreduce over snapshots by enissoz
Mapreduce over snapshotsMapreduce over snapshots
Mapreduce over snapshots
enissoz6.4K views
Practical Kerberos with Apache HBase by Josh Elser
Practical Kerberos with Apache HBasePractical Kerberos with Apache HBase
Practical Kerberos with Apache HBase
Josh Elser2.4K views

Viewers also liked

Introducción a Apache HBase by
Introducción a Apache HBaseIntroducción a Apache HBase
Introducción a Apache HBaseMarcos Ortiz Valmaseda
3K views20 slides
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce by
HBaseCon 2012 | HBase Schema Design - Ian Varley, SalesforceHBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, SalesforceCloudera, Inc.
41.7K views189 slides
Breaking with relational dbms and dating with hbase by
Breaking with relational dbms and dating with hbaseBreaking with relational dbms and dating with hbase
Breaking with relational dbms and dating with hbaseGaurav Kohli
3.1K views40 slides
Apache Ambari - What's New in 1.4.2 by
Apache Ambari - What's New in 1.4.2Apache Ambari - What's New in 1.4.2
Apache Ambari - What's New in 1.4.2Hortonworks
2K views8 slides
Seo for-bloggers-2014 by
Seo for-bloggers-2014Seo for-bloggers-2014
Seo for-bloggers-2014Rand Fishkin
70.6K views97 slides
Hadoop World 2011: Advanced HBase Schema Design by
Hadoop World 2011: Advanced HBase Schema DesignHadoop World 2011: Advanced HBase Schema Design
Hadoop World 2011: Advanced HBase Schema DesignCloudera, Inc.
17.9K views33 slides

Viewers also liked(20)

HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce by Cloudera, Inc.
HBaseCon 2012 | HBase Schema Design - Ian Varley, SalesforceHBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
Cloudera, Inc.41.7K views
Breaking with relational dbms and dating with hbase by Gaurav Kohli
Breaking with relational dbms and dating with hbaseBreaking with relational dbms and dating with hbase
Breaking with relational dbms and dating with hbase
Gaurav Kohli3.1K views
Apache Ambari - What's New in 1.4.2 by Hortonworks
Apache Ambari - What's New in 1.4.2Apache Ambari - What's New in 1.4.2
Apache Ambari - What's New in 1.4.2
Hortonworks2K views
Seo for-bloggers-2014 by Rand Fishkin
Seo for-bloggers-2014Seo for-bloggers-2014
Seo for-bloggers-2014
Rand Fishkin70.6K views
Hadoop World 2011: Advanced HBase Schema Design by Cloudera, Inc.
Hadoop World 2011: Advanced HBase Schema DesignHadoop World 2011: Advanced HBase Schema Design
Hadoop World 2011: Advanced HBase Schema Design
Cloudera, Inc.17.9K views
Evaluating NoSQL Performance: Time for Benchmarking by Sergey Bushik
Evaluating NoSQL Performance: Time for BenchmarkingEvaluating NoSQL Performance: Time for Benchmarking
Evaluating NoSQL Performance: Time for Benchmarking
Sergey Bushik28.6K views
Near-realtime analytics with Kafka and HBase by dave_revell
Near-realtime analytics with Kafka and HBaseNear-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBase
dave_revell23.6K views
Apache HBase 1.0 Release by Nick Dimiduk
Apache HBase 1.0 ReleaseApache HBase 1.0 Release
Apache HBase 1.0 Release
Nick Dimiduk25.6K views
Netflix Webkit-Based UI for TV Devices by Matt McCarthy
Netflix Webkit-Based UI for TV DevicesNetflix Webkit-Based UI for TV Devices
Netflix Webkit-Based UI for TV Devices
Matt McCarthy19.7K views
Couchbase Performance Benchmarking by Renat Khasanshyn
Couchbase Performance BenchmarkingCouchbase Performance Benchmarking
Couchbase Performance Benchmarking
Renat Khasanshyn33.7K views
Realtime Analytics with Hadoop and HBase by larsgeorge
Realtime Analytics with Hadoop and HBaseRealtime Analytics with Hadoop and HBase
Realtime Analytics with Hadoop and HBase
larsgeorge20.3K views
Deploying and Managing Hadoop Clusters with AMBARI by DataWorks Summit
Deploying and Managing Hadoop Clusters with AMBARIDeploying and Managing Hadoop Clusters with AMBARI
Deploying and Managing Hadoop Clusters with AMBARI
DataWorks Summit29.9K views
Apache HBase for Architects by Nick Dimiduk
Apache HBase for ArchitectsApache HBase for Architects
Apache HBase for Architects
Nick Dimiduk11.2K views
Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi by Slim Baltagi
Apache-Flink-What-How-Why-Who-Where-by-Slim-BaltagiApache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi
Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi
Slim Baltagi16.5K views
HBaseCon 2015: Analyzing HBase Data with Apache Hive by HBaseCon
HBaseCon 2015: Analyzing HBase Data with Apache  HiveHBaseCon 2015: Analyzing HBase Data with Apache  Hive
HBaseCon 2015: Analyzing HBase Data with Apache Hive
HBaseCon7.9K views
HBase for Architects by Nick Dimiduk
HBase for ArchitectsHBase for Architects
HBase for Architects
Nick Dimiduk33.7K views
20090713 Hbase Schema Design Case Studies by Evan Liu
20090713 Hbase Schema Design Case Studies20090713 Hbase Schema Design Case Studies
20090713 Hbase Schema Design Case Studies
Evan Liu18.5K views
Intro to HBase Internals & Schema Design (for HBase users) by alexbaranau
Intro to HBase Internals & Schema Design (for HBase users)Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)
alexbaranau28.1K views
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database by Edureka!
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL databaseHBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
Edureka!104.5K views

Similar to The Evolution of a Relational Database Layer over HBase

HBaseCon2015-final by
HBaseCon2015-finalHBaseCon2015-final
HBaseCon2015-finalMaryann Xue
127 views35 slides
HBaseCon2016-final by
HBaseCon2016-finalHBaseCon2016-final
HBaseCon2016-finalMaryann Xue
323 views38 slides
What's New in .Net 4.5 by
What's New in .Net 4.5What's New in .Net 4.5
What's New in .Net 4.5Malam Team
3K views31 slides
ONE FOR ALL! Using Apache Calcite to make SQL smart by
ONE FOR ALL! Using Apache Calcite to make SQL smartONE FOR ALL! Using Apache Calcite to make SQL smart
ONE FOR ALL! Using Apache Calcite to make SQL smartEvans Ye
735 views53 slides
Data all over the place! How SQL and Apache Calcite bring sanity to streaming... by
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Julian Hyde
4K views56 slides
Fabian Hueske - Taking a look under the hood of Apache Flink’s relational APIs by
Fabian Hueske - Taking a look under the hood of Apache Flink’s relational APIsFabian Hueske - Taking a look under the hood of Apache Flink’s relational APIs
Fabian Hueske - Taking a look under the hood of Apache Flink’s relational APIsFlink Forward
423 views36 slides

Similar to The Evolution of a Relational Database Layer over HBase(20)

HBaseCon2015-final by Maryann Xue
HBaseCon2015-finalHBaseCon2015-final
HBaseCon2015-final
Maryann Xue127 views
HBaseCon2016-final by Maryann Xue
HBaseCon2016-finalHBaseCon2016-final
HBaseCon2016-final
Maryann Xue323 views
What's New in .Net 4.5 by Malam Team
What's New in .Net 4.5What's New in .Net 4.5
What's New in .Net 4.5
Malam Team3K views
ONE FOR ALL! Using Apache Calcite to make SQL smart by Evans Ye
ONE FOR ALL! Using Apache Calcite to make SQL smartONE FOR ALL! Using Apache Calcite to make SQL smart
ONE FOR ALL! Using Apache Calcite to make SQL smart
Evans Ye735 views
Data all over the place! How SQL and Apache Calcite bring sanity to streaming... by Julian Hyde
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Julian Hyde4K views
Fabian Hueske - Taking a look under the hood of Apache Flink’s relational APIs by Flink Forward
Fabian Hueske - Taking a look under the hood of Apache Flink’s relational APIsFabian Hueske - Taking a look under the hood of Apache Flink’s relational APIs
Fabian Hueske - Taking a look under the hood of Apache Flink’s relational APIs
Flink Forward423 views
Taking a look under the hood of Apache Flink's relational APIs. by Fabian Hueske
Taking a look under the hood of Apache Flink's relational APIs.Taking a look under the hood of Apache Flink's relational APIs.
Taking a look under the hood of Apache Flink's relational APIs.
Fabian Hueske2.7K views
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive by Xu Jiang
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveApache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Xu Jiang33.1K views
hbaseconasia2019 Phoenix Improvements and Practices on Cloud HBase at Alibaba by Michael Stack
hbaseconasia2019 Phoenix Improvements and Practices on Cloud HBase at Alibabahbaseconasia2019 Phoenix Improvements and Practices on Cloud HBase at Alibaba
hbaseconasia2019 Phoenix Improvements and Practices on Cloud HBase at Alibaba
Michael Stack163 views
Running Airflow Workflows as ETL Processes on Hadoop by clairvoyantllc
Running Airflow Workflows as ETL Processes on HadoopRunning Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on Hadoop
clairvoyantllc2K views
Distributed Kafka Architecture Taboola Scale by Apache Kafka TLV
Distributed Kafka Architecture Taboola ScaleDistributed Kafka Architecture Taboola Scale
Distributed Kafka Architecture Taboola Scale
Apache Kafka TLV792 views
Distributed & Highly Available server applications in Java and Scala by Max Alexejev
Distributed & Highly Available server applications in Java and ScalaDistributed & Highly Available server applications in Java and Scala
Distributed & Highly Available server applications in Java and Scala
Max Alexejev5.6K views
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc... by Databricks
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Databricks10.8K views
phoenix-on-calcite-nyc-meetup by Maryann Xue
phoenix-on-calcite-nyc-meetupphoenix-on-calcite-nyc-meetup
phoenix-on-calcite-nyc-meetup
Maryann Xue143 views
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers by Lucidworks
SQL Analytics for Search Engineers - Timothy Potter, LucidworksngineersSQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
Lucidworks173 views
A Smarter Pig: Building a SQL interface to Pig using Apache Calcite by Salesforce Engineering
A Smarter Pig: Building a SQL interface to Pig using Apache CalciteA Smarter Pig: Building a SQL interface to Pig using Apache Calcite
A Smarter Pig: Building a SQL interface to Pig using Apache Calcite
Performance Optimizations in Apache Impala by Cloudera, Inc.
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
Cloudera, Inc.10.7K views
Managing multi tenant resource toward Hive 2.0 by Kai Sasaki
Managing multi tenant resource toward Hive 2.0Managing multi tenant resource toward Hive 2.0
Managing multi tenant resource toward Hive 2.0
Kai Sasaki2.2K views
HBaseConAsia2018 Track2-4: HTAP DB-System: AsparaDB HBase, Phoenix, and Spark by Michael Stack
HBaseConAsia2018 Track2-4: HTAP DB-System: AsparaDB HBase, Phoenix, and SparkHBaseConAsia2018 Track2-4: HTAP DB-System: AsparaDB HBase, Phoenix, and Spark
HBaseConAsia2018 Track2-4: HTAP DB-System: AsparaDB HBase, Phoenix, and Spark
Michael Stack742 views
WinOps Conf 2016 - Michael Greene - Release Pipelines by WinOps Conf
WinOps Conf 2016 - Michael Greene - Release PipelinesWinOps Conf 2016 - Michael Greene - Release Pipelines
WinOps Conf 2016 - Michael Greene - Release Pipelines
WinOps Conf1.2K views

More from DataWorks Summit

Data Science Crash Course by
Data Science Crash CourseData Science Crash Course
Data Science Crash CourseDataWorks Summit
19.3K views47 slides
Floating on a RAFT: HBase Durability with Apache Ratis by
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
2.9K views20 slides
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi by
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
2.1K views19 slides
HBase Tales From the Trenches - Short stories about most common HBase operati... by
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
1.8K views18 slides
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac... by
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
1.6K views74 slides
Managing the Dewey Decimal System by
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
1K views8 slides

More from DataWorks Summit(20)

Floating on a RAFT: HBase Durability with Apache Ratis by DataWorks Summit
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
DataWorks Summit2.9K views
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi by DataWorks Summit
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
DataWorks Summit2.1K views
HBase Tales From the Trenches - Short stories about most common HBase operati... by DataWorks Summit
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
DataWorks Summit1.8K views
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac... by DataWorks Summit
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
DataWorks Summit1.6K views
Practical NoSQL: Accumulo's dirlist Example by DataWorks Summit
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
DataWorks Summit834 views
HBase Global Indexing to support large-scale data ingestion at Uber by DataWorks Summit
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit915 views
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix by DataWorks Summit
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
DataWorks Summit714 views
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi by DataWorks Summit
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
DataWorks Summit1.3K views
Supporting Apache HBase : Troubleshooting and Supportability Improvements by DataWorks Summit
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
DataWorks Summit1.8K views
Security Framework for Multitenant Architecture by DataWorks Summit
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit1.1K views
Presto: Optimizing Performance of SQL-on-Anything Engine by DataWorks Summit
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
DataWorks Summit1.8K views
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl... by DataWorks Summit
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
DataWorks Summit3.2K views
Extending Twitter's Data Platform to Google Cloud by DataWorks Summit
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
DataWorks Summit1K views
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi by DataWorks Summit
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
DataWorks Summit4K views
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger by DataWorks Summit
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
DataWorks Summit957 views
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory... by DataWorks Summit
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
DataWorks Summit771 views
Computer Vision: Coming to a Store Near You by DataWorks Summit
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit214 views
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark by DataWorks Summit
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
DataWorks Summit615 views

Recently uploaded

PharoJS - Zürich Smalltalk Group Meetup November 2023 by
PharoJS - Zürich Smalltalk Group Meetup November 2023PharoJS - Zürich Smalltalk Group Meetup November 2023
PharoJS - Zürich Smalltalk Group Meetup November 2023Noury Bouraqadi
139 views17 slides
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas... by
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...Bernd Ruecker
48 views69 slides
Microsoft Power Platform.pptx by
Microsoft Power Platform.pptxMicrosoft Power Platform.pptx
Microsoft Power Platform.pptxUni Systems S.M.S.A.
61 views38 slides
"Surviving highload with Node.js", Andrii Shumada by
"Surviving highload with Node.js", Andrii Shumada "Surviving highload with Node.js", Andrii Shumada
"Surviving highload with Node.js", Andrii Shumada Fwdays
33 views29 slides
Ransomware is Knocking your Door_Final.pdf by
Ransomware is Knocking your Door_Final.pdfRansomware is Knocking your Door_Final.pdf
Ransomware is Knocking your Door_Final.pdfSecurity Bootcamp
66 views46 slides
The Research Portal of Catalonia: Growing more (information) & more (services) by
The Research Portal of Catalonia: Growing more (information) & more (services)The Research Portal of Catalonia: Growing more (information) & more (services)
The Research Portal of Catalonia: Growing more (information) & more (services)CSUC - Consorci de Serveis Universitaris de Catalunya
115 views25 slides

Recently uploaded(20)

PharoJS - Zürich Smalltalk Group Meetup November 2023 by Noury Bouraqadi
PharoJS - Zürich Smalltalk Group Meetup November 2023PharoJS - Zürich Smalltalk Group Meetup November 2023
PharoJS - Zürich Smalltalk Group Meetup November 2023
Noury Bouraqadi139 views
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas... by Bernd Ruecker
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
Bernd Ruecker48 views
"Surviving highload with Node.js", Andrii Shumada by Fwdays
"Surviving highload with Node.js", Andrii Shumada "Surviving highload with Node.js", Andrii Shumada
"Surviving highload with Node.js", Andrii Shumada
Fwdays33 views
Transitioning from VMware vCloud to Apache CloudStack: A Path to Profitabilit... by ShapeBlue
Transitioning from VMware vCloud to Apache CloudStack: A Path to Profitabilit...Transitioning from VMware vCloud to Apache CloudStack: A Path to Profitabilit...
Transitioning from VMware vCloud to Apache CloudStack: A Path to Profitabilit...
ShapeBlue40 views
Backroll, News and Demo - Pierre Charton, Matthias Dhellin, Ousmane Diarra - ... by ShapeBlue
Backroll, News and Demo - Pierre Charton, Matthias Dhellin, Ousmane Diarra - ...Backroll, News and Demo - Pierre Charton, Matthias Dhellin, Ousmane Diarra - ...
Backroll, News and Demo - Pierre Charton, Matthias Dhellin, Ousmane Diarra - ...
ShapeBlue61 views
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ... by Jasper Oosterveld
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...
Why and How CloudStack at weSystems - Stephan Bienek - weSystems by ShapeBlue
Why and How CloudStack at weSystems - Stephan Bienek - weSystemsWhy and How CloudStack at weSystems - Stephan Bienek - weSystems
Why and How CloudStack at weSystems - Stephan Bienek - weSystems
ShapeBlue81 views
Business Analyst Series 2023 - Week 4 Session 7 by DianaGray10
Business Analyst Series 2023 -  Week 4 Session 7Business Analyst Series 2023 -  Week 4 Session 7
Business Analyst Series 2023 - Week 4 Session 7
DianaGray1042 views
How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ... by ShapeBlue
How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ...How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ...
How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ...
ShapeBlue46 views
KVM Security Groups Under the Hood - Wido den Hollander - Your.Online by ShapeBlue
KVM Security Groups Under the Hood - Wido den Hollander - Your.OnlineKVM Security Groups Under the Hood - Wido den Hollander - Your.Online
KVM Security Groups Under the Hood - Wido den Hollander - Your.Online
ShapeBlue75 views
DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti... by ShapeBlue
DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti...DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti...
DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti...
ShapeBlue26 views
Don’t Make A Human Do A Robot’s Job! : 6 Reasons Why AI Will Save Us & Not De... by Moses Kemibaro
Don’t Make A Human Do A Robot’s Job! : 6 Reasons Why AI Will Save Us & Not De...Don’t Make A Human Do A Robot’s Job! : 6 Reasons Why AI Will Save Us & Not De...
Don’t Make A Human Do A Robot’s Job! : 6 Reasons Why AI Will Save Us & Not De...
Moses Kemibaro27 views
Igniting Next Level Productivity with AI-Infused Data Integration Workflows by Safe Software
Igniting Next Level Productivity with AI-Infused Data Integration Workflows Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Safe Software317 views

The Evolution of a Relational Database Layer over HBase

  • 1. The Evolution of a Relational Database Layer over HBase @ApachePhoenix http://phoenix.apache.org/ James Taylor (@JamesPlusPlus) V5
  • 2. About James • Architect at Salesforce.com – Part of the Big Data group • Lead of Apache Phoenix project • PMC member of Apache Calcite • Engineer and Product Manager at BEA Systems – XQuery-based federated query engine – SQL-based complex event processing engine • Various startups prior to that
  • 3. Agenda • What is Apache Phoenix? • State of the Union • A Deeper Look – Joins and Subquery Support • What’s New? • What’s Next? • Q&A
  • 4. What is Apache Phoenix? • A relational database layer for Apache HBase – Query engine • Transforms SQL queries into native HBase API calls • Pushes as much work as possible onto the cluster for parallel execution – Metadata repository • Typed access to data stored in HBase tables – A JDBC driver • A top level Apache Software Foundation project – Originally developed at Salesforce – Now a top-level project at the ASF (Happy Birthday!) – A growing community with momentum
  • 5. Where Does Phoenix Fit In? Sqoop RDBDataCollector Flume LogDataCollector Zookeeper Coordination YARN (MRv2) Cluster Resource Manager / MapReduce HDFS 2.0 Hadoop Distributed File System GraphX Graph analysis framework Phoenix Query execution engine HBase Distributed Database The Java Virtual Machine Hadoop Common JNI Spark Iterative In-Memory Computation MLLib Data mining Pig Data Manipulation Hive Structured Query Phoenix JDBC client
  • 6. State of the Union • Broad enough SQL support to run TPC queries – Joins, Sub-queries, Derived tables, etc. • Three different secondary indexing strategies – Immutable for write-once/append only data – Global for read-heavy mutable data – Local for write-heavy mutable or immutable data • Statistics driven parallel execution • Tracing and metrics for Monitoring & Management
  • 7. Join and Subquery Support • Grammar: inner join; left/right/full outer join; cross join • Additional: semi join; anti join • Algorithms: hash-join; sort-merge-join • Optimizations: – Predicate push-down – FK-to-PK join optimization – Global index with missing data columns – Correlated subquery rewrite
  • 8. TPC Example 1 Small-Quantity-Order Revenue Query (Q17) select sum(l_extendedprice) / 7.0 as avg_yearly from lineitem, part where p_partkey = l_partkey and p_brand = '[B]' and p_container = '[C]' and l_quantity < ( select 0.2 * avg(l_quantity) from lineitem where l_partkey = p_partkey ); CLIENT 4-WAY FULL SCAN OVER lineitem PARALLEL INNER JOIN TABLE 0 CLIENT 1-WAY FULL SCAN OVER part SERVER FILTER BY p_partkey = ‘[B]’ AND p_container = ‘[C]’ PARALLEL INNER JOIN TABLE 1 CLIENT 4-WAY FULL SCAN OVER lineitem SERVER AGGREGATE INTO DISTINCT ROWS BY l_partkey AFTER-JOIN SERVER FILTER BY l_quantity < $0
  • 9. TPC Example 2 Order Priority Checking Query (Q4) select o_orderpriority, count(*) as order_count from orders where o_orderdate >= date '[D]' and o_orderdate < date '[D]' + interval '3' month and exists ( select * from lineitem where l_orderkey = o_orderkey and l_commitdate < l_receiptdate ) group by o_orderpriority order by o_orderpriority; CLIENT 4-WAY FULL SCAN OVER orders SERVER FILTER o_orderdate >= ‘[D]’ AND o_orderdate < ‘[D]’ + 3(d) SERVER AGGREGATE INTO ORDERED DISTINCT ROWS BY o_orderpriority CLIENT MERGE SORT SKIP-SCAN JOIN TABLE 0 CLIENT 4-WAY FULL SCAN OVER lineitem SERVER FILTER BY l_commitdate < l_receiptdate DYNAMIC SERVER FILTER BY o_orderkey IN l_orderkey
  • 10. Join support - what can't we do? • Nested Loop Join • Statistics Guided Join Algorithm – Smartly choose the smaller table for the build side – Smartly switch between hash-join and sort-merge-join – Smartly turn on/off FK-to-PK join optimization
  • 11. What’s New? • HBase 1.0 Support • Functional Indexes
  • 12. Functional Indexes • Creating an index on an expression as opposed to just a column value. For example, the following will be a full table scan: SELECT AVG(response_time) FROM SERVER_METRICS WHERE DAYOFMONTH(create_time) = 1 • Adding the following functional index will turn it into a range scan: CREATE INDEX day_of_month_idx ON SERVER_METRICS (DAYOFMONTH(create_time)) INCLUDE (response_time)
  • 13. What’s New? • HBase 1.0 Support • Functional Indexes • User Defined Functions
  • 14. User Defined Functions • Extension points to Phoenix for domain-specific functions. For example, a geo-location application might load a set of UDFs like this: CREATE FUNCTION WOEID_DISTANCE(INTEGER,INTEGER) RETURNS INTEGER AS ‘org.apache.geo.woeidDistance’ USING JAR ‘/lib/geo/geoloc.jar’ • Querying, functional indexing, etc. then possible: SELECT * FROM woeid a JOIN woeid b ON a.country = b.country WHERE woeid_distance(a.ID,b.ID) < 5
  • 15. What’s New? • HBase 1.0 Support • Functional Indexes • User Defined Functions • Query Server with Thin Driver
  • 16. Query Server + Thin Driver • Offloads query planning and execution to different server(s) • Minimizes client dependencies – Enabler for ODBC driver (not available yet, though) • Connect like this instead: Connection conn = DriverManager.getConnection( “jdbc:phoenix:thin:url=http://localhost:8765”); • Still evolving, so no backward compatibility guarantees yet • For more information, see http://phoenix.apache.org/server.html
  • 17. What’s New? • HBase 1.0 Support • Functional Indexes • User Defined Functions • Query Server with Thin Driver • Union All support • Testing at scale with Pherf • MR index build • Spark integration • Date built-in functions – WEEK, DAYOFMONTH, etc. • Transactions (WIP - will be in next release)
  • 18. Transactions • Snapshot isolation model – Using Tephra (http://tephra.io/) – Supports REPEABLE_READ isolation level – Allows reading your own uncommitted data • Optional – Enabled on a table by table basis – No performance penalty when not used • Work in progress, but close to release – Try our txn branch – Will be available in next release
  • 19. Optimistic Concurrency Control • Avoids cost of locking rows and tables • No deadlocks or lock escalations • Cost of conflict detection and possible rollback is higher • Good if conflicts are rare: short transaction, disjoint partitioning of work • Conflict detection not always necessary: write-once/append- only data
  • 20. Tephra Architecture ZooKeeper Tx Manager (standby) HBase Master 1 Master 2 RS 1 RS 2 RS 4 RS 3 Client 1 Client 2 Client N Tx Manager (active)
  • 21. time out try abort failed roll back in HBase write to HBase do work Client Tx Manager none complete V abortsucceeded in progress start tx start start tx commit try commit check conflicts invalid X invalidate failed Transaction Lifecycle
  • 22. Tephra Architecture • TransactionAware client • Coordinates transaction lifecycle with manager • Communicates directly with HBase for reads and writes • Transaction Manager • Assigns transaction IDs • Maintains state on in-progress, committed and invalid transactions • Transaction Processor coprocessor • Applies server-side filtering for reads • Cleans up data from failed transactions, and no longer visible versions
  • 23. What’s New? • HBase 1.0 Support • Functional Indexes • User Defined Functions • Query Server with Thin Driver • Union All support • Testing at scale with Pherf • MR index build • Spark integration • Date built-in functions – WEEK, DAYOFMONTH, etc. • Transactions (WIP - will be in next release)
  • 24. What’s Next? • Is Phoenix done? • What about the Big Picture? – How can Phoenix be leveraged in the larger ecosystem? – Hive, Pig, Spark, MR integration with Phoenix exists today, but not a great story
  • 26. Introducing Apache Calcite • Query parser, compiler, and planner framework – SQL-92 compliant (ever argue SQL with Julian? :-) ) – Enables Phoenix to get missing SQL support • Pluggable cost-based optimizer framework – Sane way to model push down through rules • Interop with other Calcite adaptors – Not for free, but it becomes feasible – Already used by Drill, Hive, Kylin, Samza – Supports any JDBC source (i.e. RDBMS - remember them :-) ) – One cost-model to rule them all
  • 27. How does Phoenix plug in? Calcite Parser & Validator Calcite Query Optimizer Phoenix Query Plan Generator Phoenix Runtime Phoenix Tables over HBase JDBC Client SQL + Phoenix specific grammar Built-in rules + Phoenix specific rules
  • 28. Optimization Rules • AggregateRemoveRule • FilterAggregateTransposeRule • FilterJoinRule • FilterMergeRule • JoinCommuteRule • PhoenixFilterScanMergeRule • PhoenixJoinSingleAggregateMergeRule • …
  • 29. Query Example (filter push-down and smart join algorithm) LogicalFilter filter: $0 = ‘x’ LogicalJoin type: inner cond: $3 = $7 LogicalProject projects: $0, $5 LogicalTableScan table: A LogicalTableScan table: B PhoenixTableScan table: ‘a’ filter: $0 = ‘x’ PhoenixServerJoin type: inner cond: $3 = $1 PhoenixServerProject projects: $2, $0 Optimizer (with RelOptRules & ConvertRules) PhoenixTableScan table: ‘b’ PhoenixServerProject projects: $0, $2 PhoenixServerProject projects: $0, $3
  • 30. Query Example (filter push-down and smart join algorithm) ScanPlan table: ‘a’ skip-scan: pk0 = ‘x’ projects: pk0, c3 HashJoinPlan types {inner} join-keys: {$1} projects: $2, $0 Build hash-key: $1 Phoenix Implementor PhoenixTableScan table: ‘a’ filter: $0 = ‘x’ PhoenixServerJoin type: inner cond: $3 = $1 PhoenixServerProject projects: $2, $0 PhoenixTableScan table: ‘b’ PhoenixServerProject projects: $0, $2 PhoenixServerProject projects: $0, $3 ScanPlan table: ‘b’ projects: col0, col2 Probe
  • 31. Interoperibility Example • Joining data from Phoenix and mySQL EnumerableJoin PhoenixTableScan JdbcTableScan Phoenix Tables over HBase mySQL Database PhoenixToEnumerable Converter JdbcToEnumerable Converter
  • 32. Query Example 1 WITH m AS (SELECT * FROM dept_manager dm WHERE from_date = (SELECT max(from_date) FROM dept_manager dm2 WHERE dm.dept_no = dm2.dept_no)) SELECT m.dept_no, d.dept_name, e.first_name, e.last_name FROM employees e JOIN m ON e.emp_no = m.emp_no JOIN departments d ON d.dept_no = m.dept_no ORDER BY d.dept_no;
  • 33. Query Example 2 SELECT dept_no, title, count(*) FROM titles t JOIN dept_emp de ON t.emp_no = de.emp_no WHERE dept_no <= 'd006' GROUP BY rollup(dept_no, title) ORDER BY dept_no, title;

Editor's Notes

  1. Who here is already familiar with Phoenix? GitHub -> Incubator -> TLP 1000 msg / month -> 2000 1 year old today
  2. TPC = complex queries used to benchmark SQL databases against each other
  3. All types of; algorithms. FK-PK opt Useful in global index. Other opt Many TPC queries.
  4. a yearly average price for orders of a specific part brand and part container with a quantity less than 20% of the average quantity of orders for the same part. join + correlated subquery. Two opt in query plan: 1st one de-correlation. 2nd one predicate push-down.
  5. An example of EXISTS => semi-join Triggers another opt, FK-PK join opt In query plan, SKIP-SCAN-JOIN with a dynamic filter At runtime, a skip-scan not a full-scan on orders table
  6. Something missing. 2 join algorithms, hash and merge. Former faster vs. latter for two large tables. How to decide which algorithm? Can’t. Prioritize one. Can’t do the join side either. Are we going to? Yes. Table stats for choosing join algorithm and optimization.
  7. Jeffrey & Enis Thomas
  8. Rajeshbabu
  9. Nick & Julian
  10. Maryann, myself, and Alicia Cody & Mujtaba Ravi Josh Mahonin Alicia Thomas, myself, and Gary Helmling
  11. Slides courtesy of Gary and Andreas Go to Gary’s talk on CDAP at 4:10
  12. Ran out of room – didn’t even mention the 8x perf improvement for unordered, unaggregated queries by Samarth Fantastic work by a lot of people to pull this together
  13. Join ordering and other optimizations now possible
  14. Details of integration: Position and interact? 1. A customized Parser + Validator 2. Query Optimizer + own table stats + Phoenix rules. 3. Translation process 4. Phoenix Runtime
  15. A join query with a WHERE condition. Highlight filter push-down and swap of join tables, called FilterJoinTransposeRule and JoinCommuteRule. Conversion from Logical to Phoenix physical at the same time. Opt: Filter on table ‘A’ … The tree on the right => output A good example of how Calcite can make the decision of join algorithms easy. 
  16. Default implementation of backend: Enumerable RelNodes w/ adapters, run Phoenix + other data source. Example of joining Phoenix with JDBC: EnumerableJoin …  Can replace JDBC table with one from other data source.
  17. WITH: we don’t have for front-end but equivalent to derived table. Get the grammar from Calcite and run in Phoenix.
  18. ROLLUP group-by: part to Phoenix and rest to itself.