Replace Oracle with Hadoop
John Leach, Co-Founder and CTO
August 4, 2014
Data Doubling Every 2 Years…
Driven by web, social, mobile, and Internet of Things
Source: 2013 IBM Briefing Book
Traditional RDBMSs Overwhelmed…
Scale-up becoming cost-prohibitive
"Oracle is too darn expensive!"
"My DB is hitting the wall."
"Users keep getting those spinning beach balls."
"We have to throw data away."
"Our reports take forever."
Scale-Out: The Future of Databases
Dramatic improvement in price/performance
[Diagram: Scale Up (increase server size) vs. Scale Out (more small servers)]
Who Are We?
The only Hadoop RDBMS: replace your old RDBMS with a scale-out SQL database
Affordable, scale-out ACID transactions
No application rewrites
10x better price/performance
Case Study: Harte-Hanks
Overview
Digital marketing services provider
Real-time campaign management
Complex OLTP and OLAP environment
Challenges
Oracle RAC too expensive to scale
Queries too slow – some taking up to ½ hour
Getting worse – 30-50% data growth expected
Looked for 9 months for a cost-effective solution
Initial Results
¼ the cost with commodity scale-out
3-7x faster through parallelized queries
10-20x price/performance with no application, BI, or ETL rewrites
[Solution diagram: cross-channel campaigns, real-time personalization, real-time actions]
Use Cases
 Digital Marketing – campaign management, unified customer profile, real-time personalization
 Data Lake – operational reporting and analytics, operational data stores
 Fraud Detection
 Personalized Medicine
 Internet of Things – network monitoring, cyber-threat security, wearables and sensors
Reference Architecture: Operational Apps
Provide affordable scale-out for applications with a high concurrency of real-time reads/writes
[Diagram: 3rd-party data sources feed an operational app (e.g., Unica Campaign Mgmt) used by customers and operational employees, with operational reports & analytics on top]
Reference Architecture: Operational Data Lake
Offload real-time reporting and analytics from expensive OLTP and DW systems
[Diagram: OLTP systems (ERP, CRM, supply chain, HR, …) feed an operational data lake via stream or batch updates; the data lake serves operational reports & analytics and real-time, event-driven apps, and feeds the data warehouse/datamart via ETL for executive business reports and ad hoc analytics]
Reference Architecture: Unified Customer Profile
Improve marketing ROI with deeper customer intelligence and better cross-channel coordination
[Diagram: a unified customer profile (aka DMP) is fed, via stream or batch updates, by web/eCommerce clickstreams, social feeds, 1st-party/CRM data, 3rd-party data (e.g., Acxiom), ad performance data (e.g., DoubleClick), email marketing data, call center data, and POS data; it supplies real-time personalization data to the website and to a demand-side platform (DSP)/ad exchange, drives an email marketing app, and feeds a datamart for operational reports on campaign performance, ad hoc audience segmentation, and BI tools]
Customer Performance Benchmarks
Typically 10x price/performance improvement
[Chart: per-customer results ranging from 3-7x to 30x in speed and 5x to 30x in price/performance vs. the incumbent RDBMS]
Combines the Best of Both Worlds
Hadoop
 Scale-out on commodity servers
 Proven to 100s of petabytes
 Efficiently handles sparse data
 Extensive ecosystem
RDBMS
 ANSI SQL
 Real-time updates
 ACID transactions
 ODBC/JDBC support
Product Overview
Proven Building Blocks: Hadoop and Derby
Apache Derby
 ANSI SQL-99 RDBMS
 Java-based
 ODBC/JDBC compliant
Apache HBase/HDFS
 Auto-sharding
 Real-time updates
 Fault tolerance
 Scalability to 100s of PBs
 Data replication
Derby
 100% Java ANSI SQL RDBMS – CLI, JDBC, embedded (see the connection sketch below)
 Modular, lightweight, Unicode
 Authentication and authorization
 Concurrency
 Project history
 Started as Cloudscape in 1996
 Acquired by Informix… then IBM…
 IBM contributed the code to Apache in 2004
 An active Apache project with conservative development
 DB2 influence – many of the same limits/features
 Has Oracle's stamp of approval – shipped as Java DB and included in JDK 6
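For illustration, a minimal sketch of using Derby embedded over JDBC (the in-memory database name and the ACCOUNTS table are made up for this example; derby.jar on the classpath is assumed):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DerbyEmbeddedDemo {
    public static void main(String[] args) throws Exception {
        // In-memory embedded database, created on first connect (standard Derby URL syntax)
        Connection conn = DriverManager.getConnection("jdbc:derby:memory:demo;create=true");
        try (Statement st = conn.createStatement()) {
            st.executeUpdate("CREATE TABLE ACCOUNTS (ID INT PRIMARY KEY, BALANCE DECIMAL(10,2))");
            st.executeUpdate("INSERT INTO ACCOUNTS VALUES (1, 100.00), (2, 250.00)");
            try (ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM ACCOUNTS")) {
                rs.next();
                System.out.println("Rows: " + rs.getInt(1));
            }
        }
        conn.close();
    }
}

The same JDBC code works against a network server over the client driver, which is what makes the "no application rewrites" claim possible.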
Derby Advanced Features
 Java stored procedures (example registration below)
 Triggers
 Two-phase commit (XA support)
 Updatable SQL views
 Full transaction isolation support
 Encryption
 Custom functions
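A sketch of how a Java stored procedure is written and registered; the com.example.Procs class and CUSTOMERS table are hypothetical, while the nested-connection URL and CREATE PROCEDURE clauses follow standard Derby syntax:

package com.example;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class Procs {
    // Static method backing the stored procedure; runs inside the database engine
    public static void addCustomer(String name) throws Exception {
        // "jdbc:default:connection" is Derby's nested connection to the calling session
        try (Connection conn = DriverManager.getConnection("jdbc:default:connection");
             PreparedStatement ps = conn.prepareStatement("INSERT INTO CUSTOMERS(NAME) VALUES (?)")) {
            ps.setString(1, name);
            ps.executeUpdate();
        }
    }
}

// Registered and called with DDL/SQL such as:
//   CREATE PROCEDURE ADD_CUSTOMER(IN NAME VARCHAR(100))
//     PARAMETER STYLE JAVA LANGUAGE JAVA MODIFIES SQL DATA
//     EXTERNAL NAME 'com.example.Procs.addCustomer'
//   CALL ADD_CUSTOMER('Acme Corp')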
Splice SQL Processing
 PreparedStatement ps = conn.prepareStatement("SELECT * FROM T WHERE ID = ?");
1. Look up in cache using exact text match (skip to 6 if plan is found in cache)
2. Parse using JavaCC-generated parser
3. Bind to dictionary, acquire types
4. Optimize plan
5. Generate code for plan
6. Create instance of plan
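Because the step-1 cache lookup keys on the exact statement text, reusing a PreparedStatement (or resubmitting identical SQL with different parameters) skips steps 2-5. A minimal JDBC sketch, assuming a hypothetical table T with a VAL column:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PlanReuseDemo {
    static void run(Connection conn) throws Exception {
        // Prepared once: parse, bind, optimize, and code generation (or a statement-cache hit) happen here
        try (PreparedStatement ps = conn.prepareStatement("SELECT VAL FROM T WHERE ID = ?")) {
            for (int id = 1; id <= 3; id++) {
                ps.setInt(1, id);                        // only the parameter changes per execution
                try (ResultSet rs = ps.executeQuery()) { // each execution instantiates the cached plan
                    while (rs.next()) {
                        System.out.println(id + " -> " + rs.getString(1));
                    }
                }
            }
        }
    }
}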
Splice Details
 Parse Phase
 Forms explicit tree of query nodes representing statement
 Generate Phase
 Generate Java byte code (an Activation) directly into an in-memory byte array
 Loaded with special ClassLoader that loads from the byte array
 Binds arguments to proper types
 Optimize Phase
 Determine feasible join strategies
 Optimize based on cost estimates
 Execute Phase
 Instantiates arguments to represent specific statement state
 Expressions are methods on Activation
 Trees of ResultSets generated that represent the state of the query
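As a generic illustration of the Generate Phase technique (plain Java, not Splice's actual classes), a ClassLoader can define an Activation-style class straight from an in-memory byte array:

// Generic sketch: load generated byte code from memory without touching disk
public class ByteArrayClassLoader extends ClassLoader {
    private final String name;
    private final byte[] byteCode;

    public ByteArrayClassLoader(String name, byte[] byteCode, ClassLoader parent) {
        super(parent);
        this.name = name;
        this.byteCode = byteCode;
    }

    @Override
    protected Class<?> findClass(String className) throws ClassNotFoundException {
        if (className.equals(name)) {
            // Define the class directly from the in-memory byte array
            return defineClass(className, byteCode, 0, byteCode.length);
        }
        throw new ClassNotFoundException(className);
    }
}

// Usage sketch: byte[] generated = ...; // byte code emitted by the code generator
//   Class<?> activationClass =
//       new ByteArrayClassLoader("GeneratedActivation", generated, parentLoader)
//           .loadClass("GeneratedActivation");
//   Object activation = activationClass.getDeclaredConstructor().newInstance();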
Splice Modifications to Derby
For each component, the stock Derby implementation vs. the Splice version:
 Store – Derby: block, file-based; Splice: HBase tables
 Indexes – Derby: B-tree; Splice: dense index in an HBase table
 Concurrency – Derby: lock-based, ARIES; Splice: MVCC with snapshot isolation
 Project-restrict plan – Derby: predicates applied on a centralized file scanner; Splice: predicates pushed to shards and applied locally
 Aggregation plan – Derby: aggregation computed serially; Splice: aggregations pushed to shards and spliced together
 Join plan – Derby: centralized hash and NLJ chosen by the optimizer; Splice: distributed broadcast, sort-merge, merge, NLJ, and batch NLJ chosen by the optimizer
 Resource management – Derby: number of connections and memory limitations; Splice: task resource queues and write governor
HBase: Proven Scale-Out
 Auto-sharding
 Scales with commodity hardware
 Cost-effective from GBs to PBs
 High availability through failover and replication
 LSM-trees
Distributed, Parallelized Query Execution
Parallelized computation across cluster
Moves computation to the data
Utilizes HBase co-processors
No MapReduce
Splice HBase Extensions
 Asynchronous write pipeline
 Non-blocking, flushable writes
 Writes data, indexes, and constraints (index) concurrently
 Batches writes in chunks for bulk WAL edits vs. single WAL edits
 Synchronization-free internal scanner vs. synchronized external scanner
 Linux-scheduler-modeled resource manager
 Resource queues that handle DDL, DML, dictionary, and maintenance operations
 Sparse data support
 Efficiently stores sparse data
 Does not store nulls
Schema Advantages
 Non-blocking schema changes
 Add columns in a DDL transaction (see the sketch below)
 No read/write locks while adding columns
 Sparse data support
 Efficiently stores sparse data
 Does not store nulls
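A sketch of a non-blocking column add issued as ordinary DDL over JDBC; the table and column names are illustrative:

import java.sql.Connection;
import java.sql.Statement;

public class AddColumnDemo {
    static void addLoyaltyTier(Connection conn) throws Exception {
        // Runs as a DDL transaction; per the slide above, concurrent reads/writes are not locked out.
        // Existing rows simply carry no stored value for the new column (nulls are not stored).
        try (Statement st = conn.createStatement()) {
            st.executeUpdate("ALTER TABLE CUSTOMERS ADD COLUMN LOYALTY_TIER VARCHAR(20)");
        }
    }
}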
ANSI SQL-99 Coverage
 Data types – e.g., INTEGER, REAL, CHARACTER, DATE, BOOLEAN, BIGINT
 DDL – e.g., CREATE TABLE, CREATE SCHEMA, ALTER TABLE, DELETE, UPDATE
 Predicates – e.g., IN, BETWEEN, LIKE, EXISTS
 DML – e.g., INSERT, DELETE, UPDATE, SELECT
 Query specification – e.g., SELECT DISTINCT, GROUP BY, HAVING
 SET functions – e.g., UNION, ABS, MOD, ALL, CHECK
 Aggregation functions – e.g., AVG, MAX, COUNT
 String functions – e.g., SUBSTRING, concatenation, UPPER, LOWER, POSITION, TRIM, LENGTH
 Conditional functions – e.g., CASE, searched CASE
 Privileges – e.g., privileges for SELECT, DELETE, INSERT, EXECUTE
 Cursors – e.g., updatable, read-only, positioned DELETE/UPDATE
 Joins – e.g., INNER JOIN, LEFT OUTER JOIN
 Transactions – e.g., COMMIT, ROLLBACK, READ COMMITTED, REPEATABLE READ, READ UNCOMMITTED, snapshot isolation
 Sub-queries
 Triggers
 User-defined functions (UDFs)
 Views – including grouped views
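For illustration, one query that exercises several of the features listed above (INNER JOIN, BETWEEN, LIKE, COUNT/AVG aggregates, GROUP BY, HAVING); the ORDERS/CUSTOMERS schema is hypothetical:

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

public class Sql99Demo {
    // Summarize H1 2014 orders by region for customers whose name starts with 'A'
    static void regionSummary(Connection conn) throws Exception {
        String sql =
            "SELECT c.REGION, COUNT(*) AS ORDER_COUNT, AVG(o.TOTAL) AS AVG_TOTAL " +
            "FROM ORDERS o INNER JOIN CUSTOMERS c ON o.CUSTOMER_ID = c.ID " +
            "WHERE o.ORDER_DATE BETWEEN DATE('2014-01-01') AND DATE('2014-06-30') " +
            "AND c.NAME LIKE 'A%' " +
            "GROUP BY c.REGION " +
            "HAVING COUNT(*) > 10 " +
            "ORDER BY 2 DESC";  // order by the second select-list item (ORDER_COUNT)
        try (Statement st = conn.createStatement(); ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + ": " + rs.getInt(2));
            }
        }
    }
}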
Lockless, ACID Transactions
State-of-the-art snapshot isolation
Adds multi-row, multi-table transactions to HBase, with rollback
Fast, lockless, high concurrency
ZooKeeper coordination
Extends research from Google Percolator, Yahoo Labs, and the University of Waterloo
[Diagram: concurrent transactions A, B, and C with start (Ts) and commit (Tc) timestamps]
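A sketch of a multi-row transaction as an application sees it through JDBC; the ACCOUNTS table is hypothetical, and commit/rollback map onto the snapshot-isolation transactions described above:

import java.sql.Connection;
import java.sql.PreparedStatement;

public class TransferDemo {
    // Moves funds between two accounts atomically: either both updates commit or both roll back
    static void transfer(Connection conn, int from, int to, java.math.BigDecimal amount) throws Exception {
        conn.setAutoCommit(false);  // start an explicit transaction
        try (PreparedStatement debit = conn.prepareStatement(
                 "UPDATE ACCOUNTS SET BALANCE = BALANCE - ? WHERE ID = ?");
             PreparedStatement credit = conn.prepareStatement(
                 "UPDATE ACCOUNTS SET BALANCE = BALANCE + ? WHERE ID = ?")) {
            debit.setBigDecimal(1, amount);
            debit.setInt(2, from);
            debit.executeUpdate();
            credit.setBigDecimal(1, amount);
            credit.setInt(2, to);
            credit.executeUpdate();
            conn.commit();              // both rows become visible together
        } catch (Exception e) {
            conn.rollback();            // neither update is applied
            throw e;
        }
    }
}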
BI and SQL tool support via ODBC
No application rewrites needed
Demonstration
 ANSI SQL – comprehensive core SQL-99, standard SQL tool connectivity, real-time CRUD operations, seamless BI integration
 ACID Transactions – state-of-the-art snapshot isolation, extends the Google Percolator design
 Real-Time Apps – an eCommerce website powered by Splice Machine
 Operational Analytics – complex OLAP aggregations
SQL Database Ecosystem
[Chart: database workloads positioned by cost]
Operational (OLTP + OLAP): high concurrency; ingest via real-time updates; operates on 100s of records at a time
Ad-hoc analytics: low concurrency; ingest via batch loads; scans PBs of data at a time
Lower cost: commodity hardware, 10x price/performance
Higher cost: proprietary/custom hardware, millions of dollars
SQL Database Ecosystem
[Chart: the same quadrants populated with database categories – in-memory RDBMS, NewSQL, and MPP on proprietary hardware at the higher-cost end; Hadoop RDBMS, NewSQL, SQL-on-Hadoop, and SQL-on-HBase (e.g., Phoenix) at the lower-cost end]
What People are Saying…
Recognized as a key innovator in databases
Quotes
"Scaling out on Splice Machine presented some major benefits over Oracle … automatic balancing between clusters … avoiding the costly licensing issues."
"An alternative to today's RDBMSes, Splice Machine effectively combines traditional relational database technology with the scale-out capabilities of Hadoop."
"The unique claim of … Splice Machine is that it can run transactional applications as well as support analytics on top of Hadoop."
Awards
[Award logos]
Summary
The only Hadoop RDBMS: replace your old RDBMS with a scale-out SQL database
Affordable, scale-out ACID transactions
No application rewrites
10x better price/performance
Next Steps
Try Us!
 Free download
 Proof of concept