Replace Oracle with Hadoop
John Leach, Co-Founder and CTO
August 4, 2014
Data Doubling Every 2 Years…
Driven by web, social, mobile, and Internet of Things
Source: 2013 IBM Briefing Book
Traditional RDBMSs Overwhelmed…
Scale-up becoming cost-prohibitive
"Oracle is too darn expensive!"
"My DB is hitting the wall."
"Users keep getting those spinning beach balls."
"We have to throw data away."
"Our reports take forever."
Scale-Out: The Future of Databases
Dramatic improvement in price/performance
[Diagram: Scale Up (increase server size) vs. Scale Out (more small servers)]
Who Are We?
The only Hadoop RDBMS: replace your old RDBMS with a scale-out SQL database
Affordable, scale-out ACID transactions
No application rewrites
10x better price/performance
Case Study: Harte-Hanks
Overview
Digital marketing services provider
Real-time campaign management
Complex OLTP and OLAP environment
Challenges
Oracle RAC too expensive to scale
Queries too slow – some taking up to ½ hour
Getting worse – 30-50% data growth expected
Looked for 9 months for a cost-effective solution
Initial Results
¼ the cost with commodity scale-out
3-7x faster through parallelized queries
10-20x price/performance with no application, BI, or ETL rewrites
[Solution diagram: cross-channel campaigns, real-time personalization, real-time actions]
Use Cases
 Digital Marketing – campaign management, unified customer profile, real-time personalization
 Data Lake – operational reporting and analytics, operational data stores
 Fraud Detection
 Personalized Medicine
 Internet of Things – network monitoring, cyber-threat security, wearables and sensors
Reference Architecture: Operational Apps
Provide affordable scale-out for applications with a high concurrency of real-time reads/writes
[Diagram: 3rd-party data sources feed an operational app (e.g., Unica Campaign Mgmt) used by customers and operational employees, with operational reports & analytics on top]
Reference Architecture: Operational Data Lake
Offload real-time reporting and analytics from expensive OLTP and DW systems
[Diagram: OLTP systems (ERP, CRM, supply chain, HR, …) feed an operational data lake via stream or batch updates; the data lake serves operational reports & analytics and real-time, event-driven apps, and feeds the data warehouse/datamart via ETL for executive business reports and ad hoc analytics]
Reference Architecture: Unified Customer Profile
Improve marketing ROI with deeper customer intelligence and better cross-channel coordination
[Diagram: a unified customer profile (aka DMP) is fed, via stream or batch updates, by web/eCommerce clickstreams, social feeds, 1st-party/CRM data, 3rd-party data (e.g., Acxiom), ad performance data (e.g., DoubleClick), email marketing data, call center data, and POS data; it supplies real-time personalization data to the website and to a demand-side platform (DSP)/ad exchange, drives an email marketing app, and feeds a datamart for operational reports on campaign performance, ad hoc audience segmentation, and BI tools]
Customer Performance Benchmarks
Typically 10x price/performance improvement
[Chart: per-customer results ranging from 3-7x to 30x in speed and 5x to 30x in price/performance vs. the incumbent RDBMS]
Combines the Best of Both Worlds
Hadoop
 Scale-out on commodity servers
 Proven to 100s of petabytes
 Efficiently handles sparse data
 Extensive ecosystem
RDBMS
 ANSI SQL
 Real-time updates
 ACID transactions
 ODBC/JDBC support
Product Overview
Proven Building Blocks: Hadoop and Derby
Apache Derby
 ANSI SQL-99 RDBMS
 Java-based
 ODBC/JDBC compliant
Apache HBase/HDFS
 Auto-sharding
 Real-time updates
 Fault tolerance
 Scalability to 100s of PBs
 Data replication
Derby
 100% Java ANSI SQL RDBMS – CLI, JDBC, embedded (see the connection sketch below)
 Modular, lightweight, Unicode
 Authentication and authorization
 Concurrency
 Project history
 Started as Cloudscape in 1996
 Acquired by Informix… then IBM…
 IBM contributed the code to Apache in 2004
 An active Apache project with conservative development
 DB2 influence – many of the same limits/features
 Has Oracle's stamp of approval – shipped as Java DB and included in JDK 6
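For illustration, a minimal sketch of using Derby embedded over JDBC (the in-memory database name and the ACCOUNTS table are made up for this example; derby.jar on the classpath is assumed):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DerbyEmbeddedDemo {
    public static void main(String[] args) throws Exception {
        // In-memory embedded database, created on first connect (standard Derby URL syntax)
        Connection conn = DriverManager.getConnection("jdbc:derby:memory:demo;create=true");
        try (Statement st = conn.createStatement()) {
            st.executeUpdate("CREATE TABLE ACCOUNTS (ID INT PRIMARY KEY, BALANCE DECIMAL(10,2))");
            st.executeUpdate("INSERT INTO ACCOUNTS VALUES (1, 100.00), (2, 250.00)");
            try (ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM ACCOUNTS")) {
                rs.next();
                System.out.println("Rows: " + rs.getInt(1));
            }
        }
        conn.close();
    }
}

The same JDBC code works against a network server over the client driver, which is what makes the "no application rewrites" claim possible.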
Derby Advanced Features
 Java stored procedures (example registration below)
 Triggers
 Two-phase commit (XA support)
 Updatable SQL views
 Full transaction isolation support
 Encryption
 Custom functions
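A sketch of how a Java stored procedure is written and registered; the com.example.Procs class and CUSTOMERS table are hypothetical, while the nested-connection URL and CREATE PROCEDURE clauses follow standard Derby syntax:

package com.example;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class Procs {
    // Static method backing the stored procedure; runs inside the database engine
    public static void addCustomer(String name) throws Exception {
        // "jdbc:default:connection" is Derby's nested connection to the calling session
        try (Connection conn = DriverManager.getConnection("jdbc:default:connection");
             PreparedStatement ps = conn.prepareStatement("INSERT INTO CUSTOMERS(NAME) VALUES (?)")) {
            ps.setString(1, name);
            ps.executeUpdate();
        }
    }
}

// Registered and called with DDL/SQL such as:
//   CREATE PROCEDURE ADD_CUSTOMER(IN NAME VARCHAR(100))
//     PARAMETER STYLE JAVA LANGUAGE JAVA MODIFIES SQL DATA
//     EXTERNAL NAME 'com.example.Procs.addCustomer'
//   CALL ADD_CUSTOMER('Acme Corp')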
Splice SQL Processing
 PreparedStatement ps = conn.prepareStatement("SELECT * FROM T WHERE ID = ?");
1. Look up in cache using exact text match (skip to 6 if plan is found in cache)
2. Parse using JavaCC-generated parser
3. Bind to dictionary, acquire types
4. Optimize plan
5. Generate code for plan
6. Create instance of plan
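Because the step-1 cache lookup keys on the exact statement text, reusing a PreparedStatement (or resubmitting identical SQL with different parameters) skips steps 2-5. A minimal JDBC sketch, assuming a hypothetical table T with a VAL column:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PlanReuseDemo {
    static void run(Connection conn) throws Exception {
        // Prepared once: parse, bind, optimize, and code generation (or a statement-cache hit) happen here
        try (PreparedStatement ps = conn.prepareStatement("SELECT VAL FROM T WHERE ID = ?")) {
            for (int id = 1; id <= 3; id++) {
                ps.setInt(1, id);                        // only the parameter changes per execution
                try (ResultSet rs = ps.executeQuery()) { // each execution instantiates the cached plan
                    while (rs.next()) {
                        System.out.println(id + " -> " + rs.getString(1));
                    }
                }
            }
        }
    }
}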
Splice Details
 Parse Phase
 Forms explicit tree of query nodes representing statement
 Generate Phase
 Generate Java byte code (an Activation) directly into an in-memory byte array
 Loaded with special ClassLoader that loads from the byte array
 Binds arguments to proper types
 Optimize Phase
 Determine feasible join strategies
 Optimize based on cost estimates
 Execute Phase
 Instantiates arguments to represent specific statement state
 Expressions are methods on Activation
 Trees of ResultSets generated that represent the state of the query
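As a generic illustration of the Generate Phase technique (plain Java, not Splice's actual classes), a ClassLoader can define an Activation-style class straight from an in-memory byte array:

// Generic sketch: load generated byte code from memory without touching disk
public class ByteArrayClassLoader extends ClassLoader {
    private final String name;
    private final byte[] byteCode;

    public ByteArrayClassLoader(String name, byte[] byteCode, ClassLoader parent) {
        super(parent);
        this.name = name;
        this.byteCode = byteCode;
    }

    @Override
    protected Class<?> findClass(String className) throws ClassNotFoundException {
        if (className.equals(name)) {
            // Define the class directly from the in-memory byte array
            return defineClass(className, byteCode, 0, byteCode.length);
        }
        throw new ClassNotFoundException(className);
    }
}

// Usage sketch: byte[] generated = ...; // byte code emitted by the code generator
//   Class<?> activationClass =
//       new ByteArrayClassLoader("GeneratedActivation", generated, parentLoader)
//           .loadClass("GeneratedActivation");
//   Object activation = activationClass.getDeclaredConstructor().newInstance();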
Splice Modifications to Derby
For each component, the stock Derby implementation vs. the Splice version:
 Store – Derby: block, file-based; Splice: HBase tables
 Indexes – Derby: B-tree; Splice: dense index in an HBase table
 Concurrency – Derby: lock-based, ARIES; Splice: MVCC with snapshot isolation
 Project-restrict plan – Derby: predicates applied on a centralized file scanner; Splice: predicates pushed to shards and applied locally
 Aggregation plan – Derby: aggregation computed serially; Splice: aggregations pushed to shards and spliced together
 Join plan – Derby: centralized hash and NLJ chosen by the optimizer; Splice: distributed broadcast, sort-merge, merge, NLJ, and batch NLJ chosen by the optimizer
 Resource management – Derby: number of connections and memory limitations; Splice: task resource queues and write governor
HBase: Proven Scale-Out
 Auto-sharding
 Scales with commodity hardware
 Cost-effective from GBs to PBs
 High availability through failover and replication
 LSM-trees
Distributed, Parallelized Query Execution
Parallelized computation across cluster
Moves computation to the data
Utilizes HBase co-processors
No MapReduce
Splice HBase Extensions
 Asynchronous write pipeline
 Non-blocking, flushable writes
 Writes data, indexes, and constraints (index) concurrently
 Batches writes in chunks for bulk WAL edits vs. single WAL edits
 Synchronization-free internal scanner vs. synchronized external scanner
 Linux-scheduler-modeled resource manager
 Resource queues that handle DDL, DML, dictionary, and maintenance operations
 Sparse data support
 Efficiently stores sparse data
 Does not store nulls
Schema Advantages
 Non-blocking schema changes
 Add columns in a DDL transaction (see the sketch below)
 No read/write locks while adding columns
 Sparse data support
 Efficiently stores sparse data
 Does not store nulls
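A sketch of a non-blocking column add issued as ordinary DDL over JDBC; the table and column names are illustrative:

import java.sql.Connection;
import java.sql.Statement;

public class AddColumnDemo {
    static void addLoyaltyTier(Connection conn) throws Exception {
        // Runs as a DDL transaction; per the slide above, concurrent reads/writes are not locked out.
        // Existing rows simply carry no stored value for the new column (nulls are not stored).
        try (Statement st = conn.createStatement()) {
            st.executeUpdate("ALTER TABLE CUSTOMERS ADD COLUMN LOYALTY_TIER VARCHAR(20)");
        }
    }
}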
ANSI SQL-99 Coverage
 Data types – e.g., INTEGER, REAL, CHARACTER, DATE, BOOLEAN, BIGINT
 DDL – e.g., CREATE TABLE, CREATE SCHEMA, ALTER TABLE, DELETE, UPDATE
 Predicates – e.g., IN, BETWEEN, LIKE, EXISTS
 DML – e.g., INSERT, DELETE, UPDATE, SELECT
 Query specification – e.g., SELECT DISTINCT, GROUP BY, HAVING
 SET functions – e.g., UNION, ABS, MOD, ALL, CHECK
 Aggregation functions – e.g., AVG, MAX, COUNT
 String functions – e.g., SUBSTRING, concatenation, UPPER, LOWER, POSITION, TRIM, LENGTH
 Conditional functions – e.g., CASE, searched CASE
 Privileges – e.g., privileges for SELECT, DELETE, INSERT, EXECUTE
 Cursors – e.g., updatable, read-only, positioned DELETE/UPDATE
 Joins – e.g., INNER JOIN, LEFT OUTER JOIN
 Transactions – e.g., COMMIT, ROLLBACK, READ COMMITTED, REPEATABLE READ, READ UNCOMMITTED, snapshot isolation
 Sub-queries
 Triggers
 User-defined functions (UDFs)
 Views – including grouped views
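For illustration, one query that exercises several of the features listed above (INNER JOIN, BETWEEN, LIKE, COUNT/AVG aggregates, GROUP BY, HAVING); the ORDERS/CUSTOMERS schema is hypothetical:

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

public class Sql99Demo {
    // Summarize H1 2014 orders by region for customers whose name starts with 'A'
    static void regionSummary(Connection conn) throws Exception {
        String sql =
            "SELECT c.REGION, COUNT(*) AS ORDER_COUNT, AVG(o.TOTAL) AS AVG_TOTAL " +
            "FROM ORDERS o INNER JOIN CUSTOMERS c ON o.CUSTOMER_ID = c.ID " +
            "WHERE o.ORDER_DATE BETWEEN DATE('2014-01-01') AND DATE('2014-06-30') " +
            "AND c.NAME LIKE 'A%' " +
            "GROUP BY c.REGION " +
            "HAVING COUNT(*) > 10 " +
            "ORDER BY 2 DESC";  // order by the second select-list item (ORDER_COUNT)
        try (Statement st = conn.createStatement(); ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + ": " + rs.getInt(2));
            }
        }
    }
}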
Lockless, ACID Transactions
State-of-the-art snapshot isolation
Adds multi-row, multi-table transactions to HBase, with rollback
Fast, lockless, high concurrency
ZooKeeper coordination
Extends research from Google Percolator, Yahoo Labs, and the University of Waterloo
[Diagram: concurrent transactions A, B, and C with start (Ts) and commit (Tc) timestamps]
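A sketch of a multi-row transaction as an application sees it through JDBC; the ACCOUNTS table is hypothetical, and commit/rollback map onto the snapshot-isolation transactions described above:

import java.sql.Connection;
import java.sql.PreparedStatement;

public class TransferDemo {
    // Moves funds between two accounts atomically: either both updates commit or both roll back
    static void transfer(Connection conn, int from, int to, java.math.BigDecimal amount) throws Exception {
        conn.setAutoCommit(false);  // start an explicit transaction
        try (PreparedStatement debit = conn.prepareStatement(
                 "UPDATE ACCOUNTS SET BALANCE = BALANCE - ? WHERE ID = ?");
             PreparedStatement credit = conn.prepareStatement(
                 "UPDATE ACCOUNTS SET BALANCE = BALANCE + ? WHERE ID = ?")) {
            debit.setBigDecimal(1, amount);
            debit.setInt(2, from);
            debit.executeUpdate();
            credit.setBigDecimal(1, amount);
            credit.setInt(2, to);
            credit.executeUpdate();
            conn.commit();              // both rows become visible together
        } catch (Exception e) {
            conn.rollback();            // neither update is applied
            throw e;
        }
    }
}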
BI and SQL tool support via ODBC
No application rewrites needed
Demonstration
 ANSI SQL – comprehensive core SQL-99, standard SQL tool connectivity, real-time CRUD operations, seamless BI integration
 ACID Transactions – state-of-the-art snapshot isolation, extends the Google Percolator design
 Real-Time Apps – an eCommerce website powered by Splice Machine
 Operational Analytics – complex OLAP aggregations
SQL Database Ecosystem
[Chart: database workloads positioned by cost]
Operational (OLTP + OLAP): high concurrency; ingest via real-time updates; operates on 100s of records at a time
Ad-hoc analytics: low concurrency; ingest via batch loads; scans PBs of data at a time
Lower cost: commodity hardware, 10x price/performance
Higher cost: proprietary/custom hardware, millions of dollars
SQL Database Ecosystem
[Chart: the same quadrants populated with database categories – in-memory RDBMS, NewSQL, and MPP on proprietary hardware at the higher-cost end; Hadoop RDBMS, NewSQL, SQL-on-Hadoop, and SQL-on-HBase (e.g., Phoenix) at the lower-cost end]
What People are Saying…
Recognized as a key innovator in databases
Quotes
"Scaling out on Splice Machine presented some major benefits over Oracle … automatic balancing between clusters … avoiding the costly licensing issues."
"An alternative to today's RDBMSes, Splice Machine effectively combines traditional relational database technology with the scale-out capabilities of Hadoop."
"The unique claim of … Splice Machine is that it can run transactional applications as well as support analytics on top of Hadoop."
Awards
[Award logos]
Summary
The only Hadoop RDBMS: replace your old RDBMS with a scale-out SQL database
Affordable, scale-out ACID transactions
No application rewrites
10x better price/performance
Next Steps
Try Us!
 Free download
 Proof of concept