Turning Relational Database Tables into Hadoop Datasources by Kuassi Mensah

Copyright © 2015, Oracle and/or its affiliates. All rights reserved.
Kuassi Mensah
Director of Product Management
Java & Hadoop Products for the DB
@kmensah – db360.blogspot.com

Safe Harbor Statement
The following is intended to outline our general product direction. It is intended for
information purposes only, and may not be incorporated into any contract. It is not a
commitment to deliver any material, code, or functionality, and should not be relied upon
in making purchasing decisions. The development, release, and timing of any features or
functionality described for Oracle’s products remains at the sole discretion of Oracle.

Speaker Bio
• Director of Product Management at Oracle
(i) Java integration with the Oracle database (JDBC, UCP, Java in the database)
(ii) Oracle Datasource for Hadoop (OD4H), upcoming OD for Spark, OD for Flink and so on
(iii) JavaScript/Nashorn integration with the Oracle database (DB access, JS stored procedures, fluent JS)
• MS CS from the Programming Institute of University of Paris
• Frequent speaker
JavaOne, Oracle Open World, Data Summit, Node Summit, Oracle User groups (UKOUG, DOAG, OUGN,
BGOUG, OUGF, GUOB, ArOUG, ORAMEX, Sangam, OTNYathra, China, Thailand, etc.)
• Author: Oracle Database Programming using Java and Web Services
• @kmensah, http://db360.blogspot.com/, https://www.linkedin.com/in/kmensah

Program Agenda
1. Big Data Analytics Requirements
2. Opportunity: Hive Storage Handler
3. Storage Handler Implementation for Oracle

Big Data Analytics
• Goal: furnish actionable information to support business decision-making.
• Example
“Which of our products got a rating of four stars or higher on social media in the last quarter?”

Big Data Analytics and Requirements
• Master Data
• Big Data (Weblogs, Facts, Scans, Events, IoT)

Accessing Master Data in RDBMS
• ETL Copy
– Preplanned/scheduled
    • What to copy and when?
    • Always behind
– Copy is protected using Hadoop file-level security
– Tools: Apache Sqoop, Oracle CopyToBDA
• Direct Access from Hadoop
– Ad-hoc queries, always current
– Hive SQL, Spark SQL, Impala*, other SQL engines
– Hadoop APIs
– Database security
– Tool: Oracle Datasource for Hadoop (OD4H)

Direct Access From Hadoop
Dummy Example
SELECT HiveTab.First_Name, HiveTab.Last_Name, OraTab.bonus
FROM HiveTab JOIN OraTab ON (HiveTab.Emp_ID = OraTab.Emp_ID)
WHERE salary > 70000 AND bonus > 7000;


Hadoop 2.0 Architecture – Storage Handler
[Architecture diagram. Labels: Data; HDFS and NoSQL (redundant storage); YARN (compute resources + scheduler); Batch (MapReduce); Hive SQL; Spark (in-memory); Big Data SQL; Mahout (ML libs); HCatalog with InputFormat, OutputFormat, and SerDe; Storage Handler exposing RDBMS table(s) as an external table.]

Storage Handler Interface
https://cwiki.apache.org/confluence/display/Hive/StorageHandlers
package org.apache.hadoop.hive.ql.metadata;

import java.util.Map;
import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.hive.metastore.HiveMetaHook;
import org.apache.hadoop.hive.ql.plan.TableDesc;
import org.apache.hadoop.hive.serde2.SerDe;
import org.apache.hadoop.mapred.InputFormat;
import org.apache.hadoop.mapred.OutputFormat;

public interface HiveStorageHandler extends Configurable {
    public Class<? extends InputFormat> getInputFormatClass();
    public Class<? extends OutputFormat> getOutputFormatClass();
    public Class<? extends SerDe> getSerDeClass();
    public HiveMetaHook getMetaHook();
    public void configureTableJobProperties(
        TableDesc tableDesc,
        Map<String, String> jobProperties);
}

RDBMS Table as Hive External Table
DDL
CREATE EXTERNAL TABLE Hadoop_employees (
    EMPLOYEE_ID INT, FIRST_NAME STRING, LAST_NAME STRING,
    SALARY DOUBLE, ...)
STORED BY '<RDBMS-specific storage handler class>'
TBLPROPERTIES (
    ...
    'mapreduce.jdbc.input.table.name' = 'EMPLOYEES',
    ...
);


Oracle Datasource for Hadoop (OD4H)
Direct, parallel, fast, secure and consistent access to master data
[Diagram. Labels: Hive, Impala*, Spark SQL, Mahout, Other; YARN; HCatalog; OD4H StorageHandler, InputFormat, OutputFormat, SerDe; Oracle Table.]

Parallel Access to Oracle Table: Splitter Patterns
• SINGLE_SPLITTER
• ROW_SPLITTER: number of rows per split set in oracle.hcat.osh.rowsPerSplit
• BLOCK_SPLITTER: maximum number of splits set by oracle.hcat.osh.maxStorageBasedSplits
• PARTITION_SPLITTER
• CUSTOM_SPLITTER: a user-defined SELECT statement, given in oracle.hcat.osh.chunkSQL, that emits the ROWIDs corresponding to the start and end of each split
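
As a rough illustration of the row-based splitting idea, the sketch below cuts a table of a known size into contiguous row ranges of at most a fixed number of rows. It is not OD4H code: the class and method names are hypothetical, and it assumes splits are plain 1-based row ranges, which may differ from how OD4H actually derives its splits.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of row-count-based splitting; not OD4H code. */
final class RowSplitSketch {

    /** Returns inclusive, 1-based [firstRow, lastRow] ranges that cover totalRows. */
    static List<long[]> rowRanges(long totalRows, long rowsPerSplit) {
        if (rowsPerSplit <= 0) {
            throw new IllegalArgumentException("rowsPerSplit must be positive");
        }
        List<long[]> ranges = new ArrayList<>();
        for (long first = 1; first <= totalRows; first += rowsPerSplit) {
            // The last range may be shorter than rowsPerSplit.
            long last = Math.min(first + rowsPerSplit - 1, totalRows);
            ranges.add(new long[] {first, last});
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Ten rows, four rows per split: prints 1-4, 5-8, 9-10 on separate lines.
        for (long[] r : rowRanges(10, 4)) {
            System.out.println(r[0] + "-" + r[1]);
        }
    }
}
```

Each range could then be handed to one Hadoop task, so a smaller rows-per-split value yields more parallel tasks and more concurrent database sessions.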

Split Pattern for Partitioned Oracle Table
CREATE EXTERNAL TABLE Hadoop_employees (
    EMPLOYEE_ID INT, FIRST_NAME STRING, LAST_NAME STRING,
    SALARY DOUBLE, HIRE_DATE TIMESTAMP,
    JOB_ID STRING)
STORED BY 'oracle.hcat.osh.storagehandler.OracleStorageHandler'
TBLPROPERTIES (
    'mapreduce.jdbc.url' = 'jdbc:oracle:thin:@localhost:1521:orcl',
    'mapreduce.jdbc.username' = 'foobar',
    'mapreduce.jdbc.password' = 'dontdothis',
    'mapreduce.jdbc.input.table.name' = 'EMPLOYEES',
    'oracle.hcat.osh.splitterKind' = 'PARTITION_SPLITTER'
);

How Oracle Datasource for Hadoop Works
1. From the TBLPROPERTIES in HCatalog, get a secure connection to the database
2. Generate database splits with an SCN, based on the query and split pattern
3. For each split, rewrite the sub-query into Oracle SQL
4. Each split is processed by a Hadoop task
5. Matching rows are returned to the Hadoop query coordinator
[Diagram. Labels: Hive query; Hadoop cluster; execution plan (partial); Oracle Datasource; Oracle table.]
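
The per-split rewrite can be pictured with a small sketch. The code below builds one query per split that reads only the projected columns, restricts rows to a ROWID range, and pins the read to a fixed SCN with an Oracle flashback clause so that every split sees the same snapshot. This is an assumption-laden illustration: the class name, method name, and the exact shape of the SQL are hypothetical and are not the statements OD4H actually generates.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of a per-split query; not the SQL that OD4H actually emits. */
final class SplitQuerySketch {

    /**
     * Builds a query for one split. The ROWID bounds are left as JDBC bind
     * variables (?), to be set by the task that processes the split.
     * Table, column, and predicate text are concatenated as-is, so real code
     * would have to validate them first.
     */
    static String buildSplitQuery(String table, List<String> columns,
                                  String predicate, long scn) {
        StringBuilder sql = new StringBuilder("SELECT ")
                .append(String.join(", ", columns))      // column projection
                .append(" FROM ").append(table)
                .append(" AS OF SCN ").append(scn)       // consistent snapshot
                .append(" WHERE ROWID BETWEEN ? AND ?"); // split boundaries
        if (predicate != null && !predicate.isEmpty()) {
            sql.append(" AND (").append(predicate).append(")"); // pushed-down filter
        }
        return sql.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildSplitQuery(
                "EMPLOYEES", Arrays.asList("FIRST_NAME", "LAST_NAME"),
                "SALARY > 70000", 123456L));
    }
}
```

Because every split uses the same SCN, rows changed in the database while the Hadoop job runs would not produce an inconsistent result across tasks.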

Putting Everything Together
[Diagram. Labels: (1) Hive DDL, HCatalog; (2) Hive query, Oracle Storage Handler; Oracle table divided into granules; splits; Job Tracker; MapReduce MapTasks; rewritten sub-queries; JDBC connections.]

OD4H Features
• Performance and Scalability
• Resource Management and Consistency
• Security
• High Availability
• Data Types
• OutputFormat: Write back

OD4H - Performance and Scalability
Fully exploit Hadoop clusters and Oracle database servers
• Splitter Patterns
• Optimized JDBC Driver
• Connection Caching
• Integration with Database Resident Connection Pool (DRCP)
• Projection & Predicate Pushdown
• Partition Pruning

OD4H - Resource Management
• MaxSplit
• DRCP: maxconnections
• Hadoop: mapred.tasktracker.map.tasks.maximum in conf/mapred-site.xml
• Spark: spark.dynamicAllocation.enabled
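
To see how these settings interact on the MapReduce side, a back-of-the-envelope estimate of peak database connections may help. The sketch assumes each running map task holds exactly one JDBC connection, which is an assumption rather than a documented OD4H guarantee; the class and method names are hypothetical.

```java
/** Hypothetical sizing helper; assumes one JDBC connection per running map task. */
final class ConnectionBudgetSketch {

    /**
     * Upper bound on concurrent connections: tasks cannot outnumber the splits,
     * nor the map slots available across the cluster.
     */
    static long peakConnections(long splits, int nodes, int mapTasksPerNode) {
        long slots = (long) nodes * mapTasksPerNode;
        return Math.min(splits, slots);
    }

    public static void main(String[] args) {
        // 200 splits on 10 nodes with 8 map slots each: at most 80 connections.
        System.out.println(peakConnections(200, 10, 8));
    }
}
```

Under this assumption, a connection limit on the database side that is lower than the estimate would leave some tasks waiting for a connection, so the split cap and the map-task limit are worth tuning together.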

OD4H - Security
• Simple and Strong Authentication
– Username/password
– Wallet
– Kerberos
• Encryption and Integrity
• JVM System Properties
• Hive/Hadoop/Spark environment variables

OD4H – OutputFormat
OD4H allows writing the result of a Hive query back to an Oracle table
INSERT INTO EmployeeBonusReport
SELECT EmployeeDataSimple.First_Name, EmployeeDataSimple.Last_Name,
EmployeeBonus.bonus FROM EmployeeDataSimple
JOIN EmployeeBonus ON
(EmployeeDataSimple.Emp_ID = EmployeeBonus.Emp_ID)
WHERE salary > 70000 AND bonus > 7000

Summary
• Support for Hive SQL, Spark-SQL, Impala*
• Support for MapReduce, Pig, etc.
• Secure and reliable authentication: Kerberos authentication, SSL, Oracle Wallet
• Efficient translation of HQL to Oracle SQL
• Scalability: splits based on DB meta-data
• Column Projection Pushdown
• Predicate Pushdown
• Partition Pruning
• Connection caching
• Consistent Read (SCN)
• Writing back to Oracle
• Free for Oracle Big Data Appliance (BDA)
• Included in Oracle Big Data Cloud Service & Big Data Cloud Service Compute Edition
• Other Hadoop clusters: priced as an Oracle Big Data Connector

Resources
• Oracle Datasource for Hadoop (OD4H)
http://bit.ly/2j1kSIT (landing page, white paper, etc)
Download @ http://bit.ly/2v36Wnf
• Oracle Big Data Connectors
https://www.oracle.com/database/big-data-connectors/index.html
• Big Data Cloud Service
https://cloud.oracle.com/en_US/big-data
• Big Data Cloud Service - Compute Edition
https://cloud.oracle.com/en_US/big-data-compute-edition
