There and back again.
or how to connect Oracle and Big Data.
GLEB OTOCHKIN
Gleb Otochkin
Started working with data in 1992
At Pythian since 2008
Area of expertise:
● Data Integration
● Oracle RAC
● Oracle engineered systems
● Virtualization
● Performance tuning
● Big Data
otochkin@pythian.com
@sky_vst
Principal Consultant
© The Pythian Group Inc., 2017 2
ABOUT PYTHIAN
Pythian’s 400+ IT professionals help companies adopt and manage disruptive technologies to better compete.
© 2016 Pythian. Confidential 3
Systems currently
managed by Pythian
EXPERIENCED
Pythian experts
in 35 countries
GLOBAL
Millennia of experience
gathered and shared
over 19 years
EXPERTS
11,800 2400
Journey of our data begins here.
AGENDA
• What do we mean by “Big Data”?
• Business cases.
• Real time replication from Oracle to Big Data.
• How to use Big Data in Oracle.
• What is next.
• Q&A
Big Data?
Big Data ecosystem
Big Data?
Here in Texas we call it
just Data.
Big Data ecosystem
[Word cloud: BIG, VOLUME, GROWS, ANY STRUCTURE, PROCESSING, ANALYSIS, COMPLEX, DISCOVERY]
Some Big Data tools and terms.
Some terms and tools used in Big Data:
• HDFS - Hadoop Distributed File System.
• Apache Hadoop - framework for distributed storage and processing.
• HBase - non-relational, distributed database.
• Kafka - distributed streaming platform.
• Flume - streaming data collection and aggregation framework.
• Cassandra - open-source distributed NoSQL database.
• MongoDB - open-source cross-platform document-oriented database.
• Hive - query and analysis data framework on top of Hadoop.
• Avro - data serialization format widely used in Big Data (JSON schema + binary data).
Business cases
Why do we need replication?
Business cases.
Business cases.
Operational activity on the DB with the rest of the data in a Data Lake.
[Diagram: RDBMS OLTP systems → BD platform → Data Preparation Engine → BI & Analysis]
Business cases.
Operational activity on the DB, data kept in Big Data, and BI on an RDBMS.
[Diagram: RDBMS OLTP systems → BD platform → Data Preparation Engine → RDBMS → BI & Analysis]
Business cases.
Operational and BI activity on the DB with the main data body in a Big Data platform.
[Diagram: RDBMS OLTP systems ↔ BD platform, queried together through ODI & BD SQL → BI & Analysis]
Business cases.
Operational and BI activity on the DB with the main data body in a Big Data platform.
[Diagram: RDBMS OLTP systems → Kafka → BD platforms and RDBMS → BI & Analysis]
Business cases.
Short summary
• How we connect different platforms:
•Data replication tools.
▪Oracle GoldenGate.
▪DBVisit.
▪Shareplex.
▪ELT tools
•Batch load tools.
▪Sqoop.
▪Oracle loader for Hadoop
•Data Integration tools.
▪Oracle Data Integrator.
•Presenting Big Data to RDBMS:
▪Oracle Big Data SQL.
▪Gluent.
• Reasons:
• Growing Data Volume.
• Unstructured data.
• Data retention policy.
• Cost.
Oracle to big data.
Real time replication OLTP data to a Big Data platform.
And what is wrong with Sqoop?
Batch offloading from Oracle to Big Data.
To HDFS.
• Sqoop is one of the oldest.
• Works in both directions.
• Gluent is more than just a batch tool.
[Diagram: Database → Sqoop → HDFS]
Replication to Big Data.
Sqoop import.
[Diagram: Apps → Database → (JDBC) Sqoop → MapReduce → HDFS / Hive]
• Generates MapReduce jobs.
• Connects through JDBC.
• Good for batch processing.
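As a sketch, a Sqoop import of the test table used later in this deck might look like the following. The host, port, service name, password file and target directory are placeholders, not values from the slides:

```shell
# Import one Oracle table into HDFS with 4 parallel mappers,
# splitting the work on the primary key column.
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/orcl \
  --username ggtest --password-file /user/oracle/.pwd \
  --table GGTEST.TEST_TAB_2 \
  --target-dir /user/oracle/test_tab_2 \
  --split-by PK_ID \
  --num-mappers 4
```

Adding `--hive-import` would create and load the corresponding Hive table as well, which is what makes Sqoop work "in both directions" for batch loads.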
Replication to Big Data.
Oracle GoldenGate:
• Pros.
•Real time streaming.
•No or minimal impact on the source.
•Supports most of the data types.
•Different formats and conversion.
•Enterprise support.
• Cons.
•Licensing fee.
•Closed source code.
Why GoldenGate?
Replication to Big Data.
Replication to HDFS by Oracle GoldenGate.
[Diagram: Apps → Database → Transaction Log → Oracle GoldenGate → OGG Trail → Oracle GoldenGate → HDFS]
Replication to Big Data.
Source side.
• Oracle 11.2.0.4 and up
• Archivelog mode.
• Supplemental logging:
•Minimal on DB level.
•On schema or table level.
• OGG user in database.
[Diagram: Apps → Database → Transaction Log]
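The source-side prerequisites above can be sketched in SQL and GGSCI. The user name, password and grants are illustrative only; adjust to your environment:

```sql
-- Run as SYSDBA: enable archivelog mode and minimal DB-level supplemental logging.
SHUTDOWN IMMEDIATE;
STARTUP MOUNT;
ALTER DATABASE ARCHIVELOG;
ALTER DATABASE OPEN;
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA;

-- Create the OGG user in the database (name and grants are examples).
CREATE USER ogg IDENTIFIED BY password DEFAULT TABLESPACE users;
GRANT CONNECT, RESOURCE TO ogg;
EXEC DBMS_GOLDENGATE_AUTH.GRANT_ADMIN_PRIVILEGE('OGG');

-- Schema-level supplemental logging is added from GGSCI, not SQL:
--   GGSCI> DBLOGIN USERID ogg, PASSWORD password
--   GGSCI> ADD SCHEMATRANDATA ggtest
```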
Replication to Big Data.
Oracle GoldenGate.
• OGG 12.2.0.1 and up.
• BD adapters with replicat.
• Supported BD targets:
•HDFS
•Kafka
•Flume
•HBase
•MongoDB (from 12.3)
•Cassandra (from 12.3)
• Different formats.
[Diagram: Transaction Log → Extract → OGG Trail → Data Pump → OGG Trail → Replicat, with a Manager process on each side]
Replication to Big Data.
To HDFS.
• OGG 12.2.0.1 and up.
• BD adapters with replicat.
• Different formats.
• DML and DDL.
[Diagram: OGG Trail → Replicat (Manager) → HDFS Client → HDFS]
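A minimal replicat properties sketch for the HDFS handler, assuming 12.2-era property names; the root path and classpath are illustrative and should be checked against the adapter documentation for your OGG version:

```properties
# Route all operations to the HDFS handler.
gg.handlerlist=hdfs
gg.handler.hdfs.type=hdfs
# Target directory tree in HDFS (example path).
gg.handler.hdfs.rootFilePath=/user/oracle/gg
# One of the supported formats: delimitedtext, json, avro_op, xml...
gg.handler.hdfs.format=delimitedtext
# tx = transactional mode, op = operation mode.
gg.handler.hdfs.mode=tx
# Hadoop client jars and config must be on the classpath.
gg.classpath=/usr/lib/hadoop/client/*:/etc/hadoop/conf
```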
Replication to Big Data.
Test two.
• New columns:
•Operation type.
•Table name
•Local and UTC timestamp
orcl> select * from ggtest.test_tab_2;
ID RND_STR USE_DATE
1 BGBXRKJL 02/13/16 08:34:19
2 FNMCEPWE 08/17/15 04:50:18
hive> select * from BDTEST.TEST_TAB_2;
OK
I BDTEST.TEST_TAB_2 2016-10-25 01:09:16.000168 2016-10-24T21:09:21.186000 00000000120000004759 1 BGBXRKJL 2016-02-13:08:34:19 NULL
I BDTEST.TEST_TAB_2 2016-10-25 01:09:16.000168 2016-10-24T21:09:22.827000 00000000120000004921 2 FNMCEPWE 2015-08-17:04:50:18 NULL
Replication to Big Data.
To Kafka.
[Diagram: OGG Trail → Replicat (Manager) → Kafka producer → Broker (Kafka topics, Zookeeper) → Kafka consumers / stream]
Replication to Big Data.
Some notes about source:
• We are replicating a log of changes.
• Only committed transactions.
• “Passive” commit.
• Different levels of supplemental logging may lead to different captured data (compressed DML).
• DDL only with following DML.
• CTAS is not supported.
and BD destination:
• JSON, Text, Avro (different types) and XML formats.
• Some strange behaviour with Avro.
• Hive support through text format.
• Trimming out leading or trailing whitespaces.
• Truncates only as DML.
• Proper path to Java classes.
Replication to Big Data.
Some notes about Kafka
handler:
• Different topologies for Kafka.
• Topic partitioning to table.
• Topics are created automatically.
• Operation vs transactional mode.
• Blocking vs Non-Blocking.
• Version 0.9+.
• Schema definition can go to a
different topic(Avro).
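To make the handler notes concrete, a replicat properties sketch for the Kafka handler, using 12.2-era property names; the topic names and producer config file are illustrative, not from the slides:

```properties
gg.handlerlist=kafkahandler
gg.handler.kafkahandler.type=kafka
# Standard Kafka producer settings (bootstrap.servers etc.) live in this file.
gg.handler.kafkahandler.KafkaProducerConfigFile=custom_kafka_producer.properties
gg.handler.kafkahandler.TopicName=oggtopic
# avro_op carries before/after images plus operation metadata.
gg.handler.kafkahandler.format=avro_op
# With Avro, schema definitions can go to a separate topic.
gg.handler.kafkahandler.SchemaTopicName=oggschema
# tx = transactional mode, op = operation mode.
gg.handler.kafkahandler.mode=tx
# Non-blocking send trades delivery guarantees for throughput.
gg.handler.kafkahandler.BlockingSend=false
```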
Kafka Connect and Kafka Streams.
The Confluent team and their additions.
• Kafka Connect.
• Framework to build connectors.
• https://www.confluent.io/product/connectors/
• Main purpose is copying data to and from Kafka.
• Kafka Streams.
• Incorporated into Kafka and works with it.
• https://www.confluent.io/product/kafka-streams/
• Input data are transformed into output data.
Replication to Big Data.
To Flume.
[Diagram: OGG Trail → Replicat (Manager) → Flume agent (source → channel → sink) → HDFS]
Replication to Big Data.
Some notes about Flume handler:
• Configuration file path.
• Failover and load balancing (from
1.6.x for Avro).
• Kerberos security (from 1.6.x for
Thrift).
• Version 1.4+. (1.6.x better)
• Avro and Thrift formats.
• Schema changes go to a different
Flume event.
Replication to Big Data.
To HBase. MongoDB. Cassandra.
[Diagram: OGG Trail → Replicat (Manager) → HBase client → HBase on HDFS]
Replication to Big Data.
Primary key for a source table in HBase.
orcl> alter table ggtest.test_tab_2 add constraint pk_test_tab_2 primary key (pk_id);
Table altered.
orcl> insert into ggtest.test_tab_2 values(9,'PK_TEST',sysdate,null);
• Having supplemental logging for all columns:
•Row id as concatenation of values for all columns.
• Adding a primary key and supplemental logging for keys:
•Row id is primary key.
Replication to Big Data.
Two cases. All columns supplemental logging and PK supplemental logging.
hbase(main):012:0> scan 'BDTEST:TEST_TAB_2'
ROW COLUMN+CELL
7|IJWQRO7T|2013-07-07:08:13:52 column=cf:ACC_DATE, timestamp=1459275116849, value=2013-07-07:08:13:52
7|IJWQRO7T|2013-07-07:08:13:52 column=cf:PK_ID, timestamp=1459275116849, value=7
7|IJWQRO7T|2013-07-07:08:13:52 column=cf:RND_STR_1, timestamp=1459275116849, value=IJWQRO7T
8|TEST_INS1|2016-03-29:15:14:37|TEST_ALTER column=cf:ACC_DATE, timestamp=1459278884047, value=2016-03-29:15:14:37
8|TEST_INS1|2016-03-29:15:14:37|TEST_ALTER column=cf:PK_ID, timestamp=1459278884047, value=8
8|TEST_INS1|2016-03-29:15:14:37|TEST_ALTER column=cf:RND_STR_1, timestamp=1459278884047, value=TEST_INS1
8|TEST_INS1|2016-03-29:15:14:37|TEST_ALTER column=cf:TEST_COL, timestamp=1459278884047, value=TEST_ALTER
9 column=cf:ACC_DATE, timestamp=1462473865704, value=2016-05-05:14:44:19
9 column=cf:PK_ID, timestamp=1462473865704, value=9
9 column=cf:RND_STR_1, timestamp=1462473865704, value=PK_TEST
9 column=cf:TEST_COL, timestamp=1462473865704, value=NULL
Replication to Big Data.
Some notes about HBase handler:
• hbase-site.xml in the classpath.
• GROUPTRANSOPS batching for
performance.
• Kerberos security (from 1.6.x for
Thrift).
• Version 1.1+. (0.98 with compat
parameter)
• HBase row key mapping.
• Only single column family.
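A replicat properties sketch for the HBase handler, again with 12.2-era property names and an illustrative classpath. The `cf` column family matches the scan output shown earlier:

```properties
gg.handlerlist=hbase
gg.handler.hbase.type=hbase
# Only a single column family is supported; 'cf' matches the scan above.
gg.handler.hbase.hBaseColumnFamilyName=cf
# tx mode lets GROUPTRANSOPS batch operations for performance.
gg.handler.hbase.mode=tx
gg.handler.hbase.includeTokens=false
# HBase client jars and hbase-site.xml must be on the classpath.
gg.classpath=/usr/lib/hbase/lib/*:/etc/hbase/conf
```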
Replication to Big Data.
Some notes about Cassandra
handler:
• Does not automatically create keyspaces (but creates tables).
• Primary keys are immutable.
• Inserts and Updates are different than in traditional databases.
• Primary key updates are deletes and inserts (watch compressed updates in OGG).
and MongoDB:
• Does not allow the _id column to be modified.
• Undo for bulk transactions (requires full before image for deletes in OGG).
• Primary key updates are deletes and inserts (watch compressed updates in OGG).
Other tools.
What can be used as an alternative.
• DBvisit.
• Replicate Connector for Kafka.
• Confluent version of Kafka.
• http://www.dbvisit.com/replicate_connector_for_kafka
• Has functionality similar to Oracle GoldenGate.
• Dell Shareplex.
• Replication to Kafka.
• Supports most datatypes.
• http://documents.software.dell.com/SharePlex/8.6.5/release-notes/system-requirements
• https://documents.software.dell.com/shareplex/8.6.4/administration-guide/configure-replication-to-open-target-targets/configure-replication-to-a-kafka-target
Back again
Using data from Big Data in Oracle
From Big Data to Oracle.
Tool                           Filtering  Data Conversion  Query  Offload query  Parallel
Sqoop                          Oracle     Oracle           -      -              Yes
ODBC gateway                   Hadoop     Oracle           Yes    Yes            -
Oracle Loader for HDFS         Oracle     Hadoop           -      -              Yes
Oracle SQL Connector for HDFS  Oracle     Oracle           Yes    -              Yes
BD SQL                         Hadoop     Hadoop           Yes    Yes            Yes
ODI                            KM         KM               Yes    Yes            -
Gluent                         Hadoop     Hadoop           Yes    Yes            Yes
Oracle Data Integrator.
ODI vs traditional ETL.
[Diagram: Traditional ETL — Source I, Source II → Extract → Transform (intermediate staging and transformation) → Load → Target]
[Diagram: ELT — Source I, Source II → Extract → Load → Transform (in the load engine) → Target]
Oracle Data Integrator.
How it looks.
[Diagram: ODI Studio and an ODI Agent on WebLogic, backed by Master and Work repositories; sources Oracle DB I, Oracle DB II, HDFS and CSV text feed a Staging schema and a Target schema]
Oracle Data Integrator.
ODI workflow.
[Diagram: Designer — Projects I and II with their Data models; Logical view — Logical schemas I and II; Contexts (Prod Batch, Prod Stream, Test) map them to Physical PROD and Test environments]
Oracle Data Integrator.
ODI logical mapping
• Logical mapping is separated from physical implementation.
Oracle Data Integrator.
ODI physical mapping.
• Different physical mappings for the same logical map.
• Different knowledge modules.
ODI
Oracle Data Integrator.
• Knowledge modules for most
platforms.
• Uses filters, mappings, joins and
constraints.
• Batch or event (stream) oriented integration.
• ELT:
• Extract.
• Load.
• Transform.
• Logical and physical models separation.
• Knowledge modules:
• RKM - reverse engineering.
• LKM - load.
• IKM - integration.
Oracle Big Data SQL
Architecture.
[Diagram: Oracle Database with BD SQL ↔ Hadoop cluster (CDH or HDP) running the BD SQL agent/service and a management server]
Exadata technology for Big Data
Oracle Big Data SQL.
What it can do for us.
• Smart scan:
• Storage indexes.
• Bloom filters.
• Works with:
• Apache Hive.
• HDFS
• Oracle NoSQL Database.
• Apache HBase
• Fully supports SQL syntax.
• Predicate push down.
Oracle Big Data SQL.
Views and packages to use.
orcl> select cluster_id,database_name, owner, table_name from all_hive_tables where database_name='bdtest';
CLUSTER_ID DATABASE_NAME OWNER TABLE_NAME
bigdatalite bdtest oracle test_tab_1
bigdatalite bdtest oracle test_tab_2
PROCEDURE CREATE_EXTDDL_FOR_HIVE
Argument Name Type In/Out Default?
------------------------------ ----------------------- ------ --------
CLUSTER_ID VARCHAR2 IN
DB_NAME VARCHAR2 IN
HIVE_TABLE_NAME VARCHAR2 IN
HIVE_PARTITION PL/SQL BOOLEAN IN
TABLE_NAME VARCHAR2 IN
PERFORM_DDL PL/SQL BOOLEAN IN DEFAULT
TEXT_OF_DDL CLOB OUT
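Using the signature above, generating the external-table DDL for one of the Hive tables from the query output might look like this; the PL/SQL variable handling is illustrative:

```sql
DECLARE
  l_ddl CLOB;
BEGIN
  -- Generate DDL for an Oracle external table over the Hive table.
  -- PERFORM_DDL => TRUE would also execute the DDL in one step.
  DBMS_HADOOP.CREATE_EXTDDL_FOR_HIVE(
    CLUSTER_ID      => 'bigdatalite',
    DB_NAME         => 'bdtest',
    HIVE_TABLE_NAME => 'test_tab_2',
    HIVE_PARTITION  => FALSE,
    TABLE_NAME      => 'TEST_TAB_2',
    PERFORM_DDL     => FALSE,
    TEXT_OF_DDL     => l_ddl);
  DBMS_OUTPUT.PUT_LINE(l_ddl);
END;
/
```

The generated DDL is essentially the CREATE TABLE … ORGANIZATION EXTERNAL statement shown on the next slide.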
Oracle Big Data SQL.
Building the external table.
• Hive table can be queried in
Oracle now.
CREATE TABLE test_tab_2 ( tran_flag VARCHAR2(4000), tab_name VARCHAR2(4000), ………..)
ORGANIZATION EXTERNAL (TYPE ORACLE_HIVE DEFAULT DIRECTORY DEFAULT_DIR ACCESS PARAMETERS
( com.oracle.bigdata.cluster=bigdatalite
com.oracle.bigdata.tablename=bdtest.test_tab_2) ) PARALLEL 2 REJECT LIMIT UNLIMITED
TRAN_FLAG ID RND_STR USE_DATE
---------- ---------- ---------- --------------------
I 1 BGBXRKJL 2016-02-13:08:34:19
I 2 FNMCEPWE 2015-08-17:04:50:18
I 1 BGBXRKJL 2016-02-13:08:34:19
Oracle Big Data SQL.
What it can do for us.
• External table FULL access.
Oracle Big Data SQL.
Options for BD SQL.
• Oracle Copy to Hadoop utility.
•Using Data Pump format.
•ORACLE_DATAPUMP access driver.
•CTAS to external table.
•hadoop fs -put …
•Hive “INPUTFORMAT
'oracle.hadoop.hive.datapump.DPInputFormat'”
•Oracle Shell for Hadoop Loaders
• ACCESS PARAMETERS :
• com.oracle.bigdata.overflow
• com.oracle.bigdata.tablename
• com.oracle.bigdata.colmap
• com.oracle.bigdata.erroropt
• Directly to HDFS :
• TYPE oracle_hdfs.
Other tools.
What can be used as an alternative.
• Gluent Data Platform.
• Can offload data to Hadoop.
• Allows querying data from any data source in Hadoop.
• Offload engine.
• Supports Cloudera, Hortonworks, or any Hadoop with the Impala or Hive SQL engine installed.
https://gluent.com/
And here we have returned.
Q&A
THANK YOU
Email: otochkin@pythian.com
https://pythian.com/blog
@sky_vst
