- The document discusses Oracle tools for extracting, transforming, and loading (ETL) big data from Hadoop into Oracle databases, including Oracle Data Integrator 12c, Oracle Loader for Hadoop, and Oracle Direct Connector for HDFS.
- It provides an overview of using Hadoop for ETL tasks like data loading, processing, and exporting data to structured databases, as well as tools like Hive, Pig, and Spark for these functions.
- Key benefits of the Oracle Hadoop connectors include pushing data transformations to Hadoop clusters for scale and leveraging SQL interfaces to access Hadoop data for business intelligence.
1. Deep-Dive into Big Data ETL with ODI12c and Oracle Big Data Connectors
Mark Rittman, CTO, Rittman Mead
Oracle Openworld 2014, San Francisco
T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or
+61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India)
E : info@rittmanmead.com
W : www.rittmanmead.com
2. About the Speaker
•Mark Rittman, Co-Founder of Rittman Mead
•Oracle ACE Director, specialising in Oracle BI&DW
•14 Years Experience with Oracle Technology
•Regular columnist for Oracle Magazine
•Author of two Oracle Press Oracle BI books
•Oracle Business Intelligence Developers Guide
•Oracle Exalytics Revealed
•Writer for Rittman Mead Blog :
http://www.rittmanmead.com/blog
•Email : mark.rittman@rittmanmead.com
•Twitter : @markrittman
3. About Rittman Mead
•Oracle BI and DW Gold partner
•Winner of five UKOUG Partner of the Year awards in 2013 - including BI
•World leading specialist partner for technical excellence,
solutions delivery and innovation in Oracle BI
•Approximately 80 consultants worldwide
•All expert in Oracle BI and DW
•Offices in US (Atlanta), Europe, Australia and India
•Skills in broad range of supporting Oracle tools:
‣OBIEE, OBIA
‣ODIEE
‣Essbase, Oracle OLAP
‣GoldenGate
‣Endeca
4. Traditional Data Warehouse / BI Architectures
•Three-layer architecture - staging, foundation and access/performance
•All three layers stored in a relational database (Oracle)
•ETL used to move data from layer-to-layer
[Diagram: Traditional Relational Data Warehouse - traditional structured data sources load into the Staging layer, with ETL moving data on to the Foundation / ODS layer and then the Performance / Dimensional layer; a BI tool (OBIEE) with metadata layer reads directly, while an OLAP / in-memory tool loads data into its own database]
5. Introducing Hadoop
•A new approach to data processing and data storage
•Rather than a small number of large, powerful servers, it spreads processing over
large numbers of small, cheap, redundant servers
•Spreads the data you’re processing over
lots of distributed nodes
•Has a scheduling/workload process (the Job Tracker) that sends parts of a job to each of the nodes
- a bit like Oracle Parallel Execution
•And does the processing where the data sits
- a bit like Exadata storage servers
•Shared-nothing architecture
•Low-cost and highly horizontally scalable
[Diagram: a Job Tracker distributing work to Task Trackers running alongside Data Nodes]
6. Hadoop Tenets : Simplified Distributed Processing
•Hadoop, through MapReduce, breaks processing down into simple stages
‣Map : select the columns and values you’re interested in, pass through as key/value pairs
‣Reduce : aggregate the results
•Most ETL jobs can be broken down into filtering,
projecting and aggregating
•Hadoop then automatically runs job on cluster
‣Shared-nothing small chunks of work
‣Run the job on the node where the data is
‣Handle faults etc
‣Gather the results back in
[Diagram: Mappers (filter, project) feed Reducers (aggregate); output is one HDFS file per reducer, in a directory]
7. HDFS: Low-Cost, Clustered, Fault-Tolerant Storage
•The filesystem behind Hadoop, used to store data for Hadoop analysis
‣Unix-like, uses commands such as ls, mkdir, chown, chmod
•Fault-tolerant, with rapid fault detection and recovery
•High-throughput, with streaming data access and large block sizes
•Designed for data-locality, placing data close to where it is processed
•Accessed from the command-line, via internet (hdfs://), GUI tools etc
[oracle@bigdatalite mapreduce]$ hadoop fs -mkdir /user/oracle/my_stuff
[oracle@bigdatalite mapreduce]$ hadoop fs -ls /user/oracle
Found 5 items
drwx------ - oracle hadoop 0 2013-04-27 16:48 /user/oracle/.staging
drwxrwxrwx - oracle hadoop 0 2012-09-18 17:02 /user/oracle/moviedemo
drwxrwxrwx - oracle hadoop 0 2012-10-17 15:58 /user/oracle/moviework
drwxrwxrwx - oracle hadoop 0 2013-05-03 17:49 /user/oracle/my_stuff
drwxrwxrwx - oracle hadoop 0 2012-08-10 16:08 /user/oracle/stage
8. Oracle’s Big Data Products
•Oracle Big Data Appliance - Engineered System for Big Data Acquisition and Processing
‣Cloudera Distribution of Hadoop
‣Cloudera Manager
‣Open-source R
‣Oracle NoSQL Database Community Edition
‣Oracle Enterprise Linux + Oracle JVM
‣New - Oracle Big Data SQL
•Oracle Big Data Connectors
‣Oracle Loader for Hadoop (Hadoop > Oracle RDBMS)
‣Oracle Direct Connector for HDFS (HDFS > Oracle RDBMS)
‣Oracle Data Integration Adapter for Hadoop
‣Oracle R Connector for Hadoop
‣Oracle NoSQL Database (column/key-store DB based on BerkeleyDB)
9. Moving Data In, Around and Out of Hadoop
•Three stages to Hadoop ETL work, with dedicated Apache / other tools
‣Load : receive files in batch, or in real-time (logs, events)
‣Transform : process & transform data to answer questions
‣Store / Export : store in structured form, or export to RDBMS using Sqoop
[Diagram: real-time logs / events, file / unstructured imports and RDBMS imports feed the Loading stage; data passes through the Processing stage to the Store / Export stage, which produces file exports and RDBMS exports]
10. “ETL Offloading”
•Special use-case : offloading low-value, simple ETL work to a Hadoop cluster
‣Receiving, aggregating, filtering and pre-processing data for an RDBMS data warehouse
‣Potentially free up high-value Exadata / RDBMS servers for analytic work
11. Core Apache Hadoop Tools
•Apache Hadoop, including MapReduce and HDFS
‣Scalable, fault-tolerant file storage for Hadoop
‣Parallel programming framework for Hadoop
•Apache Hive
‣SQL abstraction layer over HDFS
‣Perform set-based ETL within Hadoop
•Apache Pig, Spark
‣Dataflow-type languages over HDFS, Hive etc
‣Extensible through UDFs, streaming etc
•Apache Flume, Apache Sqoop, Apache Kafka
‣Real-time and batch loading into HDFS
‣Modular, fault-tolerant, wide source/target coverage
12. Hive as the Hadoop “Data Warehouse”
•MapReduce jobs are typically written in Java, but Hive can make this simpler
•Hive is a query environment over Hadoop/MapReduce to support SQL-like queries
•Hive server accepts HiveQL queries via HiveODBC or HiveJDBC, automatically
creates MapReduce jobs against data previously loaded into the Hive HDFS tables
•Approach used by ODI and OBIEE
to gain access to Hadoop data
•Allows Hadoop data to be accessed just like
any other data source (sort of...)
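•For example, a simple HiveQL aggregate like this sketch (table and column names are assumptions, not from the demo) is what the Hive server compiles into MapReduce jobs:
SELECT category, COUNT(*) AS page_views
FROM hive_raw_apache_access_log
GROUP BY category;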
13. How Hive Provides SQL Access over Hadoop
•Hive uses an RDBMS metastore to hold
table and column definitions in schemas
•Hive tables then map onto HDFS-stored files
‣Managed tables
‣External tables
•Oracle-like query optimizer, compiler,
executor
•JDBC and ODBC drivers,
plus CLI etc
[Diagram: the Hive Driver (compile, optimize, execute) and Metastore sit over HDFS; managed tables under /user/hive/warehouse/ hold HDFS or local files loaded into the Hive HDFS area using the HiveQL CREATE TABLE command, while external tables (e.g. /user/oracle/, /user/movies/data/) are HDFS files loaded by an external process, then mapped into Hive using the CREATE EXTERNAL TABLE command]
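•As a sketch of the two table types (table names, columns and the HDFS location are illustrative assumptions):
-- Managed table : Hive owns the data under /user/hive/warehouse/
CREATE TABLE access_log_managed (host STRING, request_date STRING, request STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
-- External table : data stays where an external process put it in HDFS
CREATE EXTERNAL TABLE access_log_external (host STRING, request_date STRING, request STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/oracle/access_log/';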
14. Oracle Loader for Hadoop
•Oracle technology for accessing Hadoop data, and loading it into an Oracle database
•Pushes data transformation, “heavy lifting” to the Hadoop cluster, using MapReduce
•Direct-path loads into Oracle Database, partitioned and non-partitioned
•Online and offline loads
•Key technology for fast load of
Hadoop results into Oracle DB
15. Oracle Direct Connector for HDFS
•Enables HDFS as a data-source for Oracle Database external tables
•Effectively provides Oracle SQL access over HDFS
•Supports data query, or import into Oracle DB
•Treat HDFS-stored files in the same way as regular files
‣But with HDFS’s low-cost
‣… and fault-tolerance
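•A hedged sketch of what an ODCH-style external table looks like (directory objects, columns and the location file name are assumptions for illustration; the hdfs_stream preprocessor ships with ODCH):
-- External table over HDFS via Oracle Direct Connector for HDFS (sketch);
-- hdfs_stream streams HDFS file content into the table at query time,
-- and the location file (generated by ODCH) lists the HDFS paths
CREATE TABLE movie_fact_ext (movie_id NUMBER, cust_id NUMBER, rating NUMBER)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY ext_data_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    PREPROCESSOR hdfs_bin_dir:'hdfs_stream'
    FIELDS TERMINATED BY ','
  )
  LOCATION ('movie_fact.loc')
)
REJECT LIMIT UNLIMITED;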
16. Oracle R Advanced Analytics for Hadoop
•Add-in to R that extends capability to Hadoop
•Gives R the ability to create Map and Reduce functions
•Extends R data frames to include Hive tables
‣Automatically run R functions on Hadoop
by using Hive tables as source
17. Just Released - Oracle Big Data SQL
•Part of Oracle Big Data Appliance 4.0 (BDA-only)
‣Also requires Oracle Database 12c, Oracle Exadata Database Machine
•Extends Oracle Data Dictionary to cover Hive
•Extends Oracle SQL and SmartScan to Hadoop
•Extends Oracle Security Model over Hadoop
‣Fine-grained access control
‣Data redaction, data masking
[Diagram: SQL queries run across the Exadata Database Server, Exadata Storage Servers and the Hadoop cluster, with SmartScan applied on both storage tiers via Oracle Big Data SQL]
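•A sketch of how a Hive table might surface to Oracle SQL under Big Data SQL (the ORACLE_HIVE access driver is documented; table, column and directory names here are illustrative assumptions):
CREATE TABLE movielog_ext (username VARCHAR2(100), movie_id NUMBER, activity NUMBER)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_HIVE
  DEFAULT DIRECTORY default_dir
  ACCESS PARAMETERS (com.oracle.bigdata.tablename: default.movielog)
)
REJECT LIMIT UNLIMITED;
-- Hadoop data can then join to regular Oracle tables in one query
SELECT c.cust_segment, COUNT(*)
FROM customers c JOIN movielog_ext m ON c.username = m.username
GROUP BY c.cust_segment;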
18. Bringing it All Together : Oracle Data Integrator 12c
•ODI provides an excellent framework for running Hadoop ETL jobs
‣ELT approach pushes transformations down to Hadoop - leveraging power of cluster
•Hive, HBase, Sqoop and OLH/ODCH KMs provide native Hadoop loading / transformation
‣Whilst still preserving RDBMS push-down
‣Extensible to cover Pig, Spark etc
•Process orchestration
•Data quality / error handling
•Metadata and model-driven
19. The Key to ODI Extensibility - Knowledge Modules
•Divides the ETL process into separate steps - extract (load), integrate, check constraints etc
•ODI generates native code for each platform, taking a template for each step + adding
table names, column names, join conditions etc
‣Easy to extend
‣Easy to read the code
‣Makes it possible for ODI to
support Spark, Pig etc in future
‣Uses the power of the target
platform for integration tasks
20. Part of the Wider Oracle Data Integration Platform
•Oracle Data Integrator for large-scale data integration across heterogeneous sources and targets
•Oracle GoldenGate for heterogeneous data replication and changed data capture
•Oracle Enterprise Data Quality for data profiling and cleansing
•Oracle Data Services Integrator
for SOA message-based
data federation
21. ODI and Big Data Integration Example
•In this example, we’ll show an end-to-end ETL process on Hadoop using ODI12c & BDA
•Scenario: load webserver log data into Hadoop; process, enhance and aggregate;
then load the final summary table into Oracle Database 12c
‣Process using Hadoop framework
‣Leverage Big Data Connectors
‣Metadata-based ETL development
using ODI12c
‣Real-world example
22. ETL & Data Flow through BDA System
•Five-step process to load, transform, aggregate and filter incoming log data
•Leverage ODI’s capabilities where possible
•Make use of Hadoop power + scalability
[Diagram: (1) Apache HTTP Server log files are shipped by Flume agents (Flume messaging on TCP port 4545 in the example) into HDFS, and IKM File to Hive loads them into the hive_raw_apache_access_log Hive table using a RegEx SerDe; (2) IKM Hive Control Append joins the log table to the log_entries and post_detail Hive tables and loads the target Hive table; (3) IKM Hive Control Append joins in the posts and Sqoop-extracted categories_sql_extract Hive tables; (4) IKM Hive Transform geocodes rows by Hive streaming through a Python script, using a geocoding IP>Country list Hive table; (5) IKM File / Hive to Oracle bulk-unloads the summary to the Oracle DB]
23. ETL Considerations : Using Hive vs. Regular Oracle SQL
•Not all join types are available in Hive - joins must be equality joins
•No sequences, no primary keys on tables
•Generally need to stage Oracle or other external data into Hive before joining to it
•Hive latency - not good for small microbatch-type work
‣But other alternatives exist - Spark, Impala etc
•Hive is INSERT / APPEND only - no updates, deletes etc
‣But HBase may be suitable for CRUD-type loading
•Don’t assume that HiveQL == Oracle SQL
‣Test assumptions before committing to platform
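•To make the join restriction concrete (a sketch; table and column names assumed):
-- Supported : ANSI-syntax equi-join
SELECT l.host, p.title
FROM log_entries l JOIN posts p ON l.post_id = p.post_id;
-- Not supported : non-equality join predicates, e.g. a range join with BETWEEN
-- SELECT ... FROM logs l JOIN geoip g
--   ON l.ip_int BETWEEN g.range_start AND g.range_end;  -- invalid in HiveQL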
24. Five-Step ETL Process
1. Take the incoming log files (via Flume) and load into a structured Hive table
2. Enhance data from that table to include details on authors, posts from other Hive tables
3. Join to some additional ref. data held in an Oracle database, to add author details
4. Geocode the log data, so that we have the country for each calling IP address
5. Output the data in summary form to an Oracle database
25. Using Flume to Transport Log Files to BDA
•Apache Flume is the standard way to transport log files from source through to target
•Initial use-case was webserver log files, but can transport any file from A>B
•Does not do data transformation, but can send to multiple targets / target types
•Mechanisms and checks to ensure successful transport of entries
•Has a concept of “agents”, “sinks” and “channels”
•Agents collect and forward log data
•Sinks store it in final destination
•Channels store log data en-route
•Simple configuration through INI files
•Handled outside of ODI12c
26. GoldenGate for Continuous Streaming to Hadoop
•Oracle GoldenGate is also an option, for streaming RDBMS transactions to Hadoop
•Leverages GoldenGate & HDFS / Hive Java APIs
•Sample Implementations on MOS Doc.ID 1586210.1 (HDFS) and 1586188.1 (Hive)
•Likely to be formal part of GoldenGate in future release - but usable now
27. Load Incoming Log Files into Hive Table
•First step in process is to load the incoming log files into a Hive table
‣Also need to parse the log entries to extract request, date, IP address etc columns
‣Hive table can then easily be used in
downstream transformations
•Use IKM File to Hive (LOAD DATA) KM
‣Source can be local files or HDFS
‣Either load file into Hive HDFS area,
or leave as external Hive table
‣Ability to use SerDe to parse file data
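•When loading into the Hive HDFS area, the KM ultimately issues HiveQL along these lines (path and table name are assumptions for illustration):
-- Moves the HDFS file into the table's Hive warehouse directory
LOAD DATA INPATH '/user/flume/incoming/access_log' INTO TABLE raw_apache_access_log;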
28. First Though … Need to Setup Topology and Models
•HDFS data servers (source) defined using generic File technology
•Workaround to support IKM Hive Control Append
•Leave JDBC driver blank, put HDFS URL in JDBC URL field
29. Defining Physical Schema and Model for HDFS Directory
•Hadoop processes typically access a whole directory of files in HDFS, rather than single one
•Hive, Pig etc aggregate all files in that directory and treat as single file
•ODI Models usually point to a single file though -
how do you set up access correctly?
30. Defining Topology and Model for Hive Sources
•Hive supported “out-of-the-box” with ODI12c (but requires ODIAAH license for KMs)
•Most recent Hadoop distributions use HiveServer2 rather than HiveServer
•Need to ensure JDBC drivers support Hive version
•Use correct JDBC URL format (jdbc:hive2://…)
31. Final Model and Datastore Definitions
•HDFS files for incoming log data, and any other input data
•Hive tables for ETL targets and downstream processing
•Use RKM Hive to reverse-engineer column definition from Hive
32. Using IKM File to Hive to Load Web Log File Data into Hive
•Create mapping to load file source (single column for weblog entries) into Hive table
•Target Hive table should have column for incoming log row, and parsed columns
33. Specifying a SerDe to Parse Incoming Hive Data
•SerDe (Serializer-Deserializer) interfaces give Hive the ability to process new file formats
•Distributed as JAR file, gives Hive ability to parse semi-structured formats
•We can use the RegEx SerDe to parse the Apache CombinedLogFormat file into columns
•Enabled through OVERRIDE_ROW_FORMAT IKM File to Hive (LOAD DATA) KM option
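•As an illustration of what that row-format override amounts to in HiveQL (the RegexSerDe class and combined-log regex are as documented for Hive; table name and location are assumptions):
CREATE EXTERNAL TABLE raw_apache_access_log (
  host STRING, identity STRING, user STRING, time STRING, request STRING,
  status STRING, size STRING, referer STRING, agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?")
LOCATION '/user/flume/weblogs/';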
34. Executing First ODI12c Mapping
•EXTERNAL_TABLE option chosen in IKM File to Hive (LOAD DATA) as Flume will continue writing to it until the source log rotates
•View results of data load in ODI Studio
35. Join to Additional Hive Tables, Transform using HiveQL
•IKM Hive to Hive Control Append can be used to perform Hive table joins, filtering, agg. etc.
•INSERT only, no DELETE, UPDATE etc
•Not all ODI12c mapping operators supported, but basic functionality works OK
•Use this KM to join to other Hive tables,
adding more details on post, title etc
•Perform DISTINCT on join output, load
into summary Hive table
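•The generated HiveQL looks something like this sketch (table and column names assumed):
INSERT INTO TABLE access_per_post
SELECT DISTINCT l.host, l.request_date, p.post_id, p.title, p.author
FROM hive_raw_apache_access_log l
JOIN posts p ON l.page_url = p.page_url;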
36. Joining Hive Tables
•Only equi-joins supported
•Must use ANSI syntax
•More complex joins may not produce
valid HiveQL (subqueries etc)
37. Filtering, Aggregating and Transforming Within Hive
•Aggregate (GROUP BY), DISTINCT, FILTER, EXPRESSION, JOIN, SORT etc mapping
operators can be added to mapping to manipulate data
•Generates HiveQL functions, clauses etc
38. Executing Second Mapping
•ODI IKM Hive to Hive Control Append generates HiveQL to perform data loading
•In the background, Hive on BDA creates MapReduce job(s) to load and transform HDFS data
•Automatically runs across the cluster, in parallel and with fault tolerance, HA
39. Bring in Reference Data from Oracle Database
•In this third step, additional reference data from Oracle Database needs to be added
•In theory, should be able to add Oracle-sourced datastores to mapping and join as usual
•But … Oracle / JDBC-generic LKMs don’t work with Hive
40. Options for Importing Oracle / RDBMS Data into Hadoop
•Could export RDBMS data to file, and load using IKM File to Hive
•Oracle Big Data Connectors only export to Oracle, not import to Hadoop
•Best option is to use Apache Sqoop, and new
IKM SQL to Hive-HBase-File knowledge module
•Hadoop-native, automatically runs in parallel
•Uses native JDBC drivers, or OraOop (for example)
•Bi-directional in-and-out of Hadoop to RDBMS
•Run from OS command-line
41. Loading RDBMS Data into Hive using Sqoop
•First step is to stage Oracle data into equivalent Hive table
•Use special LKM SQL Multi-Connect Global load knowledge module for Oracle source
‣Passes responsibility for load (extract) to following IKM
•Then use IKM SQL to Hive-HBase-File (Sqoop) to load the Hive table
42. Join Oracle-Sourced Hive Table to Existing Hive Table
•Oracle-sourced reference data in Hive can then be joined to existing Hive table as normal
•Filters, aggregation operators etc can be added to mapping if required
•Use IKM Hive Control Append as integration KM
43. ODI Static and Flow Control : Data Quality and Error Handling
•CKM Hive can be used with IKM Hive to Hive Control Append to filter out erroneous data
•Static controls can be used to create “data firewalls”
•Flow control used in Physical mapping view to handle errors, exceptions
•Example: Filter out rows where IP address is from a test harness
44. Enabling Flow Control in IKM Hive to Hive Control Append
•Check the ENABLE_FLOW_CONTROL option in KM settings
•Select CKM Hive as the check knowledge module
•Erroneous rows will get moved to E_ table in Hive, not loaded into target Hive table
45. Using Hive Streaming and Python for Geocoding Data
•Another requirement we have is to “geocode” the webserver log entries
•Allows us to aggregate page views by country
•Based on the fact that IP ranges can usually be attributed to specific countries
•Not functionality normally found in Hive etc, but can be done with add-on APIs
46. How GeoIP Geocoding Works
•Uses free Geocoding API and database from Maxmind
•Convert IP address to an integer
•Find which integer range our IP address sits within
•But Hive can’t use BETWEEN in a join…
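•As a worked sketch of the conversion (table and column names assumed): IP a.b.c.d weights as a*16777216 + b*65536 + c*256 + d, so 81.2.3.4 becomes 1,359,086,340; the country is then the range row where range_start <= ip_int <= range_end
SELECT (CAST(split(ip,'\\.')[0] AS BIGINT) * 16777216)
     + (CAST(split(ip,'\\.')[1] AS BIGINT) * 65536)
     + (CAST(split(ip,'\\.')[2] AS BIGINT) * 256)
     +  CAST(split(ip,'\\.')[3] AS BIGINT) AS ip_int
FROM log_entries;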
47. Solution : IKM Hive Transform
•IKM Hive Transform can pass the output of a Hive SELECT statement through
a perl, python, shell etc script to transform content
•Uses Hive TRANSFORM … USING … AS functionality
hive> add file file:///tmp/add_countries.py;
Added resource: file:///tmp/add_countries.py
hive> select transform (hostname,request_date,post_id,title,author,category)
> using 'add_countries.py'
> as (hostname,request_date,post_id,title,author,category,country)
> from access_per_post_categories;
48. Creating the Python Script for Hive Streaming
•Solution requires a Python API to be installed on all Hadoop nodes, along with geocode DB
wget https://raw.github.com/pypa/pip/master/contrib/get-pip.py
python get-pip.py
pip install pygeoip
•Python script then parses incoming stdin lines using tab-separation of fields, outputs same
(but with extra field for the country)
#!/usr/bin/python
import sys
sys.path.append('/usr/lib/python2.6/site-packages/')
import pygeoip

# Look up the country for each IP (hostname field) in the Maxmind GeoIP database
gi = pygeoip.GeoIP('/tmp/GeoIP.dat')
for line in sys.stdin:
    line = line.rstrip()
    hostname, request_date, post_id, title, author, category = line.split('\t')
    country = gi.country_name_by_addr(hostname)
    print hostname+'\t'+request_date+'\t'+post_id+'\t'+title+'\t'+author+'\t'+country+'\t'+category
49. Setting up the Mapping
•Map source Hive table to target, which includes an extra “country” column
•Copy script + GeoIP.dat file to every node’s /tmp directory
•Ensure all Python APIs and libraries are installed on each Hadoop node
50. Configuring IKM Hive Transform
•TRANSFORM_SCRIPT_NAME specifies name of
script, and path to script
•TRANSFORM_SCRIPT has issues with parsing;
do not use, leave blank and KM will use existing one
•Optional ability to specify sort and distribution
columns (can be compound)
•Leave other options at default
51. Executing the Mapping
•KM automatically registers the script with Hive (which caches it on all nodes)
•HiveQL output then runs the contents of the first Hive table through the script, outputting
results to target table
52. Bulk Unload Summary Data to Oracle Database
•Final requirement is to unload final Hive table contents to Oracle Database
•Several use-cases for this:
•Use Hadoop / BDA for ETL offloading
•Use analysis capabilities of BDA, but then output results to RDBMS data mart or DW
•Permit use of more advanced SQL query tools
•Share results with other applications
•Can use Sqoop for this, or use Oracle Big Data Connectors
•Fast bulk unload, or transparent Oracle access to Hive
53. IKM File/Hive to Oracle (OLH/ODCH)
•KM for accessing HDFS/Hive data from Oracle
•Either sets up ODCH connectivity, or bulk-unloads via OLH
•Map from HDFS or Hive source to Oracle tables (via Oracle technology in Topology)
54. Configuring the KM Physical Settings
•For the access table in Physical view, change LKM to LKM SQL Multi-Connect
•Delegates the multi-connect capabilities to the downstream node, so you can use a multi-connect
IKM such as IKM File/Hive to Oracle
55. Configuring the KM Physical Settings
•For the target table, select IKM File/Hive to Oracle
•Only becomes available to select once
LKM SQL Multi-Connect selected for access table
•Key option values to set are:
•OLH_OUTPUT_MODE (use JDBC initially, OCI
if Oracle Client installed on Hadoop client node)
•MAPRED_OUTPUT_BASE_DIR (set to directory on HDFS that the OS user running ODI can access)
56. Executing the Mapping
•Executing the mapping will invoke
OLH from the OS command line
•Hive table (or HDFS file) contents
copied to Oracle table
57. Create Package to Sequence ETL Steps
•Define package (or load plan) within ODI12c to orchestrate the process
•Call package / load plan execution from command-line, web service call, or schedule
58. Execute Overall Package
•Each step executed in sequence
•End-to-end ETL process, using ODI12c’s metadata-driven development process,
data quality handling, heterogeneous connectivity, but Hadoop-native processing
59. Conclusions
•Hadoop, and the Oracle Big Data Appliance, is an excellent platform for data capture,
analysis and processing
•Hadoop tools such as Hive, Sqoop, MapReduce and Pig provide means to process and
analyse data in parallel, using languages + approach familiar to Oracle developers
•ODI12c provides several benefits when working with ETL and data loading on Hadoop
‣Metadata-driven design; data quality handling; KMs to handle technical complexity
•Oracle Data Integrator Adapter for Hadoop provides several KMs for Hadoop sources
•In this presentation, we’ve seen an end-to-end example of big data ETL using ODI
‣The power of Hadoop and BDA, with the ETL orchestration of ODI12c
60. Thank You for Attending!
•Thank you for attending this presentation; more information can be found at http://www.rittmanmead.com
•Contact us at info@rittmanmead.com or mark.rittman@rittmanmead.com
•Look out for our book, “Oracle Business Intelligence Developers Guide” out now!
•Follow-us on Twitter (@rittmanmead) or Facebook (facebook.com/rittmanmead)
61. Deep-Dive into Big Data ETL with ODI12c and Oracle Big Data Connectors
Mark Rittman, CTO, Rittman Mead
Oracle Openworld 2014, San Francisco