The simplest way to access external files or external data on a file system from within an Oracle
database is through an external table; see the Oracle documentation for an introduction to
external tables. External tables present data stored in a file system in table format and can be
used transparently in SQL queries. External tables could therefore be used to access data stored
in HDFS (the Hadoop Distributed File System) from inside the Oracle database. Unfortunately,
HDFS files are not directly accessible through the normal operating system calls that the external
table driver relies on. The FUSE (Filesystem in Userspace) project provides a workaround: a
number of FUSE drivers allow users to mount an HDFS store and treat it like a normal file
system. By using one of these drivers and mounting HDFS on the database instance (on every
instance in the case of a RAC database), HDFS files can be accessed easily through the external
table infrastructure.
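For illustration, a minimal sketch of what such an external table might look like follows. The
directory path, file name, and column layout are assumptions made for this example; the actual
DDL depends entirely on where the FUSE driver mounts HDFS and on how the data files are
formatted.

-- Hypothetical external table over a FUSE-mounted HDFS directory.
-- Mount point, file name, and columns are illustrative only.
CREATE OR REPLACE DIRECTORY hdfs_dir AS '/mnt/hdfs/user/hadoop/output';

CREATE TABLE hdfs_numbers_ext (
  a NUMBER,
  b NUMBER
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY hdfs_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY ','
  )
  LOCATION ('part-00000')
)
REJECT LIMIT UNLIMITED
PARALLEL;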
Figure 1. Accessing via External Tables with in-database MapReduce
In Figure 1 we are utilizing Oracle Database 11g to implement in-database MapReduce as
described in this article. In general, the parallel execution framework in Oracle Database 11g is
sufficient to run most of the desired operations in parallel directly from the external table.
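Continuing the earlier sketch, querying the HDFS-resident data in parallel is then an ordinary SQL
statement; the table name and degree of parallelism below are illustrative.

-- Hypothetical parallel aggregation over the external table sketched above.
SELECT /*+ PARALLEL(h, 8) */ MAX(h.a), AVG(h.b)
FROM hdfs_numbers_ext h;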
The external table approach may not be suitable in some cases (for example, if FUSE is
unavailable). Oracle table functions provide an alternative way to fetch data from Hadoop. The
example below outlines one way of doing this. At a high level we implement a table function that uses
the DBMS_SCHEDULER framework to asynchronously launch an external shell script that
submits a Hadoop MapReduce job. The table function and the mapper communicate using
Oracle's Advanced Queuing feature: the Hadoop mapper enqueues data into a common queue
while the table function dequeues data from it. Since this table function can be run in parallel,
additional logic is used to ensure that only one of the parallel slaves submits the external job.
Figure 2. Leveraging Table Functions for parallel processing
The queue provides load balancing: the table function runs in parallel inside the database, while
the Hadoop streaming job runs in parallel with its own, possibly different, degree of parallelism,
outside the control of Oracle's query coordinator.
As an example, we translated the architecture shown in Figure 2 into a working implementation.
Note that our example only shows a template implementation of using a table function to access
data stored in Hadoop; other, possibly better, implementations are clearly possible.
The following diagrams are a technically more accurate and more detailed representation of the
original schematic in Figure 2, explaining where and how we use the pieces of actual code that
follow:
Figure 3. Starting the Mapper jobs and retrieving data
In step 1 we figure out who gets to be the query coordinator. For this we use a simple
mechanism that writes records with the same key value into a table. The first insert wins and will
function as the query coordinator (QC) for the process. Note that the QC table function
invocation does play a processing role as well.
In step 2 this table function invocation (the QC) starts an asynchronous job using DBMS_SCHEDULER –
the Job Controller in Figure 3 – which then runs the synchronous bash script on the Hadoop
cluster. This bash script, called the launcher in Figure 3, starts the mapper processes (step 3) on
the Hadoop cluster.
The mapper processes process data and write it to a queue in step 5. In the current example we
chose a queue because it is available cluster-wide. For now we simply write any output directly
into the queue; you can achieve better performance by batching up the outputs and then moving
them into the queue in groups. You can obviously choose various other delivery mechanisms,
including pipes and relational tables.
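As a hedged illustration of that batching idea, the sketch below assumes a hypothetical
collection type (hadoop_row_tab) and wrapper procedure (enqueue_batch) that are not part of
the code in this paper; the mapper would collect rows and make one call per batch instead of
enqueuing row by row.

-- Hypothetical batching wrapper: the mapper accumulates rows and sends them
-- in a single call, reducing round trips to the database.
CREATE OR REPLACE TYPE hadoop_row_tab AS TABLE OF hadoop_row_obj;
/
CREATE OR REPLACE PROCEDURE enqueue_batch(p_rows IN hadoop_row_tab) IS
  eopt  dbms_aq.enqueue_options_t;
  mprop dbms_aq.message_properties_t;
  msgid RAW(16);
BEGIN
  eopt.visibility    := DBMS_AQ.IMMEDIATE;
  eopt.delivery_mode := DBMS_AQ.BUFFERED;
  FOR i IN 1 .. p_rows.COUNT LOOP
    DBMS_AQ.ENQUEUE('HADOOP_MR_QUEUE', eopt, mprop, p_rows(i), msgid);
  END LOOP;
  COMMIT;
END;
/

Whether batching pays off depends on row size and network latency; the single-row enqueue
used in the example code keeps the mapper simple.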
Step 6 is the dequeuing process, which is done by parallel invocations of the table function
running in the database. As these parallel invocations process data, it is served up to the query
that requested it. The table function leverages both the Oracle data and the data from the
queue, and thereby integrates data from both sources into a single result set for the end user(s).
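For example, a query along the following lines (a sketch only; the customers table and the join
on h.a are assumptions made for illustration) combines queue-fed Hadoop rows with data already
stored in Oracle in a single result set:

-- Hypothetical query joining queue-fed Hadoop data with an Oracle table.
-- The join condition (c.customer_id = h.a) is illustrative only.
SELECT c.cust_first_name, c.cust_last_name, SUM(h.b) AS hadoop_total
FROM table(hdfs_reader.read_from_hdfs_file
       (cursor
        (select /*+ FAKE cursor to kick start parallelism */ *
         from orders), '/home/hadoop/eq_test4.sh')) h
JOIN customers c ON c.customer_id = h.a
GROUP BY c.cust_first_name, c.cust_last_name;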
Figure 4. Monitoring the process
After the Hadoop-side processes (the mappers) are kicked off, the job monitor process keeps an
eye on the launcher script. Once the mappers have finished processing data on the Hadoop
cluster, the bash script finishes, as shown in Figure 4.
The job monitor watches a database scheduler queue and will notice that the shell script has
finished (step 7). It then checks the data queue for remaining data elements (step 8). As long as
data is present in the queue, the table function invocations keep on processing that data
(step 6).
Figure 5. Closing down the processing
When the queue is fully dequeued by the parallel invocations of the table function, the job
monitor terminates the queue (step 9 in Figure 5), ensuring the table function invocations in
Oracle stop. At that point all the data has been delivered to the requesting query.
The solution implemented in Figure 3 through Figure 5 uses the following code. All code was
tested on Oracle Database 11g and a 5-node Hadoop cluster. As with most whitepaper text, copy
these scripts into a text editor and ensure the formatting is correct before running them.
This script contains the set-up components. For example, the initial part of the script shows
the creation of the arbitrator table used in step 1 of Figure 3. In the example we use the always
popular OE sample schema.
connect oe/oe
-- Table to use as locking mechanism for the hdfs reader as
-- leveraged in Figure 3 step 1
DROP TABLE run_hdfs_read;
CREATE TABLE run_hdfs_read(pk_id number, status varchar2(100),
primary key (pk_id));
-- Object type used for AQ that receives the data
CREATE OR REPLACE TYPE hadoop_row_obj AS OBJECT (
a number,
b number);
/
connect / as sysdba
-- system job to launch external script
-- this job is used to eventually run the bash script
-- described in Figure 3 step 3
CREATE OR REPLACE PROCEDURE launch_hadoop_job_async(in_directory
IN VARCHAR2, id number) IS
cnt number;
BEGIN
begin DBMS_SCHEDULER.DROP_JOB ('ExtScript'||id, TRUE);
exception when others then null; end;
-- Run a script
DBMS_SCHEDULER.CREATE_JOB (
job_name => 'ExtScript' || id,
job_type => 'EXECUTABLE',
job_action => '/bin/bash',
number_of_arguments => 1
);
DBMS_SCHEDULER.SET_JOB_ARGUMENT_VALUE ('ExtScript' || id, 1,
in_directory);
DBMS_SCHEDULER.ENABLE('ExtScript' || id);
-- Wait till the job is done. This ensures the hadoop job is completed
loop
select count(*) into cnt
from DBA_SCHEDULER_JOBS where job_name = 'EXTSCRIPT'||id;
dbms_output.put_line('Scheduler Count is '||cnt);
if (cnt = 0) then
exit;
else
dbms_lock.sleep(5);
end if;
end loop;
-- Wait till the queue is empty and then drop it
-- as shown in Figure 5
-- The TF will get an exception and it will finish quietly
loop
select sum(c) into cnt
from
(
select enqueued_msgs - dequeued_msgs c
from gv$persistent_queues where queue_name =
'HADOOP_MR_QUEUE'
union all
select num_msgs+spill_msgs c
from gv$buffered_queues where queue_name =
'HADOOP_MR_QUEUE'
union all
select 0 c from dual
);
if (cnt = 0) then
-- Queue is done. stop it.
DBMS_AQADM.STOP_QUEUE ('HADOOP_MR_QUEUE');
DBMS_AQADM.DROP_QUEUE ('HADOOP_MR_QUEUE');
return;
else
-- Wait for a while
dbms_lock.sleep(5);
end if;
end loop;
END;
/
-- Grants needed to make hadoop reader package work
grant execute on launch_hadoop_job_async to oe;
grant select on v_$session to oe;
grant select on v_$instance to oe;
grant select on v_$px_process to oe;
grant execute on dbms_aqadm to oe;
grant execute on dbms_aq to oe;
connect oe/oe
-- Simple reader package to read a file containing two numbers
CREATE OR REPLACE PACKAGE hdfs_reader IS
-- Return type of pl/sql table function
TYPE return_rows_t IS TABLE OF hadoop_row_obj;
-- Checks if current invocation is serial
FUNCTION is_serial RETURN BOOLEAN;
-- Function to actually launch a Hadoop job
FUNCTION launch_hadoop_job(in_directory IN VARCHAR2, id in out
number) RETURN BOOLEAN;
-- Tf to read from Hadoop
-- This is the main processing code reading from the queue in
-- Figure 3 step 6. It also contains the code to insert into
-- the table in Figure 3 step 1
FUNCTION read_from_hdfs_file(pcur IN SYS_REFCURSOR, in_directory
IN VARCHAR2) RETURN return_rows_t
PIPELINED PARALLEL_ENABLE(PARTITION pcur BY ANY);
END;
/
CREATE OR REPLACE PACKAGE BODY hdfs_reader IS
-- Checks if current process is a px_process
FUNCTION is_serial RETURN BOOLEAN IS
c NUMBER;
BEGIN
SELECT COUNT (*) into c FROM v$px_process WHERE sid =
SYS_CONTEXT('USERENV','SESSIONID');
IF c <> 0 THEN
RETURN false;
ELSE
RETURN true;
END IF;
exception when others then
RAISE;
END;
FUNCTION launch_hadoop_job(in_directory IN VARCHAR2, id IN OUT
NUMBER) RETURN BOOLEAN IS
PRAGMA AUTONOMOUS_TRANSACTION;
instance_id NUMBER;
jname varchar2(4000);
BEGIN
if is_serial then
-- Get id by mixing instance # and session id
id := SYS_CONTEXT('USERENV', 'SESSIONID');
SELECT instance_number INTO instance_id FROM v$instance;
id := instance_id * 100000 + id;
else
-- Get id of the QC
SELECT ownerid into id from v$session where sid =
SYS_CONTEXT('USERENV', 'SESSIONID');
end if;
-- Create a row to 'lock' it so only one person does the job
-- schedule. Everyone else will get an exception
-- This is in Figure 3 step 1
INSERT INTO run_hdfs_read VALUES(id, 'RUNNING');
jname := 'Launch_hadoop_job_async';
-- Launch a job to start the hadoop job
DBMS_SCHEDULER.CREATE_JOB (
job_name => jname,
job_type => 'STORED_PROCEDURE',
job_action => 'sys.launch_hadoop_job_async',
number_of_arguments => 2
);
DBMS_SCHEDULER.SET_JOB_ARGUMENT_VALUE (jname, 1, in_directory);
DBMS_SCHEDULER.SET_JOB_ARGUMENT_VALUE (jname, 2, CAST (id AS
VARCHAR2));
DBMS_SCHEDULER.ENABLE('Launch_hadoop_job_async');
COMMIT;
RETURN true;
EXCEPTION
-- one of my siblings launched the job. Get out quietly
WHEN dup_val_on_index THEN
dbms_output.put_line('dup value exception');
RETURN false;
WHEN OTHERs THEN
RAISE;
END;
FUNCTION read_from_hdfs_file(pcur IN SYS_REFCURSOR, in_directory
IN VARCHAR2) RETURN return_rows_t
PIPELINED PARALLEL_ENABLE(PARTITION pcur BY ANY)
IS
PRAGMA AUTONOMOUS_TRANSACTION;
cleanup BOOLEAN;
payload hadoop_row_obj;
id NUMBER;
dopt dbms_aq.dequeue_options_t;
mprop dbms_aq.message_properties_t;
msgid raw(100);
BEGIN
-- Launch a job to kick off the hadoop job
cleanup := launch_hadoop_job(in_directory, id);
dopt.visibility := DBMS_AQ.IMMEDIATE;
dopt.delivery_mode := DBMS_AQ.BUFFERED;
loop
payload := NULL;
-- Get next row
DBMS_AQ.DEQUEUE('HADOOP_MR_QUEUE', dopt, mprop, payload,
msgid);
commit;
pipe row(payload);
end loop;
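-- Note: the dequeue loop above has no exit condition. It ends when the job
-- monitor stops and drops HADOOP_MR_QUEUE (Figure 5, step 9), causing
-- DBMS_AQ.DEQUEUE to raise an exception that is handled below.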
exception when others then
if cleanup then
delete run_hdfs_read where pk_id = id;
commit;
end if;
END;
END;
/
This short script is the outside-of-the-database controller shown in Figure 3, steps 3 and 4. It is
a synchronous step that remains on the system for as long as the Hadoop mappers are running.
#!/bin/bash
cd <HADOOP_HOME>
A="/net/scratch/java/jdk1.6.0_16/bin/java -classpath /home/hadoop:/home/hadoop/ojdbc6.jar StreamingEq"
bin/hadoop fs -rmr output
bin/hadoop jar ./contrib/streaming/hadoop-0.20.0-streaming.jar \
  -input input/nolist.txt -output output -mapper "$A" \
  -jobconf mapred.reduce.tasks=0
For this example we wrote a small and simple mapper process to be executed on our Hadoop
cluster. Undoubtedly more comprehensive mappers exist. This one converts a string into two
numbers and provides those to the queue in a row-by-row fashion.
// Simplified mapper example for Hadoop cluster
import java.sql.*;
//import oracle.jdbc.*;
//import oracle.sql.*;
import oracle.jdbc.pool.*;
//import java.util.Arrays;
//import oracle.sql.ARRAY;
//import oracle.sql.ArrayDescriptor;
public class StreamingEq {
public static void main (String args[]) throws Exception
{
// connect through driver
OracleDataSource ods = new OracleDataSource();
Connection conn = null;
Statement stmt = null;
ResultSet rset = null;
java.io.BufferedReader stdin;
String line;
int countArr = 0, i;
oracle.sql.ARRAY sqlArray;
oracle.sql.ArrayDescriptor arrDesc;
try {
ods.setURL("jdbc:oracle:thin:oe/oe@$ORACLE_INSTANCE");
conn = ods.getConnection();
}
catch (Exception e){
System.out.println("Got exception " + e);
if (rset != null) rset.close();
if (conn != null) conn.close();
System.exit(0);
return ;
}
System.out.println("connection works conn " + conn);
stdin = new java.io.BufferedReader(new
java.io.InputStreamReader(System.in));
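// Anonymous PL/SQL block executed for each input line: it enqueues one
// hadoop_row_obj(a, b) as a buffered message on HADOOP_MR_QUEUE, the queue
// that the table function in the database dequeues from.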
String query =
    "declare " +
    "  dopt dbms_aq.enqueue_options_t; " +
    "  mprop dbms_aq.message_properties_t; " +
    "  msgid raw(100); " +
    "begin " +
    "  dopt.visibility := DBMS_AQ.IMMEDIATE; " +
    "  dopt.delivery_mode := DBMS_AQ.BUFFERED; " +
    "  dbms_aq.enqueue('HADOOP_MR_QUEUE', dopt, mprop, hadoop_row_obj(?, ?), msgid); " +
    "  commit; " +
    "end;";
CallableStatement pstmt = null;
try {
pstmt = conn.prepareCall(query);
}
catch (Exception e){
System.out.println("Got exception " + e);
if (rset != null) rset.close();
if (conn != null) conn.close();
}
while ((line = stdin.readLine()) != null) {
if (line.replaceAll(" ", "").length() < 2)
continue;
String [] data = line.split(",");
if (data.length != 2)
continue;
countArr++;
pstmt.setInt(1, Integer.parseInt(data[0]));
pstmt.setInt(2, Integer.parseInt(data[1]));
pstmt.executeUpdate();
}
if (conn != null) conn.close();
System.exit(0);
return;
}
}
An example of running a select on the system as described above – leveraging the table function
processing – would be:
-- Set up phase for the data queue
execute DBMS_AQADM.CREATE_QUEUE_TABLE('HADOOP_MR_QUEUE', 'HADOOP_ROW_OBJ');
execute DBMS_AQADM.CREATE_QUEUE('HADOOP_MR_QUEUE', 'HADOOP_MR_QUEUE');
execute DBMS_AQADM.START_QUEUE ('HADOOP_MR_QUEUE');
-- Query being executed by an end user or BI tool
-- Note the hint is not a real hint, but a comment
-- to clarify why we use the cursor
select max(a), avg(b)
from table(hdfs_reader.read_from_hdfs_file
(cursor
(select /*+ FAKE cursor to kick start parallelism */ *
from orders), '/home/hadoop/eq_test4.sh'));
As the examples in this paper show, integrating a Hadoop system with Oracle Database 11g is
easy to do.
The approaches discussed in this paper allow customers to stream data directly from Hadoop
into an Oracle query. This avoids fetching the data onto a local file system, as well as
materializing it into an Oracle table, before accessing it in a SQL query.