5. Why Hadoop?
▪ Ever increasing volume of data
▪ Heavily regulated sector (Basel II/III, Solvency II)
▪ Need to produce compliance reports
▪ Competitive sector
▪ Need to create value, data identified as a great source of it
▪ Keep costs under control
▪ Fond of Open Source
▪ Engineers like big challenges
9. Types of logical storage
Virtual Storage Access Method (VSAM)
Record-oriented (fixed- or variable-length) indexed datasets
Physical Sequential (PS)
Record-oriented (fixed- or variable-length) datasets, not indexed
Can exist on different types of media
IBM DB2
Relational database server
10. Types of binary records stored
COBOL Records (conforming to a COPYBOOK)
DB2 'UNLOAD' Records (conforming to a DDL statement)
11. Types of data stored in HDFS
{Tab, Comma, ...} Separated Values
One-line records with multiple columns
Text
Line-oriented (e.g. logs)
Hadoop SequenceFiles
Block compressed
▪ Mostly BytesWritable key/value
▪ COBOL records
▪ DB2 unloaded records
▪ Serialized Thrift structures
▪ Use of DefaultCodec (pure Java)
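A minimal sketch of how such a file can be produced with the standard Hadoop API, assuming block compression with DefaultCodec and BytesWritable key/value as described above; the path and record bytes are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.DefaultCodec;

public class WriteBlockCompressedSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/hdfs/data/example.seq");   // placeholder path

        // Block-compressed SequenceFile with the pure-Java DefaultCodec,
        // BytesWritable key and value, as used for COBOL / DB2 / Thrift records.
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, path,
                BytesWritable.class, BytesWritable.class,
                SequenceFile.CompressionType.BLOCK, new DefaultCodec());
        try {
            byte[] record = "raw record bytes".getBytes("ISO-8859-1"); // placeholder record
            writer.append(new BytesWritable(), new BytesWritable(record)); // empty key, record as value
        } finally {
            writer.close();
        }
    }
}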
13. Standard data transfer process
▪ On-the-fly charset conversion
▪ Record boundaries are lost
14. Hadoop data transfer process
▪ On-the-fly compression
▪ Original charset kept
▪ Record boundaries preserved
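A minimal sketch of the record-preserving idea, assuming a fixed record length; the real transfer is handled by the staging server and fileutil described on the next slides, so class and method names here are only illustrative. Records are read as raw bytes, so neither the charset nor the record boundaries are touched.

import java.io.DataInputStream;
import java.io.EOFException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class FixedLengthRecords {
    // Reads a fixed-record-length dataset as raw bytes: no charset conversion,
    // one byte[] per record, so record boundaries survive the transfer.
    public static List<byte[]> read(InputStream in, int recordLength) throws Exception {
        List<byte[]> records = new ArrayList<byte[]>();
        DataInputStream data = new DataInputStream(in);
        byte[] buffer = new byte[recordLength];
        while (true) {
            try {
                data.readFully(buffer);        // exactly one record
            } catch (EOFException end) {
                break;                         // end of dataset (a truncated trailing record is ignored here)
            }
            records.add(buffer.clone());       // each entry later becomes one SequenceFile value
        }
        return records;
    }
}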
15. Staging Server
▪ Gateway In & Out of an HDFS Cell
▪ Reads/Writes to /hdfs/staging/{in,out}/... (runs as hdfs)
▪ HTTP Based (POST/GET)
▪ Upload to http://hadoop-staging/put[/hdfs/staging/in/...]
Stores directly in HDFS, no intermediary storage
Multiple files support
Random target directory created if none specified
Parameters user, group, perm, suffix
curl -F "file=@local;filename=remote" "http://../put?user=foo&group=bar&perm=644&suffix=.test"
▪ Download from http://hadoop-staging/get/hdfs/staging/out/...
Ability to unpack SequenceFile records (unpack={base64,hex}) as key:value lines
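A minimal sketch of a programmatic download, assuming the GET endpoint and the unpack parameter shown above; the exact path under /hdfs/staging/out/ is only illustrative.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class StagingDownload {
    public static void main(String[] args) throws Exception {
        // unpack=hex turns each SequenceFile record into a hex-encoded key:value line
        URL url = new URL(
            "http://hadoop-staging/get/hdfs/staging/out/example/part-00000?unpack=hex");
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);      // one key:value line per record
            }
        } finally {
            in.close();
        }
    }
}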
16. fileutil
▪ Swiss Army Knife for SequenceFiles, HDFS Staging Server, ZooKeeper
▪ Written in Java, single jar
▪ Works in all our environments (z/OS, Unix, Windows, ...)
▪ Can be run using TWS/OPC on z/OS (via a JCL), $Universe on Unix, cron, ...
▪ Multiple commands
sfstage Convert a z/OS dataset to a SF and push it to the staging server
{stream,file}stage Push a stream or files to the staging server
filesfstage Convert a file to a SF (one record per block) and stage it
sfpack Pack key:value lines (cf unpack) in a SequenceFile
sfarchive Create a SequenceFile, one record per input file
zk{ls,lsr,cat,stat} Read data from ZooKeeper
get Retrieve data via URI
...
18. Data Organization
▪ Use of a directory structure that mimics the dataset names
Example: PR0172.PVS00.F7209588 (Environment / Silo / Application)
→ /hdfs/data/si/PR/01/72/PR0172.PVS00.F7209588.SUFFIX
▪ Group ACLs at the Environment/Silo/Application levels
▪ Suffix is mainly used to add .YYYYMM to Generation Data Groups
▪ Suffix added by the staging server
▪ DB2 Table unloads follow similar rules
P11DBA.T90XXXX
S4SDWH11.T4S02CTSC_H
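A minimal sketch of the dataset naming rule illustrated above, assuming the first qualifier always splits into three two-character Environment / Silo / Application components; the /hdfs/data/si prefix is taken from the example and the class and method names are ours.

public class DatasetPaths {
    // Derives the HDFS path from a z/OS dataset name, mimicking the convention above.
    static String toHdfsPath(String datasetName, String suffix) {
        String qualifier = datasetName.substring(0, datasetName.indexOf('.')); // "PR0172"
        String env  = qualifier.substring(0, 2);   // Environment, e.g. "PR"
        String silo = qualifier.substring(2, 4);   // Silo,        e.g. "01"
        String app  = qualifier.substring(4, 6);   // Application, e.g. "72"
        return "/hdfs/data/si/" + env + "/" + silo + "/" + app + "/" + datasetName + suffix;
    }

    public static void main(String[] args) {
        // Prints /hdfs/data/si/PR/01/72/PR0172.PVS00.F7209588.201106 (the GDG suffix is illustrative)
        System.out.println(toHdfsPath("PR0172.PVS00.F7209588", ".201106"));
    }
}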
19. Bastion Hosts
▪ Hadoop Cells are isolated, all accesses MUST go through a bastion host
▪ All accesses to the bastion hosts are authenticated via SSH keys
▪ Users log in with their own account
▪ No SSH port forwarding allowed
▪ All shell commands are logged
▪ Batches scheduled on bastion hosts by $Universe (use of ssh-agent)
▪ Bastion hosts can interact with their HDFS cell (hadoop fs commands)
▪ Bastion hosts can launch jobs
▪ Admin tasks, user provisioning done on NameNode
▪ Kerberos Security not used (yet?)
▪ Need for pluggable security mechanism, using SSH signed tokens
21. We are a Piggy bank ...
Attribution: www.seniorliving.org
22. Why Pig?
▪ We <3 the '1 relation per line' approach, « no SQHell™ »
▪ No metadata service to maintain
▪ Ability to add UDFs
▪ A whole lot already added, more on this later...
▪ Batch scheduling
▪ Can handle all the data we store in HDFS
▪ Still open to other tools (Hive, Zohmg, ...)
23. com.arkea.commons.pig.SequenceFileLoadFunc
▪ Generic load function for our BytesWritable SequenceFiles
▪ Relies on Helper classes to interpret the record bytes
SequenceFileLoadFunc('HelperClass', 'param', ...)
▪ Helper classes can also be used in regular MapReduce jobs
▪ SequenceFileLoadFunc outputs the following schema
{
    key: bytearray,
    value: bytearray,
    parsed: (
        Helper-dependent schema
    )
}
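Since the Helper classes can also be used in regular MapReduce jobs, here is a minimal sketch of a mapper over these BytesWritable SequenceFiles; the field offset and the EBCDIC code page are assumptions standing in for what a real Helper class would derive from a COPYBOOK.

import java.io.IOException;
import java.nio.charset.Charset;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RecordCountMapper
        extends Mapper<BytesWritable, BytesWritable, Text, LongWritable> {

    private static final LongWritable ONE = new LongWritable(1);
    private static final Charset EBCDIC = Charset.forName("IBM1047"); // assumed code page

    @Override
    protected void map(BytesWritable key, BytesWritable value, Context context)
            throws IOException, InterruptedException {
        byte[] bytes = value.getBytes();               // raw record, valid up to getLength()
        if (value.getLength() < 10) {
            return;                                    // skip short records in this sketch
        }
        // Hypothetical record layout: the first 10 characters hold an account id.
        // In the real jobs this decoding is what a Helper class does from the COPYBOOK.
        String accountId = new String(bytes, 0, 10, EBCDIC).trim();
        context.write(new Text(accountId), ONE);
    }
}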
25. Initial Pig Target
Corpus of SAS 'proc sql' scripts
From sample-based analysis to the full population
Need to give users tools that can reproduce what they did in their SAS scripts
26. Groovy Closure Pig UDF
DEFINE InlineGroovyUDF cac.pig.udf.GroovyClosure(SCHEMA, CODE);
DEFINE FileGroovyUDF cac.pig.udf.GroovyClosure(SCHEMA, '/path/to/closure.groovy');
SCHEMA uses the standard Pig Schema syntax, e.g. 'str: chararray'
CODE is a short Groovy Closure, e.g. '{ a,b,c -> return a.replaceAll(b,c); }'
closure.groovy must be in a REGISTERed jar under path/to
27. closure.groovy layout
//
// Import statements
//
import ....;

//
// Constants definitions
//
/**
 * Documentation for XXX
 */
final def XXX = ....;

//
// Closure definition
//
/**
 * Documentation for CLOSURE
 *
 * @param a ...
 * @param b ...
 * @param ...
 *
 * @return ...
 */
final def CLOSURE = { a, b, ... ->
    ...
    ...
    return ...;
}

//
// Unit Tests
//
// Test specific comment ...
assert CLOSURE('A') == ...;

//
// Return Closure for usage in Pig
//
return CLOSURE;
28. Pig to Groovy
bag -> java.util.List
tuple -> Object[]
map -> java.util.Map
int -> int
long -> long
float -> float
double -> double
chararray -> java.lang.String
bytearray -> byte[]
Groovy to Pig
groovy.lang.Tuple -> tuple
Object[] -> tuple
java.util.List -> bag
java.util.Map -> map
byte/short/int -> int
long/BigInteger -> long
float -> float
double/BigDecimal -> double
java.lang.String -> chararray
byte[] -> bytearray
30. ⊕
▪ Fast and rich data pipeline between z/OS and Hadoop
▪ Pig Toolbox to analyze COBOL/DB2 data alongside Thrift/MySQL/xSV/...
▪ Groovy Closure support for rapid extension
▪ Still some missing features
Pure Java compression codecs (JNI on z/OS anyone?)
Pig support for BigInteger / BigDecimal (2^45 might not be enough)
SSH(RSA) based auth tokens
▪ And yet another hard challenge: Cultural Change