Leveraging Hadoop for Legacy Systems

   Mathias Herberts - @herberts
Crédit Mutuel Arkéa Key Facts & Figures (as of 2011-06-30)
A Regional Bank with a National Network
Why Hadoop?
▪ Ever-increasing volume of data

▪ Highly regulated sector (Basel II/III, Solvency II)

    ▪ Need to produce compliance reports

▪ Competitive sector

    ▪ Need to create value, data identified as a great source of it

▪ Keep costs under control
▪ Fond of Open Source
▪ Engineers like big challenges
What Challenge?
Storing Data
Types of logical storage

▪ Virtual Storage Access Method (VSAM)
      Record-oriented (fixed or variable length) indexed datasets

▪ Physically Sequential (PS)
      Record-oriented (fixed or variable length) datasets, not indexed
      Can exist on different types of media

▪ IBM DB2 Relational Model Database Server
Types of binary records stored

▪ COBOL Records (conform to a COPYBOOK)

▪ DB2 'UNLOAD' Records (conform to a DDL statement)
Types of data stored in HDFS

▪ {Tab, Comma, ...} Separated Values
      One-line records of multiple columns

▪ Text
      Line-oriented (e.g. logs)

▪ Hadoop SequenceFiles (see the sketch below)
      Block compressed
       ▪ Mostly BytesWritable key/value
         ▪ COBOL records
         ▪ DB2 unloaded records
         ▪ Serialized Thrift structures
       ▪ Use of DefaultCodec (pure Java)
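
A minimal sketch (in Groovy, against Hadoop's standard Java API) of writing such a file; the target path and record bytes are placeholders, not the actual production flow:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.BytesWritable
import org.apache.hadoop.io.SequenceFile
import org.apache.hadoop.io.compress.DefaultCodec

def conf = new Configuration()
def fs   = FileSystem.get(conf)

// Block-compressed SequenceFile, BytesWritable key/value, pure Java DefaultCodec
def writer = SequenceFile.createWriter(fs, conf,
    new Path('/hdfs/staging/in/example.seq'),   // placeholder path
    BytesWritable, BytesWritable,
    SequenceFile.CompressionType.BLOCK,
    new DefaultCodec())

// Stand-ins for raw COBOL / DB2 unload record bytes
def records = ['record-1'.bytes, 'record-2'.bytes]

records.each { byte[] rec ->
    // Empty key, record bytes as value; record boundaries are preserved
    writer.append(new BytesWritable(), new BytesWritable(rec))
}
writer.close()
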
Moving Data
Standard data transfer process

  ▪ On-the-fly charset conversion
  ▪ Notion of records is lost
Hadoop data transfer process

  ▪ On-the-fly compression
  ▪ Original charset kept
  ▪ Notion of records preserved
Staging Server

▪ Gateway In & Out of an HDFS Cell
▪ Reads/Writes to /hdfs/staging/{in,out}/... (runs as hdfs)
▪ HTTP Based (POST/GET)

▪ Upload to http://hadoop-staging/put[/hdfs/staging/in/...]
   Stores directly in HDFS, no intermediary storage
   Multiple files support
   Random target directory created if none specified
   Parameters user, group, perm, suffix
   curl -F "file=@local;filename=remote" "http://../put?user=foo&group=bar&perm=644&suffix=.test"


▪ Download from http://hadoop-staging/get/hdfs/staging/out/...
   Ability to unpack SequenceFile records (unpack={base64,hex}) as key:value lines
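
A hedged sketch (Groovy) of consuming that unpack output; the host and output path are placeholders, only the key:value line format comes from the slide above:

// unpack=hex makes the staging server emit one hex-encoded
// key:value line per SequenceFile record
def url = 'http://hadoop-staging/get/hdfs/staging/out/example?unpack=hex'

new URL(url).eachLine { line ->
    def (k, v) = line.split(':', 2)
    byte[] key   = k.decodeHex()   // raw record key bytes
    byte[] value = v.decodeHex()   // raw record value bytes
    // process the record bytes...
}
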
fileutil

▪   Swiss Army Knife for SequenceFiles, HDFS Staging Server, ZooKeeper
▪   Written in Java, single jar
▪   Works in all our environments (z/OS, Unix, Windows, ...)
▪   Can be run using TWS/OPC on z/OS (via a JCL), $Universe on Unix, cron ...
▪   Multiple commands
      sfstage             Convert a z/OS dataset to a SF and push it to the staging server
      {stream,file}stage  Push a stream or files to the staging server
      filesfstage         Convert a file to a SF (one record per block) and stage it
      sfpack              Pack key:value lines (cf unpack) in a SequenceFile
      sfarchive           Create a SequenceFile, one record per input file
      zk{ls,lsr,cat,stat} Read data from ZooKeeper
      get                 Retrieve data via URI
      ...
Accessing Data
Data Organization

▪ Use of a directory structure that mimics the dataset names (see the sketch below)

      PR0172.PVS00.F7209588

      Environment / Silo / Application

      /hdfs/data/si/PR/01/72/PR0172.PVS00.F7209588.SUFFIX

▪   Group ACLs at the Environment/Silo/Application levels
▪   Suffix is mainly used to add .YYYYMM to Generation Data Groups
▪   Suffix added by the staging server
▪   DB2 Table unloads follow similar rules

      P11DBA.T90XXXX
      S4SDWH11.T4S02CTSC_H
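
A minimal sketch of that naming rule in Groovy, assuming the three directory levels are the first three 2-character slices of the dataset name (as in the example above); the .YYYYMM suffix value is hypothetical:

// Map a z/OS dataset name to its HDFS location (slicing assumption:
// one 2-character slice per Environment/Silo/Application level)
String hdfsPath(String dataset, String suffix = '') {
    def (env, silo, app) = [dataset[0..1], dataset[2..3], dataset[4..5]]
    return "/hdfs/data/si/${env}/${silo}/${app}/${dataset}${suffix}"
}

assert hdfsPath('PR0172.PVS00.F7209588', '.201106') ==
       '/hdfs/data/si/PR/01/72/PR0172.PVS00.F7209588.201106'
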
Bastion Hosts

▪   Hadoop Cells are isolated, all accesses MUST go through a bastion host
▪   All accesses to the bastion hosts are authenticated via SSH keys
▪   Users log in using their own user
▪   No SSH port forwarding allowed
▪   All shell commands are logged
▪   Batches scheduled on bastion hosts by $Universe (use of ssh-agent)

▪ Bastion hosts can interact with their HDFS cell (hadoop fs commands)
▪ Bastion hosts can launch jobs

▪ Admin tasks, user provisioning done on NameNode

▪ Kerberos Security not used (yet?)
▪ Need for pluggable security mechanism, using SSH signed tokens
Working With Data
We are a Piggy bank ...
Why Pig?

▪ We <3 the '1 relation per line' approach, « no SQHell™ »

▪ No metadata service to maintain
▪ Ability to add UDFs
    ▪ A whole lot already added, more on this later...

▪ Batch scheduling
▪ Can handle all the data we store in HDFS

▪ Still open to other tools (Hive, Zohmg, ...)
com.arkea.commons.pig.SequenceFileLoadFunc

▪ Generic load function for our BytesWritable SequenceFiles
▪ Relies on Helper classes to interpret the record bytes
    SequenceFileLoadFunc('HelperClass', 'param', ...)
▪ Helper classes can also be used in regular MapReduce jobs (see the sketch after the schema below)

▪ SequenceFileLoadFunc outputs the following schema

{
    key: bytearray,
    value: bytearray,
    parsed: (
      Helper dependent schema
    )
}
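
A hedged sketch of that MapReduce reuse (Groovy): the mapper below only shows the pattern — BytesWritable records read from a SequenceFile, record bytes handed to a Helper — since the Helpers' actual method signatures are not shown in this deck:

import org.apache.hadoop.io.BytesWritable
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.Mapper

// Mapper over BytesWritable key/value SequenceFile records
class RecordMapper extends Mapper<BytesWritable, BytesWritable, Text, NullWritable> {
    void map(BytesWritable key, BytesWritable value, Context context) {
        // Copy out the exact record bytes
        byte[] record = Arrays.copyOf(value.getBytes(), value.getLength())
        // ... interpret 'record' with the appropriate Helper class here
        // (Helper method names are a guess, hence left as a comment) ...
        context.write(new Text(Integer.toString(record.length)), NullWritable.get())
    }
}
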
Helper Classes

▪ COBOL – com.arkea.commons.pig.COBOLBinaryHelper
        ▪ COPYBOOK
▪ Thrift – com.arkea.commons.pig.ThriftBinaryHelper
        ▪ .class
▪ DB2 Unload – com.arkea.commons.pig.DB2UnloadBinaryHelper
        ▪ DDL + load script
▪ MySQL – com.arkea.commons.pig.MySQLBinaryHelper
        ▪ DDL
▪ ...
Initial Pig Target

▪ 'proc sql' SAS Corpus
      from sample to population

Need to give users tools that can reproduce what they did in their scripts
Groovy Closure Pig UDF

DEFINE InlineGroovyUDF cac.pig.udf.GroovyClosure(SCHEMA, CODE);

DEFINE FileGroovyUDF cac.pig.udf.GroovyClosure(SCHEMA, '/path/to/closure.groovy');

SCHEMA uses the standard Pig Schema syntax, e.g. 'str: chararray'

CODE is a short Groovy Closure, e.g. '{ a,b,c -> return a.replaceAll(b,c); }'

closure.groovy must be in a REGISTERed jar under path/to
//
// Import statements
//

import ....;

//
// Constants definitions
//

/**
 * Documentation for XXX
 */
final def XXX = ....;

//
// Closure definition
//

/**
 * Documentation for CLOSURE
 *
 * @param a ...
 * @param b ...
 * @param ...
 *
 * @return ...
 */
final def CLOSURE = {
    a,b,... ->
    ...
    ...
    return ...;
}

//
// Unit Tests
//

// Test specific comment ...
assert CLOSURE('A') == ...;

//
// Return Closure for usage in Pig
//

return CLOSURE;
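
For instance, a minimal closure.groovy following this template (file name, schema and behaviour are hypothetical; the closure body mirrors the CODE example above):

//
// Closure definition
//

/**
 * Replace every match of a regular expression in a string.
 *
 * @param s    input string
 * @param re   regular expression
 * @param repl replacement string
 *
 * @return s with every match of re replaced by repl
 */
final def CLOSURE = { s, re, repl ->
    return s.replaceAll(re, repl);
}

//
// Unit Tests
//

assert CLOSURE('a-b-c', '-', '.') == 'a.b.c';

//
// Return Closure for usage in Pig
//

return CLOSURE;

It would then be declared with something like DEFINE Replace cac.pig.udf.GroovyClosure('str: chararray', '/path/to/closure.groovy'); and called as any other Pig UDF.
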
Pig to Groovy

bag -> java.util.List
tuple -> Object[]
map -> java.util.Map
int -> int
long -> long
float -> float
double -> double
chararray -> java.lang.String
bytearray -> byte[]

Groovy to Pig

groovy.lang.Tuple -> tuple
Object[] -> tuple
java.util.List -> bag
java.util.Map -> map
byte/short/int -> int
long/BigInteger -> long
float -> float
double/BigDecimal -> double
java.lang.String -> chararray
byte[] -> bytearray
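
A hypothetical closure illustrating several of these conversions at once (names and shapes invented for the example; per the table above, a Pig bag arrives as a java.util.List):

/**
 * Return a tuple (bag of words, per-word count map, total count).
 */
final def STATS = { List words ->
    def counts = [:]
    words.each { counts[it] = (counts[it] ?: 0L) + 1L }   // java.util.Map -> pig map
    // Object[] -> pig tuple; the List inside -> pig bag; long -> pig long
    return [words, counts, (long) words.size()] as Object[]
}

assert STATS(['a', 'b', 'a'])[2] == 3L

return STATS
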
Wrap Up
⊕

▪ Fast and rich data pipeline between z/OS and Hadoop

▪ Pig Toolbox to analyze COBOL/DB2 data alongside Thrift/MySQL/xSV/...

▪ Groovy Closure support for rapid extension


▪ Still some missing features
    Pure Java compression codecs (JNI on z/OS anyone?)
    Pig support for BigInteger / BigDecimal (2^45 might not be enough)
    SSH(RSA) based auth tokens



▪ And yet another hard challenge: Cultural Change
http://www.arkea.com/

@herberts

Appendix
com.arkea.commons.pig.COBOLBinaryHelper
REGISTER copybook.jar;
A = LOAD '$data' USING cacp.SequenceFileLoadFunc('cacp.COBOLBinaryHelper','[PREFIX:]COPYBOOK');

COPYBOOK:

        000010*GAR* OS Y7XRRDC         DESCRIPTION RRDC NOUVAU FORMAT               30000020
        000020* LG=00328, ESD MAJ LE 04/12/98, ELS MAJ LE 26/01/01 PAR   C98310     30000030
        000030* GENERE LE 26/01/01 A 17H01, PFX : Y7XRRD-     MEMBRE :   Y7XRRDC    30000040
        000040 01         Y7XRRD-Y7XRRDC.                                           30000050
        000050*             DESCRIPTION RRDC NOUVAU FORMAT           1   04/12/98   30000060
        000060   03       Y7XRRD-ARTDS-CLE-SECD.                                    30000070
        000070*             CLE SECONDAIRE ARCHIVAGE TENU DE SOLDE   1   11/02/98   30000080
        000080     05     Y7XRRD-NO-CCM       PIC X(4).                             30000090
        000090*             NUMERO CAISSE                            1   28/12/94   30000100
        000100     05     Y7XRRD-NO-PSE       PIC X(8).                             30000110
        000110*             NUMERO PERSONNE                          5   10/07/97   30000120
        000120     05     Y7XRRD-CATEGORIE    PIC X(2).                             30000130
        000130*             CATéGORIE DU COMPTE                     13   09/01/01   30000140
        000140     05     Y7XRRD-RANG         PIC X(2).                             30000150
        010010*             RANG                                    15   22/01/01   30000160
        010020     05     Y7XRRD-NO-ORDRE     PIC X(2).                             30000170
        010030*             Numéro d'ordre                          17   28/12/94   30000180
        010040     05     Y7XRRD-DA-TT-C2     PIC X(8).                             30000190
        010050*             DATE TRAITEMENT                 SX:-C2 19     -   -     30000200
        010060     05     Y7XRRD-NO-ORDRE-ENR-C2 PIC 9(6).                          30000210
        010070*             Numéro d'ordre enregistrement   SX:-C2 27     -   -     30000220
        010080   03       Y7XRRD-MT-OPE-TDS   PIC S9(13)V9(2) COMP-3.               30000230
        010090*             MONTANT OPERATION TENUE-DE-SOLDE        33   03/02/98   30000240
        010100   03       Y7XRRD-CD-DVS-ORI-OPE PIC X(4).                           30000250
        010110*             CODE DEVISE ORIGINE OPERATION           41    -   -     30000260
        010120   03       Y7XRRD-CD-DVS-GTN-TDS PIC X(4).                           30000270
        010130*             CODE DEVISE GESTION TENUE-DE-SOLDE      45    -   -     30000280
        010140   03       Y7XRRD-MT-CNVS-OPE PIC S9(13)V9(2) COMP-3.                30000290
        020010*             MONTANT CONVERTI OPERATION              49    -   -     30000300
        020020   03       Y7XRRD-IDC-ATN-ORI-MT PIC X(1).                           30000310
        020030*             INDICATEUR AUTHENTICITE ORIGINE MONTAN 57    05/12/97   30000320
        020040   03       Y7XRRD-SLD-AV-IMPT PIC S9(13)V9(2) COMP-3.                30000330
        020050*             SOLDE AVANT IMPUTATION                  58   03/02/98   30000340
        020060   03       Y7XRRD-DA-OPE-TDS   PIC X(8).                             30000350
        020070*             DATE OPERATION TENUE-DE-SOLDE           66    -   -     30000360
        020080   03       Y7XRRD-DA-VLR       PIC X(8).                             30000370
        020090*             DATE VALEUR                             74   28/12/94   30000380
        020100   03       Y7XRRD-DA-ARR       PIC X(8).                             30000390
        020110*             DATE ARRETE                             82    -   -     30000400
        020120   03       Y7XRRD-NO-STR-OPE   PIC X(6).                             30000410
        020130*             NUMERO STRUCTURE OPERATIONNELLE         90    -   -     30000420
        020140   03       Y7XRRD-NO-REF-TNL-MED PIC X(4).                           30000430
        030010*             NUMERO REFERENCE TERMINAL MEDIA         96   03/02/98   30000440
        030020   03       Y7XRRD-NO-LOT       PIC X(3).                             30000450
        030030*             NUMéRO DE LOT                          100   13/10/97   30000460
        030040   03       Y7XRRD-TDS-LIBELLES.                                      30000470
        030050*             FAMILLE MONTANTS OPERATION T.DE.SOLDE 103    05/02/98   30000480
        030060     05     Y7XRRD-LIB-CLI-OPE-1 PIC X(50).                           30000490
        030070*             LIBELLE CLIENT OPERATION        SX:-1 103    03/02/98   30000500
        030080     05     Y7XRRD-LIB-ITE-OPE PIC X(32).                             30000510
        030090*             LIBELLE INTERNE OPERATION              153    -   -     30000520
        030100     05     Y7XRRD-LIB-CT-CLI   PIC X(32).                            30000530
        030110*             LIBELLE COURT CLIENT                   185    -   -     30000540
        030120   03       Y7XRRD-CD-UTI-LIB-CPL PIC X(1).                           30000550
        030130*             Code utilisation libellés compl.       217   28/12/94   30000560
        030140   03       Y7XRRD-IDC-COM-OPE PIC X(1).                              30000570
        040010*             INDICATEUR COMMISSION OPERATION        218   03/02/98   30000580
        040020   03       Y7XRRD-CD-TY-OPE-NIV-1 PIC X(1).                          30000590
        040030*             CODE TYPE OPERATION NIVEAU UN          219    -   -     30000600
        040040   03       Y7XRRD-CD-TY-OPE-NIV-2 PIC X(2).                          30000610
        040050*             CODE TYPE OPERATION NIVEAU DEUX        220    -   -     30000620
        040060   03       FILLER              PIC X(7).                             30000630
        040070*                                                    222              30000640
        040080   03       Y7XRRD-TDS-LIB-SUPPL.                                     30000650
        040090*             FAMILLE LIBELLES COMPLEMENTAIRES T.D.S 229   17/02/98   30000660
        040100     05     Y7XRRD-LIB-CLI-OPE-02 PIC X(50).                          30000670
        040110*             LIBELLE CLIENT OPERATION        SX:-02 229   03/02/98   30000680
        040120     05     Y7XRRD-LIB-CLI-OPE-03 PIC X(50).                          30000690
        040130*             LIBELLE CLIENT OPERATION        SX:-03 279    -   -     30000700

Resulting Pig schema:

        A: {
            key: bytearray,
            value: bytearray,
            parsed: (
                Y7XRRD_Y7XRRDC: bytearray,
                Y7XRRD_ARTDS_CLE_SECD: bytearray,
                Y7XRRD_NO_CCM: chararray,
                Y7XRRD_NO_PSE: chararray,
                Y7XRRD_CATEGORIE: chararray,
                Y7XRRD_RANG: chararray,
                Y7XRRD_NO_ORDRE: chararray,
                Y7XRRD_DA_TT_C2: chararray,
                Y7XRRD_NO_ORDRE_ENR_C2: long,
                Y7XRRD_MT_OPE_TDS: double,
                Y7XRRD_CD_DVS_ORI_OPE: chararray,
                Y7XRRD_CD_DVS_GTN_TDS: chararray,
                Y7XRRD_MT_CNVS_OPE: double,
                Y7XRRD_IDC_ATN_ORI_MT: chararray,
                Y7XRRD_SLD_AV_IMPT: double,
                Y7XRRD_DA_OPE_TDS: chararray,
                Y7XRRD_DA_VLR: chararray,
                Y7XRRD_DA_ARR: chararray,
                Y7XRRD_NO_STR_OPE: chararray,
                Y7XRRD_NO_REF_TNL_MED: chararray,
                Y7XRRD_NO_LOT: chararray,
                Y7XRRD_TDS_LIBELLES: bytearray,
                Y7XRRD_LIB_CLI_OPE_1: chararray,
                Y7XRRD_LIB_ITE_OPE: chararray,
                Y7XRRD_LIB_CT_CLI: chararray,
                Y7XRRD_CD_UTI_LIB_CPL: chararray,
                Y7XRRD_IDC_COM_OPE: chararray,
                Y7XRRD_CD_TY_OPE_NIV_1: chararray,
                Y7XRRD_CD_TY_OPE_NIV_2: chararray,
                FILLER: chararray,
                Y7XRRD_TDS_LIB_SUPPL: bytearray,
                Y7XRRD_LIB_CLI_OPE_02: chararray,
                Y7XRRD_LIB_CLI_OPE_03: chararray
            )
        }
com.arkea.commons.pig.DB2UnloadBinaryHelper
 REGISTER ddl-load.jar;
 A = LOAD '$data' USING cacp.SequenceFileLoadFunc('cacp.DB2UnloadBinaryHelper','[PREFIX:]TABLE');



.ddl:

        CREATE TABLE SHDBA.TBDCOLS
        (COL_CHAR CHAR(4) FOR SBCS DATA WITH DEFAULT NULL,
        COL_DECIMAL DECIMAL(15, 2) WITH DEFAULT NULL,
        COL_NUMERIC DECIMAL(15, 0) WITH DEFAULT NULL,
        COL_SMALLINT SMALLINT WITH DEFAULT NULL,
        COL_INTEGER INTEGER WITH DEFAULT NULL,
        COL_VARCHAR VARCHAR(50) FOR SBCS DATA WITH DEFAULT NULL,
        COL_DATE DATE WITH DEFAULT NULL,
        COL_TIME TIME WITH DEFAULT NULL,
        COL_TIMESTAMP TIMESTAMP WITH DEFAULT NULL) ;

.load:

        TEMPLATE DFEM8ERT
        DSN('XXXXX.PPSDR.B99BD02.SBDCOLS.REC')
        DISP(OLD,KEEP,KEEP)
        LOAD DATA INDDN DFEM8ERT LOG NO RESUME YES
        EBCDIC CCSID(01147,00000,00000)
        INTO TABLE "SHDBA"."TBDCOLS"
        WHEN(00001:00002) = X'003F'
        ( "COL_CHAR" POSITION( 00004:00007) CHAR(00004) NULLIF(00003)=X'FF',
        "COL_DECIMAL" POSITION( 00009:00016) DECIMAL NULLIF(00008)=X'FF',
        "COL_NUMERIC" POSITION( 00018:00025) DECIMAL NULLIF(00017)=X'FF',
        "COL_SMALLINT" POSITION( 00027:00028) SMALLINT NULLIF(00026)=X'FF',
        "COL_INTEGER" POSITION( 00030:00033) INTEGER NULLIF(00029)=X'FF',
        "COL_VARCHAR" POSITION( 00035:00086) VARCHAR NULLIF(00034)=X'FF',
        "COL_DATE" POSITION( 00088:00097) DATE EXTERNAL NULLIF(00087)=X'FF',
        "COL_TIME" POSITION( 00099:00106) TIME EXTERNAL NULLIF(00098)=X'FF',
        "COL_TIMESTAMP" POSITION( 00108:00133) TIMESTAMP EXTERNAL NULLIF(00107)=X'FF'
        )

Resulting Pig schema:

        A: {
            key: bytearray,
            value: bytearray,
            parsed: (
                COL_CHAR: chararray,
                COL_DECIMAL: double,
                COL_NUMERIC: long,
                COL_SMALLINT: long,
                COL_INTEGER: long,
                COL_VARCHAR: chararray,
                COL_DATE: chararray,
                COL_TIME: chararray,
                COL_TIMESTAMP: chararray
            )
        }

Can also handle DB2 UDB unloads (done using hpu, IBM's High Performance Unload)
We're Thrifty too...

   REGISTER thrift-generated.jar;
   A = LOAD '$data' USING cacp.SequenceFileLoadFunc('cacp.ThriftBinaryHelper','CLASS');

Thrift IDL:

        struct Redirection {
          1: string alias,
          2: string url,
          3: string email,
          4: i64 timestamp,
          5: i64 lastupdate,
          6: list<string> params,
          7: bool external = 1,
          8: i64 owner,
          9: string user,
        }

Resulting Pig schema:

        A: {
            key: bytearray,
            value: bytearray,
            parsed: (
                alias: chararray,
                url: chararray,
                email: chararray,
                timestamp: long,
                lastupdate: long,
                params: (),
                external: long,
                owner: long,
                user: chararray
            )
        }




... and also use MySQL ...

   REGISTER mysql-ddl.jar;
   A = LOAD '$data' USING cacp.SequenceFileLoadFunc('cacp.MySQLBinaryHelper','TABLE');



... etc etc etc ...
