5. Why Hadoop?
▪ Ever increasing volume of data
▪ Heavily regulated sector (Basel II/III, Solvency II)
▪ Need to produce compliance reports
▪ Competitive sector
▪ Need to create value, data identified as a great source of it
▪ Keep costs under control
▪ Fond of Open Source
▪ Engineers like big challenges
9. Types of logical storage
Virtual Storage Access Method (VSAM)
Record-oriented (fixed- or variable-length) indexed datasets
Physical Sequential (PS)
Record-oriented (fixed- or variable-length) datasets, not indexed
Can exist on different types of media
IBM DB2
Relational database server
10. Types of binary records stored
COBOL Records (conforming to a COPYBOOK)
DB2 'UNLOAD' Records (conforming to a DDL statement)
11. Types of data stored in HDFS
{Tab, Comma, ...} Separated Values
One-line records with multiple columns
Text
Line-oriented (e.g. logs)
Hadoop SequenceFiles
Block compressed
▪ Mostly BytesWritable key/value
▪ COBOL records
▪ DB2 unloaded records
▪ Serialized Thrift structures
▪ Use of DefaultCodec (pure Java)
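A minimal sketch of how such a file can be produced with the standard Hadoop API, assuming block compression with DefaultCodec and BytesWritable key/value as described above; the path and record bytes are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.DefaultCodec;

public class WriteBlockCompressedSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/hdfs/data/example.seq");   // placeholder path

        // Block-compressed SequenceFile with the pure-Java DefaultCodec,
        // BytesWritable key and value, as used for COBOL / DB2 / Thrift records.
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, path,
                BytesWritable.class, BytesWritable.class,
                SequenceFile.CompressionType.BLOCK, new DefaultCodec());
        try {
            byte[] record = "raw record bytes".getBytes("ISO-8859-1"); // placeholder record
            writer.append(new BytesWritable(), new BytesWritable(record)); // empty key, record as value
        } finally {
            writer.close();
        }
    }
}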
13. Standard data transfer process
▪ On-the-fly charset conversion
▪ Record boundaries are lost
14. Hadoop data transfer process
▪ On-the-fly compression
▪ Original charset kept
▪ Record boundaries preserved
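A minimal sketch of the record-preserving idea, assuming a fixed record length; the real transfer is handled by the staging server and fileutil described on the next slides, so class and method names here are only illustrative. Records are read as raw bytes, so neither the charset nor the record boundaries are touched.

import java.io.DataInputStream;
import java.io.EOFException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class FixedLengthRecords {
    // Reads a fixed-record-length dataset as raw bytes: no charset conversion,
    // one byte[] per record, so record boundaries survive the transfer.
    public static List<byte[]> read(InputStream in, int recordLength) throws Exception {
        List<byte[]> records = new ArrayList<byte[]>();
        DataInputStream data = new DataInputStream(in);
        byte[] buffer = new byte[recordLength];
        while (true) {
            try {
                data.readFully(buffer);        // exactly one record
            } catch (EOFException end) {
                break;                         // end of dataset (a truncated trailing record is ignored here)
            }
            records.add(buffer.clone());       // each entry later becomes one SequenceFile value
        }
        return records;
    }
}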
15. Staging Server
▪ Gateway In & Out of an HDFS Cell
▪ Reads/Writes to /hdfs/staging/{in,out}/... (runs as hdfs)
▪ HTTP Based (POST/GET)
▪ Upload to http://hadoop-staging/put[/hdfs/staging/in/...]
Stores directly in HDFS, no intermediary storage
Multiple files support
Random target directory created if none specified
Parameters user, group, perm, suffix
curl -F "file=@local;filename=remote" "http://../put?user=foo&group=bar&perm=644&suffix=.test"
▪ Download from http://hadoop-staging/get/hdfs/staging/out/...
Ability to unpack SequenceFile records (unpack={base64,hex}) as key:value lines
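A minimal sketch of a programmatic download, assuming the GET endpoint and the unpack parameter shown above; the exact path under /hdfs/staging/out/ is only illustrative.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class StagingDownload {
    public static void main(String[] args) throws Exception {
        // unpack=hex turns each SequenceFile record into a hex-encoded key:value line
        URL url = new URL(
            "http://hadoop-staging/get/hdfs/staging/out/example/part-00000?unpack=hex");
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);      // one key:value line per record
            }
        } finally {
            in.close();
        }
    }
}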
16. fileutil
▪ Swiss Army Knife for SequenceFiles, HDFS Staging Server, ZooKeeper
▪ Written in Java, single jar
▪ Works in all our environments (z/OS, Unix, Windows, ...)
▪ Can be run using TWS/OPC on z/OS (via a JCL), $Universe on Unix, cron, ...
▪ Multiple commands
sfstage Convert a z/OS dataset to a SF and push it to the staging server
{stream,file}stage Push a stream or files to the staging server
filesfstage Convert a file to a SF (one record per block) and stage it
sfpack Pack key:value lines (cf unpack) in a SequenceFile
sfarchive Create a SequenceFile, one record per input file
zk{ls,lsr,cat,stat} Read data from ZooKeeper
get Retrieve data via URI
...
18. Data Organization
▪ Use of a directory structure that mimics the dataset names
Example: PR0172.PVS00.F7209588 (Environment / Silo / Application)
→ /hdfs/data/si/PR/01/72/PR0172.PVS00.F7209588.SUFFIX
▪ Group ACLs at the Environment/Silo/Application levels
▪ Suffix is mainly used to add .YYYYMM to Generation Data Groups
▪ Suffix added by the staging server
▪ DB2 Table unloads follow similar rules
P11DBA.T90XXXX
S4SDWH11.T4S02CTSC_H
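A minimal sketch of the dataset naming rule illustrated above, assuming the first qualifier always splits into three two-character Environment / Silo / Application components; the /hdfs/data/si prefix is taken from the example and the class and method names are ours.

public class DatasetPaths {
    // Derives the HDFS path from a z/OS dataset name, mimicking the convention above.
    static String toHdfsPath(String datasetName, String suffix) {
        String qualifier = datasetName.substring(0, datasetName.indexOf('.')); // "PR0172"
        String env  = qualifier.substring(0, 2);   // Environment, e.g. "PR"
        String silo = qualifier.substring(2, 4);   // Silo,        e.g. "01"
        String app  = qualifier.substring(4, 6);   // Application, e.g. "72"
        return "/hdfs/data/si/" + env + "/" + silo + "/" + app + "/" + datasetName + suffix;
    }

    public static void main(String[] args) {
        // Prints /hdfs/data/si/PR/01/72/PR0172.PVS00.F7209588.201106 (the GDG suffix is illustrative)
        System.out.println(toHdfsPath("PR0172.PVS00.F7209588", ".201106"));
    }
}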
19. Bastion Hosts
▪ Hadoop Cells are isolated, all accesses MUST go through a bastion host
▪ All accesses to the bastion hosts are authenticated via SSH keys
▪ Users log in with their own account
▪ No SSH port forwarding allowed
▪ All shell commands are logged
▪ Batches scheduled on bastion hosts by $Universe (use of ssh-agent)
▪ Bastion hosts can interact with their HDFS cell (hadoop fs commands)
▪ Bastion hosts can launch jobs
▪ Admin tasks, user provisioning done on NameNode
▪ Kerberos Security not used (yet?)
▪ Need for pluggable security mechanism, using SSH signed tokens
21. We are a Piggy bank ...
Attribution: www.seniorliving.org
22. Why Pig?
▪ We <3 the '1 relation per line' approach, « no SQHell™ »
▪ No metadata service to maintain
▪ Ability to add UDFs
▪ A whole lot already added, more on this later...
▪ Batch scheduling
▪ Can handle all the data we store in HDFS
▪ Still open to other tools (Hive, Zohmg, ...)
23. com.arkea.commons.pig.SequenceFileLoadFunc
▪ Generic load function for our BytesWritable SequenceFiles
▪ Relies on Helper classes to interpret the record bytes
SequenceFileLoadFunc('HelperClass', 'param', ...)
▪ Helper classes can also be used in regular MapReduce jobs
▪ SequenceFileLoadFunc outputs the following schema
{
    key: bytearray,
    value: bytearray,
    parsed: (
        Helper-dependent schema
    )
}
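Since the Helper classes can also be used in regular MapReduce jobs, here is a minimal sketch of a mapper over these BytesWritable SequenceFiles; the field offset and the EBCDIC code page are assumptions standing in for what a real Helper class would derive from a COPYBOOK.

import java.io.IOException;
import java.nio.charset.Charset;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RecordCountMapper
        extends Mapper<BytesWritable, BytesWritable, Text, LongWritable> {

    private static final LongWritable ONE = new LongWritable(1);
    private static final Charset EBCDIC = Charset.forName("IBM1047"); // assumed code page

    @Override
    protected void map(BytesWritable key, BytesWritable value, Context context)
            throws IOException, InterruptedException {
        byte[] bytes = value.getBytes();               // raw record, valid up to getLength()
        if (value.getLength() < 10) {
            return;                                    // skip short records in this sketch
        }
        // Hypothetical record layout: the first 10 characters hold an account id.
        // In the real jobs this decoding is what a Helper class does from the COPYBOOK.
        String accountId = new String(bytes, 0, 10, EBCDIC).trim();
        context.write(new Text(accountId), ONE);
    }
}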
25. Initial Pig Target
Corpus of SAS 'proc sql' scripts
From sample-based analysis to the full population
Need to give users tools that can reproduce what they did in their SAS scripts
26. Groovy Closure Pig UDF
DEFINE InlineGroovyUDF cac.pig.udf.GroovyClosure(SCHEMA, CODE);
DEFINE FileGroovyUDF cac.pig.udf.GroovyClosure(SCHEMA, '/path/to/closure.groovy');
SCHEMA uses the standard Pig Schema syntax, e.g. 'str: chararray'
CODE is a short Groovy Closure, e.g. '{ a,b,c -> return a.replaceAll(b,c); }'
closure.groovy must be in a REGISTERed jar under path/to
27. closure.groovy layout
//
// Import statements
//
import ....;

//
// Constants definitions
//
/**
 * Documentation for XXX
 */
final def XXX = ....;

//
// Closure definition
//
/**
 * Documentation for CLOSURE
 *
 * @param a ...
 * @param b ...
 * @param ...
 *
 * @return ...
 */
final def CLOSURE = { a, b, ... ->
    ...
    ...
    return ...;
}

//
// Unit Tests
//
// Test specific comment ...
assert CLOSURE('A') == ...;

//
// Return Closure for usage in Pig
//
return CLOSURE;
28. Pig to Groovy
bag -> java.util.List
tuple -> Object[]
map -> java.util.Map
int -> int
long -> long
float -> float
double -> double
chararray -> java.lang.String
bytearray -> byte[]
Groovy to Pig
groovy.lang.Tuple -> tuple
Object[] -> tuple
java.util.List -> bag
java.util.Map -> map
byte/short/int -> int
long/BigInteger -> long
float -> float
double/BigDecimal -> double
java.lang.String -> chararray
byte[] -> bytearray
30. ⊕
▪ Fast and rich data pipeline between z/OS and Hadoop
▪ Pig Toolbox to analyze COBOL/DB2 data alongside Thrift/MySQL/xSV/...
▪ Groovy Closure support for rapid extension
▪ Still some missing features
Pure Java compression codecs (JNI on z/OS anyone?)
Pig support for BigInteger / BigDecimal (2^45 might not be enough)
SSH(RSA) based auth tokens
▪ And yet another hard challenge: Cultural Change