Hdf5

Financial Data Infrastructure
with HDF5

Graeme Burnett
Head of Analytics Architecture, Deutsche Bank
Sep 2004

HDF5 - The Big Idea

For IT Management
Infrastructure Independence
Parallel Data Delivery Configurable To Individual Data Set Granularity
Limitless Data Storage
Optimised Data Storage
Szip compression minimises disk usage/maximises revenue
Suited to heterogeneous environment
Virtual File Layer (VFL) ported to many platforms
A Solution to the ever growing “Data Storage” Issue

For Analytics/Market Data Developers
Potential to capture limitless market depth and generate limitless analytical models
Arbitrary precision, multidimensional and user defined data sets
Toolkits in many flavours: C, Java, Perl
High performance data access
Statistical analysis, 3D visualisation and pattern recognition become a reality

HDF5 - Virtual File Layer

Storage Mechanism is Under Application Control
e.g. network, memory, remote file systems, or different file systems on the same machine
Can also specify special-purpose I/O mechanisms such as streaming I/O, MPI I/O, and
buffered I/O
Public APIs so that application developers can write new drivers

HDF5 - Data Manipulation

Handling Big Data sets
32-bit Unix file systems limit a file to 2GB - beyond that you have to write a program to shift data
HDF5 analogous to albumwrap

De-aggregation/Re-aggregation of Data Sets
Data set can be composed of many discrete sets
Easier manipulation of data
Sets can be aggregated to form a production set

Hyperslabs
Data subset selection
Transformations during I/O
2D spatial to 3D spatial
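
The selection idea behind a hyperslab can be sketched in plain C without the library. The example below is not the HDF5 API (there the selection would be made on a data space with H5Sselect_hyperslab); the hyperslab2d struct and hyperslab_copy function are hypothetical names, and the block size is assumed to be 1 for brevity.

```c
#include <stddef.h>

/* A 2D hyperslab in the terms HDF5 uses:
 *   start  - offset of the first selected element in each dimension
 *   stride - distance between selected elements in each dimension
 *   count  - number of elements selected in each dimension
 * Block size is fixed at 1 in this sketch. */
typedef struct {
    size_t start[2];
    size_t stride[2];
    size_t count[2];
} hyperslab2d;

/* Copy the selected elements of a row-major rows x cols array into out.
 * Returns the number of elements copied, or 0 if the selection is empty,
 * has a zero stride, or falls outside the array. */
size_t hyperslab_copy(const int *src, size_t rows, size_t cols,
                      const hyperslab2d *sel, int *out)
{
    if (sel->count[0] == 0 || sel->count[1] == 0)
        return 0;
    if (sel->stride[0] == 0 || sel->stride[1] == 0)
        return 0;

    size_t last_row = sel->start[0] + (sel->count[0] - 1) * sel->stride[0];
    size_t last_col = sel->start[1] + (sel->count[1] - 1) * sel->stride[1];
    if (last_row >= rows || last_col >= cols)
        return 0;

    size_t n = 0;
    for (size_t i = 0; i < sel->count[0]; i++) {
        size_t r = sel->start[0] + i * sel->stride[0];
        for (size_t j = 0; j < sel->count[1]; j++) {
            size_t c = sel->start[1] + j * sel->stride[1];
            out[n++] = src[r * cols + c];
        }
    }
    return n;
}
```

With a 4 x 6 array, a selection of start {1, 2}, stride {1, 2}, count {2, 2} picks rows 1 and 2 at columns 2 and 4, i.e. four elements.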

HDF5 - File Formats

Chunking
efficient parallel data access

Compression
Pluggable compression mechanism (Szip)

Extensible
Allow for expansion - data can be written later
Extensible along any dimension

Raw
Efficient local I/O bypassing the buffer cache

External
Taking advantage of different file systems (e.g. JBOD)

The Changing Business and Technology Environment

The Move Towards Program Trading
Prediction: 70% of current activity will be automated
High Frequency Finance
Models require high frequency data

Internet Technologies
The Semantic Web: *ML, RDF, RSS
Encryption, Digital Value, Electronic Contracts, Business Intelligence, Pseudonymity
Emerging marketplaces, virtual economies and communities

Defence Technology Commercialisation
Traffic Analysis
Operational Research
Open Source Intelligence

Better, Faster Infrastructure and Software
Optical Networks
LAMP Architecture
COTS Hardware: Opteron, 1GE, SATA

HDF5 Feature Overview

HDF5 File Format
Public Domain, pioneered by the nuclear science community
Robust, mature, standards driven

Scalable Data Delivery, Efficient Storage, Data Transformation
Virtual File Layer supports “chunked” data sets
Raw, Standard, Parallel and Networked I/O
Bandwidth configurable per data set
Data type and spatial transforms of data or subsets during I/O
Szip - high performance compression/decompression

Infrastructure agnostic
Metadata approach
No specialised hardware required
Suitable for distributed/lightweight architectures: Grid, COTS

Today’s Infrastructure

Architecture in the City
Record orientated approach to data
Flat delimited files, ftp and databases
Time series - a black art
Poor communication between architecture, infrastructure and developers
Centralised/Cross-business infrastructure failures

Hardware/Manufacturer Specific Solutions
Generic, complex, high-cost, manufacturer lock-in, technological mediocrity
Designed to suit Manufacturer’s capability model

Lack of Technology Knowledge Management
Architecture is vendor and developer driven, should be experience and business driven.
No framework for conversion of tacit to explicit knowledge results in loss of operational
expertise.
Solution: Development portals (CM, bug tracking, etc.), TWikis, blogging

Grid Infrastructure

Centralised Data Centres
Centralised data means centralised risk.
Extreme risk events would render business continuity planning ineffective.
Huge energy requirements (15MW, 5MW of which is cooling)

Data is Mobile - Not All Data Needs Enterprise Class Persistence
HDF5 makes it easy to forward cache static/reference data calved from master data
sets
Regional/Divisional/Departmental/Workgroup
Real-time computational derivation using FPGAs and/or calculation farms
Reduced cost whilst maintaining regulatory compliance

Micro-hosting - Crisis Resilient Grid Architecture
Software chooses the most appropriate execution environment and marshals data
accordingly
Each operational site has 20-30 low-cost COTS nodes, minimal cooling, an energy
footprint of up to 15KW, and multiple network connections.
KNURR Secure 10/20KW Water cooled cabinets located across infrastructure [3]

HDF5 - Enabling High-Frequency Finance

HFF Drivers
Advances in statistical and Operational Research techniques
Improved access to data sets by academia.
Advances in computer infrastructure, reduction in storage costs
The need to find a trading edge.
Quantitative approach augmented by statistical analysis

HDF5 Global Data Repository For Raptor [2]
Project under way at Deutsche Bank, part of the Raptor infrastructure
Raptor requires massive data sets
Providing “early warning” predictive analysis
Pre transaction validation of complex derivatives transactions
Synthesised Data: yield curves, volatility surfaces (time/strike), vwap
News capture
Sentiment Analysis
Traffic Analysis
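
Of the synthesised series listed above, VWAP (volume-weighted average price) is the simplest to state: the sum of price times volume divided by the sum of volume. The sketch below is a plain C illustration; the trade struct and the vwap function are hypothetical names, not part of Raptor or of HDF5.

```c
#include <stddef.h>

/* One executed trade: price and traded volume (illustrative layout). */
typedef struct {
    double price;
    double volume;
} trade;

/* Volume-weighted average price over n trades:
 *   VWAP = sum(price * volume) / sum(volume)
 * Returns 0.0 when there is no traded volume. */
double vwap(const trade *trades, size_t n)
{
    double notional = 0.0;
    double volume   = 0.0;

    for (size_t i = 0; i < n; i++) {
        notional += trades[i].price * trades[i].volume;
        volume   += trades[i].volume;
    }
    return volume > 0.0 ? notional / volume : 0.0;
}
```

For example, 10 shares at 100 and 30 shares at 102 give 4060 / 40 = 101.5.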

Statistical Data Modelling

Massive Data Sets
Required to determine the statistical significance of outliers in data
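
The slides do not say which test is used to judge outliers. As one common starting point, and purely as an assumption for illustration, the sketch below flags samples that lie more than k sample standard deviations from the mean. It compares squared quantities so no square root is needed; count_outliers is a hypothetical name.

```c
#include <stddef.h>

/* Count the samples whose distance from the mean exceeds k sample
 * standard deviations. The test (x - mean)^2 > k^2 * variance is
 * equivalent to |x - mean| > k * stddev. Returns 0 for n < 2. */
size_t count_outliers(const double *x, size_t n, double k)
{
    if (n < 2)
        return 0;

    double mean = 0.0;
    for (size_t i = 0; i < n; i++)
        mean += x[i];
    mean /= (double)n;

    double ss = 0.0;                       /* sum of squared deviations */
    for (size_t i = 0; i < n; i++) {
        double d = x[i] - mean;
        ss += d * d;
    }
    double variance = ss / (double)(n - 1);

    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        double d = x[i] - mean;
        if (d * d > k * k * variance)
            count++;
    }
    return count;
}
```

With small samples a single extreme value inflates the variance and can hide itself, which is one reason large data sets matter for this kind of test.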

Visualisation
Point cloud identification of immediate outliers
relationships by parallel co-ordinate plotting

Statistical Data Modelling

Cluster Analysis
Useful to focus further analysis, and data cleansing.
Focus on regions in data space where objects are dense

Proprietary Filtering Algorithms
to determine number and significance of clusters

Density Estimation
Visualisation of stochastics
Optimal means to cluster
But - real-time computation is difficult

MATLAB
particularly rich in methods for this
HDF5 enabled

Time Series Data

Relationships Between Time Series Data Important
potentially profitable in High Frequency Finance
Used to determine ideal hedge etc.

Exploratory Analysis by Parallel Co-ordinate Plots
followed by Econometric methods to determine significance.

Point Clouds

Visualisation Technique for Large Data Sets
a slice of trades from a multidimensional trade data set.
Approximately 29K trades being displayed = 2 weeks (VOD).

Data Depth Perspective

Figures Are Humanly Incomprehensible

The World Produced in 1999 [1]:
1.5 exabytes (1 exabyte = 2^60 bytes) of storable content - 1.5 billion gigabytes
250 megabytes for every man, woman, and child on earth.
Printed documents of all kinds make up only 0.003 percent of the total.
Magnetic storage is by far the largest medium for storing information and is the most
rapidly growing
Shipped hard-drive capacity doubling every year.
Amount of human generated content - 5TB

Financial Market Data
LSE Basic set is 14GB for 2 Years (stock, shares, price, bid, ask, flags)
Market Depth + News + Traffic Analysis
VWAP + Volatilities
Many Terabytes required

HFF/HDF5 Infrastructure

Lightweight Infrastructure Using COTS Components
Pattern Recognition/DSS Node - ~$25,000 for 24TB JBOD Node
Sparse Data Analysis
Analytics/HDF5 node $2700
Data delivery, computation
“Throwaway nodes” - reduced hosting costs

HDF5 Data Distribution Architecture

Tier 1 - Master Data Sets
Enterprise grade persistence
Satisfy Data Retention Regulations:
Gramm-Leach-Bliley - security and confidentiality
Sarbanes-Oxley - the need for data retention
I/O profile per data set to suit predefined SLA

Tier 2 - Derived Regional Data sets
Geo-legislative Data Partitioning

Tier 3 - Divisional/Departmental Data Sets
Reduced infrastructure requirements
Forward Caching - data near point of consumption
Dataset Enrichment - pattern recognition, aggregation, data set generation

Tier 4 - Workstation
Spare-cycle computing
Specialist enrichment

HDF5 Tier Functionality

Tier 1 - Master Data Sets
Contiguous Tick Data Sets per exchange
Data wraps - aggregated small sets

Tier 2 - Derived Regional Data sets
Pre-partitioned by stock/year/month/week
Delivered to lightweight HDF5 near-network server nodes
Tick Data collection, Location-based analysis, aggregation and enhancement
Data set discovery using Directory Services
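
Partitioning by stock/year/month/week maps naturally onto HDF5 path names. The sketch below builds such a path in plain C. The layout "/exchange/symbol/year/month/wN" and the function name partition_path are assumptions for illustration; the slides do not specify the actual naming scheme.

```c
#include <stdio.h>
#include <stddef.h>

/* Build a partition path of the (assumed) form
 *   /<exchange>/<symbol>/<yyyy>/<mm>/w<week>
 * into buf. Returns 0 on success, -1 if the path did not fit or
 * formatting failed. */
int partition_path(char *buf, size_t len,
                   const char *exchange, const char *symbol,
                   int year, int month, int week)
{
    int n = snprintf(buf, len, "/%s/%s/%04d/%02d/w%d",
                     exchange, symbol, year, month, week);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

A regional node asked for the second week of September 2004 for VOD on the LSE would then look up /LSE/VOD/2004/09/w2.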

Tier 3 - Divisional/Departmental Data Sets
Forward Caching - data near point of consumption
Data set Enrichment - pattern recognition, aggregation, data set generation
DSS/Pattern Recognition nodes - JBOD Consumption - high performance I/O
HDF5 delivery to T4

Tier 4 - Workstation/Desktop Supercomputer
Spare-cycle computing
Data enrichment using specialist hardware/architecture (FPGA)

Data Discovery

Ontologies
An ontology is a conceptual model about some domain
Relationships that hold between them
Characteristics of data

Data set Description using Protégé and OWL
XML/RDF Metadata
Can forward generate database and XML schemas

Data Classification
WEKA - data classification suite written in Java
Pattern Recognition
News analysis
Envelope/Outlier analysis

Security Framework

Basic Access Control Primitives
Advisory security mechanism
Read-only, read-write etc. - stored in file metadata
Host based access control

Global/Federated Operation Requires Third-party Access Control Manager
Xboost from www.ivis.com
Integration to enterprise directories - both .Net and LDAP
Service Orientated Architecture:

HDF5 Data Format

HDF5 File
is a container for storing a variety of scientific data; it is composed of two primary types of objects:
groups and data sets.

HDF5 Group
a grouping structure containing zero or more HDF5 objects, together with supporting metadata

HDF5 Data Set
a multidimensional array of data elements, together with supporting metadata
Similar to working with directories and files in UNIX: an HDF5 object in an HDF5 file is often referred to by its full path
name (also called an absolute path name).
/ signifies the root group.
/foo signifies a member of the root group called foo.
/foo/zoo signifies a member of the group foo, which in turn is a member of the root group.

HDF5 Attribute List
Any HDF5 group or data set may have an associated attribute list. An HDF5 attribute is a
user-defined HDF5 structure that provides extra information about an HDF5 object.

HDF5 Library

The HDF5 library
provides several interfaces, or APIs. These APIs provide routines for creating, accessing, and
manipulating HDF5 files and objects.
The library itself is implemented in C.
HDF5 function wrappers have been developed in Java and Fortran 90
All C routines in the HDF5 library begin with a prefix of the form H5*, where * is one or two
uppercase letters indicating the type of object on which the function operates.

API
H5 - Library Functions: general-purpose H5 functions
H5A - Annotation Interface: attribute access and manipulation routines
H5D - Data set Interface: data set access and manipulation routines
H5E - Error Interface: error handling routines
H5F - File Interface: file access routines
H5G - Group Interface: group creation and operation routines
H5I - Identifier Interface: identifier routines
H5P - Property List Interface: object property list manipulation routines
H5R - Reference Interface: reference routines
H5S - Data space Interface: data space definition and access routines
H5T - Data type Interface: data type creation and manipulation routines
H5Z - Compression Interface: compression routine(s)

HDF5 File Creation

#include "hdf5.h"
#define FILE "dset.h5"

int main(void) {
    hid_t file_id, dataset_id, dataspace_id; /* identifiers */
    hsize_t dims[2];
    herr_t status;

    /* Create a new file using default properties. */
    file_id = H5Fcreate(FILE, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    /* Create the data space for the dataset: a 4 x 6 array. */
    dims[0] = 4;
    dims[1] = 6;
    dataspace_id = H5Screate_simple(2, dims, NULL);

    /* Create the dataset. This is the HDF5 1.6 signature; with HDF5 1.8 or
       later, build with -DH5_USE_16_API or call H5Dcreate2 instead. */
    dataset_id = H5Dcreate(file_id, "/dset", H5T_STD_I32BE, dataspace_id,
                           H5P_DEFAULT);

    /* End access to the dataset and release resources used by it. */
    status = H5Dclose(dataset_id);

    /* Terminate access to the data space. */
    status = H5Sclose(dataspace_id);

    /* Close the file. */
    status = H5Fclose(file_id);

    return status < 0 ? 1 : 0;
}

HDF5 Reading and Writing Existing Datasets

#include "hdf5.h"
#define FILE "dset.h5"

/* Writing and reading an existing dataset. */
int main(void) {
    hid_t file_id, dataset_id; /* identifiers */
    herr_t status;
    int i, j, dset_data[4][6];

    /* Initialise the dataset. */
    for (i = 0; i < 4; i++)
        for (j = 0; j < 6; j++)
            dset_data[i][j] = i * 6 + j + 1;

    /* Open an existing file. */
    file_id = H5Fopen(FILE, H5F_ACC_RDWR, H5P_DEFAULT);

    /* Open an existing dataset. This is the HDF5 1.6 signature; with HDF5 1.8
       or later, build with -DH5_USE_16_API or call H5Dopen2 instead. */
    dataset_id = H5Dopen(file_id, "/dset");

    /* Write the dataset. */
    status = H5Dwrite(dataset_id, H5T_NATIVE_INT, H5S_ALL, H5S_ALL,
                      H5P_DEFAULT, dset_data);

    /* Read the dataset back. */
    status = H5Dread(dataset_id, H5T_NATIVE_INT, H5S_ALL, H5S_ALL,
                     H5P_DEFAULT, dset_data);

    /* Close the dataset. */
    status = H5Dclose(dataset_id);

    /* Close the file. */
    status = H5Fclose(file_id);

    return status < 0 ? 1 : 0;
}

To Probe Further

HDF5 Home Page
http://hdf.ncsa.uiuc.edu/HDF5/

HDF5 Tutorial
http://www.physics.ohio-state.edu/~wilkins/computing/HDF/hdf5tutorial/index.html

Enhyper Knowledgebase
http://www.enhyper.com/lib
Many grid, finance and operational research related resources

References

[1] How Much Storage is Enough?
http://www.acmqueue.org/modules.php?name=Content&pa=showpage&pid=45

[2] Different Shades of Meaning in the Stock Market
http://www.enhyper.com/content/kerrraptor.jpg

[3] Knurr 10/20KW Water-cooled Environments
http://www.water-cooled-server-rack.com/
