Financial Data Infrastructure with HDF5
HDF5 is a file format and library that can address the growing need for data storage and management in financial institutions. It allows for (1) limitless data storage through efficient compression and configurable storage granularity, and (2) parallel data access and delivery. HDF5 is also infrastructure agnostic and can scale across heterogeneous environments. For financial market data and high-frequency trading applications, it has the potential to store the massive datasets needed for predictive analytics and algorithmic trading strategies.
2. HDF5 - The Big Idea
For IT Management
Infrastructure Independence
Parallel Data Delivery Configurable To Individual Data Set Granularity
Limitless Data Storage
Optimised Data Storage
Szip compression minimises disk usage/maximises revenue
Suited to heterogeneous environment
Virtual File Layer (VFL) ported to many platforms
A Solution to the ever growing “Data Storage” Issue
For Analytics/Market Data Developers
Potential to capture limitless market depth and generate limitless analytical models
Arbitrary precision, multidimensional and user defined data sets
Toolkits in many flavours, C, Java, Perl
High performance data access
Statistical analysis, 3D visualisation and pattern recognition become a reality
3. HDF5 - Virtual File Layer
Storage Mechanism is Under Application Control
e.g. network, memory, remote file systems, different file systems on the same machine, or to specify special-purpose I/O mechanisms such as streaming I/O, MPI I/O, and buffered I/O.
Public APIs so that application developers can write new drivers
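The driver idea above can be illustrated with a toy sketch that does not depend on the HDF5 library. Nothing here is HDF5's actual driver interface: the names `vfd_t`, `mem_file_t`, `mem_write`, `mem_read` and `store_ticks` are hypothetical. It only shows how application code can stay unchanged while the storage mechanism behind a table of function pointers is swapped.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical driver table: NOT the HDF5 driver interface, just the idea. */
typedef struct vfd {
    int (*write)(void *ctx, size_t off, const void *buf, size_t len);
    int (*read)(void *ctx, size_t off, void *buf, size_t len);
    void *ctx;                       /* driver-private state */
} vfd_t;

/* A memory-backed "file": one fixed-size buffer. */
typedef struct mem_file {
    unsigned char data[256];
} mem_file_t;

static int mem_write(void *ctx, size_t off, const void *buf, size_t len)
{
    mem_file_t *f = (mem_file_t *)ctx;
    if (off > sizeof f->data || len > sizeof f->data - off)
        return -1;                   /* request falls outside the buffer */
    memcpy(f->data + off, buf, len);
    return 0;
}

static int mem_read(void *ctx, size_t off, void *buf, size_t len)
{
    mem_file_t *f = (mem_file_t *)ctx;
    if (off > sizeof f->data || len > sizeof f->data - off)
        return -1;
    memcpy(buf, f->data + off, len);
    return 0;
}

/* Application code sees only the driver table, never the storage behind it. */
static int store_ticks(vfd_t *drv, const int *ticks, size_t n)
{
    assert(drv != NULL && ticks != NULL);
    return drv->write(drv->ctx, 0, ticks, n * sizeof *ticks);
}
```

A network or raw-disk driver would supply its own `write` and `read` functions and leave `store_ticks` untouched, which is the property the slide attributes to the VFL.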
4. HDF5 - Data Manipulation
Handling Big Data sets
Unix file system limit (32-bit) is 2GB - you have to write a program to shift data
HDF5 analogous to albumwrap
De-aggregation/Re-aggregation of Data Sets
Data set can be composed of many discrete sets
Easier manipulation of data
Sets can be aggregated to form a production set
Hyperslabs
Data subset selection
Transformations during I/O
2D spatial to 3D spatial
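What a hyperslab selects can be sketched without the library. The struct fields below are named after the `start`, `stride`, `count` and `block` arguments of HDF5's `H5Sselect_hyperslab`; the helper `slab_copy_2d` and the type `slab_dim_t` are hypothetical and only reproduce the index arithmetic on an in-memory, row-major array.

```c
#include <assert.h>
#include <stddef.h>

/* Selection along one dimension: `count` blocks of `block` elements,
 * the first at `start`, consecutive blocks `stride` elements apart. */
typedef struct slab_dim {
    size_t start, stride, count, block;
} slab_dim_t;

/* Copy the selected elements of a row-major rows x cols array into `out`,
 * packed contiguously. Returns the number of elements copied. */
static size_t slab_copy_2d(const int *src, size_t rows, size_t cols,
                           slab_dim_t r, slab_dim_t c, int *out)
{
    size_t n = 0;
    for (size_t rc = 0; rc < r.count; rc++)
        for (size_t rb = 0; rb < r.block; rb++) {
            size_t i = r.start + rc * r.stride + rb;   /* selected row */
            assert(i < rows);
            for (size_t cc = 0; cc < c.count; cc++)
                for (size_t cb = 0; cb < c.block; cb++) {
                    size_t j = c.start + cc * c.stride + cb; /* selected column */
                    assert(j < cols);
                    out[n++] = src[i * cols + j];
                }
        }
    return n;
}
```

For a 4 x 6 array, selecting rows `{start 1, stride 1, count 2, block 1}` and columns `{start 0, stride 2, count 3, block 1}` picks rows 1 and 2 at columns 0, 2 and 4, i.e. six elements.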
5. HDF5 - File Formats
Chunking
efficient parallel data access
Compression
Pluggable compression mechanism (Szip)
Extensible
Allow for expansion - data can be written later
Extensible along any dimension
Raw
Efficient local I/O bypassing the buffer cache
External
Taking advantage of different file systems (e.g. JBOD)
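The arithmetic behind chunked layout can be sketched as follows. This is only an illustration of how a data set is tiled and which tile an element falls in; the library's real on-disk chunk index is more involved, and the helper names `chunks_along` and `chunk_index_2d` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Number of chunks needed to cover `extent` elements (ceiling division). */
static size_t chunks_along(size_t extent, size_t chunk)
{
    assert(chunk > 0);
    return (extent + chunk - 1) / chunk;
}

/* Row-major index of the chunk that holds element (row, col) of a 2D
 * data set with `total_cols` columns, tiled in chunk_rows x chunk_cols. */
static size_t chunk_index_2d(size_t row, size_t col,
                             size_t chunk_rows, size_t chunk_cols,
                             size_t total_cols)
{
    size_t chunks_per_row = chunks_along(total_cols, chunk_cols);
    assert(chunk_rows > 0 && col < total_cols);
    return (row / chunk_rows) * chunks_per_row + col / chunk_cols;
}
```

Because each chunk is stored and compressed independently, readers working on different chunk indices touch different parts of the file, which is what makes chunking a natural fit for parallel access and for extending a data set later.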
6. The Changing Business and Technology Environment
The Move Towards Program Trading
Prediction of 70% of current activity will be automated
High Frequency Finance
Models require high frequency data
Internet Technologies
The Semantic Web: *ML, RDF, RSS
Encryption, Digital Value, Electronic Contracts, Business Intelligence, Pseudonymity
Emerging marketplaces, virtual economies and communities
Defence Technology Commercialisation
Traffic Analysis
Operational Research
Open Source Intelligence
Better, Faster Infrastructure and Software
Optical Networks
LAMP Architecture
COTS Hardware: Opteron, 1GE, SATA
7. HDF5 Feature Overview
HDF5 File Format
Public Domain, pioneered by the nuclear science community
Robust, mature, standards driven
Scalable Data Delivery, Efficient Storage, Data Transformation
Virtual File Layer supports “chunked” data sets
Raw, Standard, Parallel and Networked I/O
Bandwidth configurable per data set
Data type and spatial transforms of data or subsets during I/O
Szip - high performance compression/decompression
Infrastructure agnostic
Metadata approach
No specialised hardware required
Suitable for distributed/lightweight architectures: Grid, COTS
8. Today’s Infrastructure
Architecture in the City
Record orientated approach to data
Flat delimited files, ftp and databases
Time series - a black art
Poor communication between architecture, infrastructure and developers
Centralised/Cross-business infrastructure failures
Hardware/Manufacturer Specific Solutions
Generic, complex, high-cost, manufacturer lock-in, technological mediocrity
Designed to suit Manufacturer’s capability model
Lack of Technology Knowledge Management
Architecture is vendor and developer driven, should be experience and business driven.
The lack of a framework for converting tacit to explicit knowledge results in loss of operational expertise.
Solution: Development portals (CM, bug tracking, etc.), TWikis, blogging
9. Grid Infrastructure
Centralised Data Centres
Centralised data means centralised risk.
Extreme risk events would render business continuity planning ineffective.
Huge energy requirements (15MW, 5MW of which is cooling)
Data is Mobile - Not All Data Needs Enterprise Class Persistence
HDF5 makes it easy to forward cache static/reference data calved from master data sets
Regional/Divisional/Departmental/Workgroup
Real-time computational derivation using FPGAs and/or calculation farms
Reduced cost whilst maintaining regulatory compliance
Micro-hosting - Crisis Resilient Grid Architecture
Software chooses the most appropriate execution environment and marshals data accordingly
Each operational site has 20-30 low cost COTS nodes, minimal cooling, energy footprint up to 15KW, multiple network connections.
KNURR Secure 10/20KW Water cooled cabinets located across infrastructure [3]
10. HDF5 - Enabling High-Frequency Finance
HFF Drivers
Advances in statistical and Operational Research techniques
Improved access to data sets by academia.
Advances in computer infrastructure, reduction in storage costs
The need to find a trading edge.
Quantitative approach augmented by statistical analysis
HDF5 Global Data Repository For Raptor [2]
Project under way at Deutsche Bank, part of the Raptor infrastructure
Raptor requires massive data sets
Providing “early warning” predictive analysis
Pre transaction validation of complex derivatives transactions
Synthesised Data: yield curves, volatility surfaces (time/strike), vwap
News capture
Sentiment Analysis
Traffic Analysis
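Of the synthesised data listed above, VWAP (volume-weighted average price) is the simplest to state: the sum of price times volume divided by total volume. A minimal sketch, assuming trades arrive as two parallel arrays; the function name `vwap` and the convention of returning 0.0 when there is no volume are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* VWAP = sum(price[i] * volume[i]) / sum(volume[i]).
 * Returns 0.0 when the total volume is zero (an illustrative convention). */
static double vwap(const double *price, const double *volume, size_t n)
{
    double notional = 0.0;   /* running sum of price * volume */
    double total = 0.0;      /* running sum of volume */
    for (size_t i = 0; i < n; i++) {
        assert(volume[i] >= 0.0);
        notional += price[i] * volume[i];
        total += volume[i];
    }
    return total > 0.0 ? notional / total : 0.0;
}
```

Both running sums can be stored alongside the tick data, so a VWAP over a longer window can be derived from per-interval sums without rereading every trade.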
11. Statistical Data Modelling
Massive Data Sets
Required to determine the statistical significance of outliers in data
Visualisation
Point cloud identification of immediate outliers
relationships by parallel co-ordinate plotting
12. Statistical Data Modelling
Cluster Analysis
Useful to focus further analysis, and data cleansing.
Focus on regions in data space where objects are dense
Proprietary Filtering Algorithms
to determine number and significance of clusters
Density Estimation
Visualisation of stochastics
Optimal means to cluster
But - real-time computation is difficult
MATLAB
particularly rich in methods for this
HDF5 enabled
13. Time Series Data
Relationships Between Time Series Data Important
potentially profitable in High Frequency Finance
Used to determine ideal hedge etc.
Exploratory Analysis by Parallel Co-ordinate Plots
followed by Econometric methods to determine significance.
14. Point Clouds
Visualisation Technique for Large Data Sets
a slice of trades from a multidimensional trade data set.
Approximately 29K trades being displayed = 2 weeks (VOD).
15. Data Depth Perspective
Figures Are Humanly Incomprehensible
The World Produced in 1999 [1]:
1.5 exabytes (1 exabyte is roughly 2^60 bytes) of storable content - 1.5 billion gigabytes
250 megabytes for every man, woman, and child on earth.
Printed documents of all kinds make up only 0.003 percent of the total.
Magnetic storage is by far the largest medium for storing information and is the most rapidly growing
Shipped hard-drive capacity doubling every year.
Amount of human generated content - 5TB
Financial Market Data
LSE Basic set is 14GB for 2 Years (stock, shares, price, bid, ask, flags)
Market Depth + News + Traffic Analysis
VWAP + Volatilities
Many Terabytes required
16. HFF/HDF5 Infrastructure
Lightweight Infrastructure Using COTS Components
Pattern Recognition/DSS Node - ~$25,000 for 24TB JBOD Node
Sparse Data Analysis
Analytics/HDF5 node $2700
Data delivery, computation
“Throwaway nodes” - reduced hosting costs
18. HDF5 Data Distribution Architecture
Tier 1 - Master Data Sets
Enterprise grade persistence
Satisfy Data Retention Regulations:
Gramm-Leach-Bliley - security and confidentiality
Sarbanes-Oxley - the need for data retention
I/O profile per data set to suit predefined SLA
Tier 2 - Derived Regional Data sets
Geo-legislative Data Partitioning
Tier 3 - Divisional/Departmental Data Sets
Reduced infrastructure requirements
Forward Caching - data near point of consumption
Dataset Enrichment - pattern recognition, aggregation, data set generation
Tier 4 - Workstation
Spare-cycle computing
Specialist enrichment
20. HDF5 Tier Functionality
Tier 1 - Master Data Sets
Contiguous Tick Data Sets per exchange
Data wraps - aggregated small sets
Tier 2 - Derived Regional Data sets
Pre-partitioned by stock/year/month/week
Delivered to lightweight HDF5 near-network server nodes
Tick Data collection, Location-based analysis, aggregation and enhancement
Data set discovery using Directory Services
Tier 3 - Divisional/Departmental Data Sets
Forward Caching - data near point of consumption
Data set Enrichment - pattern recognition, aggregation, data set generation
DSS/Pattern Recognition nodes - JBOD Consumption - high performance I/O
HDF5 delivery to T4
Tier 4 - Workstation/Desktop Supercomputer
Spare-cycle computing
Data enrichment using specialist hardware/architecture (FPGA)
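The Tier 2 partitioning by stock/year/month/week maps naturally onto path names inside a file. The sketch below builds such a name; the layout `/<stock>/<year>/<month>/w<week>`, the week-of-month range 1 to 5, and the helper `tick_path` are all assumptions for illustration, not a layout given in the slides.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build "/<stock>/<year>/<month>/w<week>" into `out` (hypothetical layout).
 * Returns 0 on success, -1 if the buffer is too small. */
static int tick_path(char *out, size_t cap, const char *stock,
                     int year, int month, int week)
{
    int n;
    assert(stock != NULL && strlen(stock) > 0);
    assert(month >= 1 && month <= 12);
    assert(week >= 1 && week <= 5);      /* week of month, assumed 1..5 */
    n = snprintf(out, cap, "/%s/%04d/%02d/w%d", stock, year, month, week);
    return (n > 0 && (size_t)n < cap) ? 0 : -1;
}
```

With such a scheme a regional node can request exactly the partitions it needs, for example `/VOD/2004/06/w2`, rather than a whole contiguous tick data set.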
21. Data Discovery
Ontologies
An ontology is a conceptual model about some domain
Relationships that hold between them
Characteristics of data
Data set Description using Protégé and OWL
XML/RDF Metadata
Can forward generate Database and XML Schemas
Data Classification
WEKA - data classification suite written in Java
Pattern Recognition
News analysis
Envelope/Outlier analysis
23. Security Framework
Basic Access Control Primitives
Advisory security mechanism
Read-only, read-write etc - stored in file meta data
Host based access control
Global/Federated Operation Requires Third-party Access Control Manager
Xboost from www.ivis.com
Integration to enterprise directories - both .Net and LDAP
Service Orientated Architecture:
24. HDF5 Data Format
HDF5 File
is a container for storing a variety of scientific data, composed of two primary types of objects: groups and data sets.
HDF5 Group
a grouping structure containing zero or more HDF5 objects, together with supporting metadata
HDF5 Data Set
a multidimensional array of data elements, together with supporting metadata
Similar to working with directories and files in UNIX: an HDF5 object in an HDF5 file is often referred to by its full path name (also called an absolute path name).
/ signifies the root group.
/foo signifies a member of the root group called foo.
/foo/zoo signifies a member of the group foo, which in turn is a member of the root group.
HDF5 Attribute List
Any HDF5 group or data set may have an associated attribute list. An HDF5 attribute is a user-defined HDF5 structure that provides extra information about an HDF5 object.
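The path-name convention above can be made concrete with a small, library-independent sketch. The helpers `path_depth` and `path_leaf` are hypothetical; they only show how an absolute name such as `/foo/zoo` decomposes into a chain of group memberships ending in the object's own name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Number of components in an absolute path: "/" has 0 (the root group),
 * "/foo" has 1, "/foo/zoo" has 2. Returns -1 if the path is not absolute. */
static int path_depth(const char *path)
{
    int depth = 0;
    const char *p = path;
    if (path == NULL || path[0] != '/')
        return -1;
    while (*p) {
        while (*p == '/') p++;            /* skip separators */
        if (*p == '\0') break;
        depth++;
        while (*p && *p != '/') p++;      /* skip one member name */
    }
    return depth;
}

/* Copy the final component (the object's own name) into `out`.
 * Returns 0 on success, -1 on a relative path or a too-small buffer. */
static int path_leaf(const char *path, char *out, size_t cap)
{
    const char *slash;
    if (path == NULL || path[0] != '/')
        return -1;
    slash = strrchr(path, '/');           /* last separator always exists */
    if (strlen(slash + 1) >= cap)
        return -1;
    strcpy(out, slash + 1);
    return 0;
}
```

For `/foo/zoo` the depth is 2 and the leaf is `zoo`: `zoo` is a member of the group `foo`, which is itself a member of the root group.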
25. HDF5 Library
The HDF5 library
provides several interfaces, or APIs. These APIs provide routines for creating, accessing, and manipulating HDF5 files and objects.
The library itself is implemented in C.
HDF5 function wrappers have been developed in Java/Fortran90
All C routines in the HDF5 library begin with a prefix of the form H5*, where * is one or two uppercase letters indicating the type of object on which the function operates.
API
H5 - Library Functions: general-purpose H5 functions
H5A - Annotation Interface: attribute access and manipulation routines
H5D - Data set Interface: data set access and manipulation routines
H5E - Error Interface: error handling routines
H5F - File Interface: file access routines
H5G - Group Interface: group creation and operation routines
H5I - Identifier Interface: identifier routines
H5P - Property List Interface: object property list manipulation routines
H5R - Reference Interface: reference routines
H5S - Data space Interface: data space definition and access routines
H5T - Data type Interface: data type creation and manipulation routines
H5Z - Compression Interface: compression routine(s)
26. HDF5 File Creation
#include "hdf5.h"
#define FILE "dset.h5"

int main(void) {
    hid_t file_id, dataset_id, dataspace_id; /* identifiers */
    hsize_t dims[2];
    herr_t status;

    /* Create a new file using default properties. */
    file_id = H5Fcreate(FILE, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    /* Create the data space for the dataset: 4 rows by 6 columns. */
    dims[0] = 4;
    dims[1] = 6;
    dataspace_id = H5Screate_simple(2, dims, NULL);

    /* Create the dataset of 32-bit big-endian integers.
       H5Dcreate2 is the HDF5 1.8+ form of the call; the 1.6 API used
       H5Dcreate with five arguments (a single trailing H5P_DEFAULT). */
    dataset_id = H5Dcreate2(file_id, "/dset", H5T_STD_I32BE, dataspace_id,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* End access to the dataset and release resources used by it. */
    status = H5Dclose(dataset_id);

    /* Terminate access to the data space. */
    status = H5Sclose(dataspace_id);

    /* Close the file. */
    status = H5Fclose(file_id);

    return 0;
}
27. HDF5 Reading and Writing Existing Datasets
#include "hdf5.h"
#define FILE "dset.h5"

/* Writing and reading an existing dataset. */
int main(void) {
    hid_t file_id, dataset_id; /* identifiers */
    herr_t status;
    int i, j, dset_data[4][6];

    /* Initialise the dataset. */
    for (i = 0; i < 4; i++)
        for (j = 0; j < 6; j++)
            dset_data[i][j] = i * 6 + j + 1;

    /* Open an existing file. */
    file_id = H5Fopen(FILE, H5F_ACC_RDWR, H5P_DEFAULT);

    /* Open an existing dataset.
       H5Dopen2 is the HDF5 1.8+ form of the call; the 1.6 API used
       H5Dopen(file_id, "/dset"). */
    dataset_id = H5Dopen2(file_id, "/dset", H5P_DEFAULT);

    /* Write the dataset. */
    status = H5Dwrite(dataset_id, H5T_NATIVE_INT, H5S_ALL, H5S_ALL,
                      H5P_DEFAULT, dset_data);

    /* Read the dataset back. */
    status = H5Dread(dataset_id, H5T_NATIVE_INT, H5S_ALL, H5S_ALL,
                     H5P_DEFAULT, dset_data);

    /* Close the dataset. */
    status = H5Dclose(dataset_id);

    /* Close the file. */
    status = H5Fclose(file_id);

    return 0;
}
28. To Probe Further
HDF5 Home Page
http://hdf.ncsa.uiuc.edu/HDF5/
HDF5 Tutorial
http://www.physics.ohio-state.edu/~wilkins/computing/HDF/hdf5tutorial/index.html
Enhyper Knowledgebase
http://www.enhyper.com/lib
Many grid, finance and operational research related resources
29. References
[1] How Much Storage is Enough?
http://www.acmqueue.org/modules.php?name=Content&pa=showpage&pid=45
[2] Different Shades of Meaning in the Stock Market
http://www.enhyper.com/content/kerrraptor.jpg
[3] Knurr 10/20KW Water-cooled Environments
http://www.water-cooled-server-rack.com/