Financial Data Infrastructure
                               with HDF5

Graeme Burnett
Head of Analytics Architecture, Deutsche Bank
Sep 2004
HDF5 - The Big Idea


    For IT Management
          Infrastructure Independence
          Parallel Data Delivery Configurable To Individual Data Set Granularity
          Limitless Data Storage
          Optimised Data Storage
                 Szip compression minimises disk usage/maximises revenue
          Suited to heterogeneous environment
                 Virtual File Layer (VFL) ported to many platforms
          A Solution to the ever growing “Data Storage” Issue

    For Analytics/Market Data Developers
         Potential to capture limitless market depth and generate limitless analytical models
         Arbitrary precision, multidimensional and user defined data sets
         Toolkits in many flavours, C, Java, Perl
         High performance data access
                Statistical analysis, 3D visualisation and pattern recognition become a reality
HDF5 - Virtual File Layer




    Storage Mechanism is Under Application Control
         e.g. network, memory, remote file systems, or different file systems on the same machine
         Special-purpose I/O mechanisms can also be specified, such as streaming I/O, MPI I/O, and
         buffered I/O.
         Public APIs so that application developers can write new drivers
HDF5 - Data Manipulation


    Handling Big Data sets
         Unix file system limit (32-bit) is 2GB - you have to write a program to shift data
         HDF5 analogous to albumwrap

    De-aggregation/Re-aggregation of Data Sets
         Data set can be composed of many discrete sets
         Easier manipulation of data
         Sets can be aggregated to form a production set

    Hyperslabs
         Data subset selection
         Transformations during I/O
                2D spatial to 3D spatial
HDF5 - File Formats


    Chunking
          efficient parallel data access

    Compression
          Pluggable compression mechanism (Szip)

    Extensible
          Allow for expansion - data can be written later
          Extensible along any dimension

    Raw
          Efficient local I/O bypassing the buffer cache

    External
          Taking advantage of different file systems (e.g. JBOD)
The Changing Business and Technology Environment


    The Move Towards Program Trading
         Prediction: 70% of current activity will be automated
         High Frequency Finance
               Models require high frequency data

    Internet Technologies
         The Semantic Web: *ML, RDF, RSS
         Encryption, Digital Value, Electronic Contracts, Business Intelligence, Pseudonymity
         Emerging marketplaces, virtual economies and communities

    Defence Technology Commercialisation
         Traffic Analysis
         Operational Research
         Open Source Intelligence

    Better, Faster Infrastructure and Software
         Optical Networks
         LAMP Architecture
         COTS Hardware: Opteron, 1GE, SATA
HDF5 Feature Overview


    HDF5 File Format
         Public Domain, pioneered by the nuclear science community
         Robust, mature, standards driven

    Scalable Data Delivery, Efficient Storage, Data Transformation
         Virtual File Layer supports “chunked” data sets
         Raw, Standard, Parallel and Networked I/O
         Bandwidth configurable per data set
         Data type and spatial transforms of data or subsets during I/O
         Szip - high performance compression/decompression

    Infrastructure agnostic
         Metadata approach
         No specialised hardware required
         Suitable for distributed/lightweight architectures: Grid, COTS
Today’s Infrastructure


    Architecture in the City
         Record orientated approach to data
               Flat delimited files, ftp and databases
               Time series - a black art
         Poor communication between architecture, infrastructure and developers
         Centralised/Cross-business infrastructure failures

    Hardware/Manufacturer Specific Solutions
         Generic, complex, high-cost, manufacturer lock-in, technological mediocrity
         Designed to suit Manufacturer’s capability model

    Lack of Technology Knowledge Management
         Architecture is vendor and developer driven, should be experience and business driven.
         No framework for conversion of tacit to explicit knowledge results in loss of operational
         expertise.
         Solution: Development portals (CM, bug tracking, etc.), TWikis, blogging
Grid Infrastructure


     Centralised Data Centres
          Centralised data means centralised risk.
          Extreme risk events would render business continuity planning ineffective.
          Huge energy requirements (15MW, 5MW of which is cooling)

     Data is Mobile - Not All Data Needs Enterprise Class Persistence
          HDF5 makes it easy to forward cache static/reference data calved from master data
          sets
                Regional/Divisional/Departmental/Workgroup
          Real-time computational derivation using FPGAs and/or calculation farms
          Reduced cost whilst maintaining regulatory compliance

     Micro-hosting - Crisis Resilient Grid Architecture
          Software chooses the most appropriate execution environment and marshals data
          accordingly
          Each operational site has 20-30 low cost COTS nodes, minimal cooling, an energy
          footprint of up to 15KW and multiple network connections.
          KNURR Secure 10/20KW Water cooled cabinets located across infrastructure [3]
HDF5 - Enabling High-Frequency Finance


    HFF Drivers
         Advances in statistical and Operational Research techniques
         Improved access to data sets by academia.
         Advances in computer infrastructure, reduction in storage costs
         The need to find a trading edge.
         Quantitative approach augmented by statistical analysis

    HDF5 Global Data Repository For Raptor [2]
         Project under way at Deutsche Bank, part of the Raptor infrastructure
                Raptor requires massive data sets
         Providing “early warning” predictive analysis
         Pre transaction validation of complex derivatives transactions
         Synthesised Data: yield curves, volatility surfaces (time/strike), vwap
         News capture
         Sentiment Analysis
         Traffic Analysis
Statistical Data Modelling


    Massive Data Sets
         Required to determine the statistical significance of outliers in data

    Visualisation
         Point cloud identification of immediate outliers
         relationships by parallel co-ordinate plotting
Statistical Data Modelling


    Cluster Analysis
         Useful to focus further analysis, and data cleansing.
         Focus on regions in data space where objects are dense

    Proprietary Filtering Algorithms
         to determine number and significance of clusters

    Density Estimation
         Visualisation of stochastics
         Optimal means to cluster
         But - real-time computation is difficult

    MATLAB
         particularly rich in methods for this
         HDF5 enabled
Time Series Data


    Relationships Between Time Series Data Important
        potentially profitable in High Frequency Finance
        Used to determine ideal hedge etc.

    Exploratory Analysis by Parallel Co-ordinate Plots
        followed by Econometric methods to determine significance.
Point Clouds


    Visualisation Technique for Large Data Sets
         a slice of trades from a multidimensional trade data set.
         Approximately 29K trades being displayed = 2 weeks (VOD).
Data Depth Perspective

     Figures Are Humanly Incomprehensible

     The World Produced in 1999 [1]:
         1.5 exabytes (an exabyte is roughly 2^60 bytes) of storable content - 1.5 billion gigabytes
         250 megabytes for every man, woman, and child on earth.
         Printed documents of all kinds make up only 0.003 percent of the total.
         Magnetic storage is by far the largest medium for storing information and is the most
         rapidly growing
         Shipped hard-drive capacity doubling every year.
         Amount of human generated content - 5TB
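
     Checking the Per-person Figure
         A quick arithmetic check, not taken from [1]: dividing 1.5 exabytes by a world
         population of about 6 billion (an assumption for 1999) gives the quoted 250
         megabytes per person. Decimal units are used, matching "1.5 billion gigabytes".
         The function name is hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper: storage per person in decimal megabytes. */
double megabytes_per_person(double total_bytes, double population)
{
    return total_bytes / population / 1.0e6;
}

/* Prints the per-person figure for 1.5 EB shared among 6 billion people. */
void print_per_person_figure(void)
{
    double total_bytes = 1.5e18;  /* 1.5 exabytes, decimal                */
    double population = 6.0e9;    /* assumed world population, circa 1999 */
    printf("%.0f MB per person\n",
           megabytes_per_person(total_bytes, population));
}
```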

     Financial Market Data
         LSE Basic set is 14GB for 2 Years (stock, shares, price, bid, ask, flags)
         Market Depth + News + Traffic Analysis
         VWAP + Volatilities
         Many Terabytes required
HFF/HDF5 Infrastructure


    Lightweight Infrastructure Using COTS Components
         Pattern Recognition/DSS Node - ~$25,000 for 24TB JBOD Node
               Sparse Data Analysis
         Analytics/HDF5 node $2700
               Data delivery, computation
         “Throwaway nodes” - reduced hosting costs
Data Topography
HDF5 Data Distribution Architecture


    Tier 1 - Master Data Sets
         Enterprise grade persistence
         Satisfy Data Retention Regulations:
               Gramm-Leach-Bliley - security and confidentiality
               Sarbanes-Oxley - the need for data retention
         I/O profile per data set to suit predefined SLA

    Tier 2 - Derived Regional Data sets
         Geo-legislative Data Partitioning

    Tier 3 - Divisional/Departmental Data Sets
         Reduced infrastructure requirements
         Forward Caching - data near point of consumption
         Dataset Enrichment - pattern recognition, aggregation, data set generation

    Tier 4 - Workstation
         Spare-cycle computing
         Specialist enrichment
HDF5 Data Delivery
HDF5 Tier Functionality


    Tier 1 - Master Data Sets
         Contiguous Tick Data Sets per exchange
         Data wraps - aggregated small sets

    Tier 2 - Derived Regional Data sets
         Pre-partitioned by stock/year/month/week
         Delivered to lightweight HDF5 near-network server nodes
         Tick Data collection, Location-based analysis, aggregation and enhancement
         Data set discovery using Directory Services

    Tier 3 - Divisional/Departmental Data Sets
         Forward Caching - data near point of consumption
         Data set Enrichment - pattern recognition, aggregation, data set generation
         DSS/Pattern Recognition nodes - JBOD Consumption - high performance I/O
         HDF5 delivery to T4

    Tier 4 - Workstation/Desktop Supercomputer
         Spare-cycle computing
         Data enrichment using specialist hardware/architecture (FPGA)
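
    Sketch: Tier 2 partition naming
         One way the stock/year/month/week pre-partitioning of Tier 2 might map onto HDF5
         group paths. The layout, the "w" prefix, the ISO week numbering and the function
         name are all assumptions for illustration; the deck does not specify a naming
         scheme.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: builds a path such as /VOD/2004/09/w37 for one
 * weekly partition. Returns the path length, or -1 if the arguments are
 * invalid or the buffer is too small. */
int tick_group_path(char *buf, size_t size, const char *stock,
                    int year, int month, int week)
{
    int n;

    if (buf == NULL || size == 0 || stock == NULL || strlen(stock) == 0)
        return -1;
    if (month < 1 || month > 12 || week < 1 || week > 53)
        return -1;

    /* One path component per partitioning level. */
    n = snprintf(buf, size, "/%s/%04d/%02d/w%02d", stock, year, month, week);
    if (n < 0 || (size_t)n >= size)
        return -1;  /* output error or truncation */
    return n;
}
```

         A path built this way could be passed to the group and data set calls of the HDF5
         library, so that a regional node opens only the partitions it has been sent.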
Data Discovery


    Ontologies
         An ontology is a conceptual model about some domain
         Relationships that hold between them
         Characteristics of data

    Data set Description using Protégé and OWL
         XML/RDF Metadata
         Can forward generate database and XML schemas

    Data Classification
         WEKA - data classification suite written in Java
         Pattern Recognition
         News analysis
         Envelope/Outlier analysis
Ontology Modelling
Security Framework

   Basic Access Control Primitives
        Advisory security mechanism
        Read-only, read-write etc. - stored in file metadata
        Host based access control

   Global/Federated Operation Requires Third-party Access Control Manager
        Xboost from www.ivis.com
        Integration to enterprise directories - both .Net and LDAP
        Service Orientated Architecture:
HDF5 Data Format

    HDF5 File
         is a container for storing a variety of scientific data and is composed of two primary types of objects:
         groups and data sets.

    HDF5 Group
         a grouping structure containing zero or more HDF5 objects, together with supporting metadata

    HDF5 Data Set
         a multidimensional array of data elements, together with supporting metadata
         Similar to working with directories and files in UNIX: an HDF5 object in an HDF5 file is often referred to by its full path
         name (also called an absolute path name).
         / signifies the root group.
         /foo signifies a member of the root group called foo.
         /foo/zoo signifies a member of the group foo, which in turn is a member of the root group.

    HDF5 Attribute List
         Any HDF5 group or data set may have an associated attribute list. An HDF5 attribute is a user-
         defined HDF5 structure that provides extra information about an HDF5 object.
HDF5 Library


    The HDF5 library
          provides several interfaces, or APIs. These APIs provide routines for creating, accessing, and
          manipulating HDF5 files and objects.
          The library itself is implemented in C.
          HDF5 function wrappers have been developed in Java/Fortran90
          All C routines in the HDF5 library begin with a prefix of the form H5*, where * is one or two
          uppercase letters indicating the type of object on which the function operates.

    API
          H5 - Library Functions: general-purpose H5 functions
          H5A - Annotation Interface: attribute access and manipulation routines
          H5D - Data set Interface: data set access and manipulation routines
          H5E - Error Interface: error handling routines
          H5F - File Interface: file access routines
          H5G - Group Interface: group creation and operation routines
          H5I - Identifier Interface: identifier routines
          H5P - Property List Interface: object property list manipulation routines
          H5R - Reference Interface: reference routines
          H5S - Data space Interface: data space definition and access routines
          H5T - Data type Interface: data type creation and manipulation routines
          H5Z - Compression Interface: compression routine(s)
HDF5 File Creation

#include "hdf5.h"
#define FILE "dset.h5"

int main(void) {
    hid_t file_id, dataset_id, dataspace_id; /* identifiers */
    hsize_t dims[2];
    herr_t status;

    /* Create a new file using default properties. */
    file_id = H5Fcreate(FILE, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    /* Create the data space for the dataset. */
    dims[0] = 4;
    dims[1] = 6;
    dataspace_id = H5Screate_simple(2, dims, NULL);

    /* Create the dataset. This is the five-argument HDF5 1.6 form; in
       HDF5 1.8 and later the equivalent call is H5Dcreate2, which takes
       two further property lists. */
    dataset_id = H5Dcreate(file_id, "/dset", H5T_STD_I32BE, dataspace_id,
                           H5P_DEFAULT);

    /* End access to the dataset and release resources used by it. */
    status = H5Dclose(dataset_id);

    /* Terminate access to the data space. */
    status = H5Sclose(dataspace_id);

    /* Close the file. */
    status = H5Fclose(file_id);

    return (status < 0) ? 1 : 0;
}
HDF5 Reading and Writing Existing Datasets

/*
 * Writing and reading an existing dataset.
 */
#include "hdf5.h"
#define FILE "dset.h5"

int main(void) {
    hid_t file_id, dataset_id; /* identifiers */
    herr_t status;
    int i, j, dset_data[4][6];

    /* Initialise the dataset. */
    for (i = 0; i < 4; i++)
        for (j = 0; j < 6; j++)
            dset_data[i][j] = i * 6 + j + 1;

    /* Open an existing file. */
    file_id = H5Fopen(FILE, H5F_ACC_RDWR, H5P_DEFAULT);

    /* Open an existing dataset. This is the two-argument HDF5 1.6 form;
       in HDF5 1.8 and later the equivalent call is H5Dopen2, which takes
       a further property list. */
    dataset_id = H5Dopen(file_id, "/dset");

    /* Write the dataset. */
    status = H5Dwrite(dataset_id, H5T_NATIVE_INT, H5S_ALL, H5S_ALL,
                      H5P_DEFAULT, dset_data);

    /* Read the dataset back. */
    status = H5Dread(dataset_id, H5T_NATIVE_INT, H5S_ALL, H5S_ALL,
                     H5P_DEFAULT, dset_data);

    /* Close the dataset. */
    status = H5Dclose(dataset_id);

    /* Close the file. */
    status = H5Fclose(file_id);

    return (status < 0) ? 1 : 0;
}
To Probe Further


    HDF5 Home Page
         http://hdf.ncsa.uiuc.edu/HDF5/

    HDF5 Tutorial
         http://www.physics.ohio-state.edu/~wilkins/computing/HDF/hdf5tutorial/index.html


    Enhyper Knowledgebase
         http://www.enhyper.com/lib
         Many grid, finance and operational research related resources
References

   [1] How Much Storage is Enough?
         http://www.acmqueue.org/modules.php?name=Content&pa=showpage&pid=45

   [2] Different Shades of Meaning in the Stock Market
         http://www.enhyper.com/content/kerrraptor.jpg

   [3] Knurr 10/20KW Water-cooled Environments
         http://www.water-cooled-server-rack.com/

More Related Content

What's hot

STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...
STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...
STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...The HDF-EOS Tools and Information Center
 
A Survey on Different File Handling Mechanisms in HDFS
A Survey on Different File Handling Mechanisms in HDFSA Survey on Different File Handling Mechanisms in HDFS
A Survey on Different File Handling Mechanisms in HDFSIRJET Journal
 
Sap technical deep dive in a column oriented in memory database
Sap technical deep dive in a column oriented in memory databaseSap technical deep dive in a column oriented in memory database
Sap technical deep dive in a column oriented in memory databaseAlexander Talac
 
Data Eng Conf NY Nov 2016 Parquet Arrow
Data Eng Conf NY Nov 2016 Parquet ArrowData Eng Conf NY Nov 2016 Parquet Arrow
Data Eng Conf NY Nov 2016 Parquet ArrowJulien Le Dem
 

What's hot (20)

Status of HDF-EOS, Related Software and Tools
Status of HDF-EOS, Related Software and ToolsStatus of HDF-EOS, Related Software and Tools
Status of HDF-EOS, Related Software and Tools
 
HDF Update 2016
HDF Update 2016HDF Update 2016
HDF Update 2016
 
Ensuring Long Term Access to Remotely Sensed HDF4 Data with Layout Maps
Ensuring Long Term Access to Remotely Sensed HDF4 Data with Layout MapsEnsuring Long Term Access to Remotely Sensed HDF4 Data with Layout Maps
Ensuring Long Term Access to Remotely Sensed HDF4 Data with Layout Maps
 
Matlab, Big Data, and HDF Server
Matlab, Big Data, and HDF ServerMatlab, Big Data, and HDF Server
Matlab, Big Data, and HDF Server
 
HDF and netCDF Data Support in ArcGIS
HDF and netCDF Data Support in ArcGISHDF and netCDF Data Support in ArcGIS
HDF and netCDF Data Support in ArcGIS
 
HDFCloud Workshop: HDF5 in the Cloud
HDFCloud Workshop: HDF5 in the CloudHDFCloud Workshop: HDF5 in the Cloud
HDFCloud Workshop: HDF5 in the Cloud
 
STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...
STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...
STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...
 
Introduction to HDF5
Introduction to HDF5Introduction to HDF5
Introduction to HDF5
 
HDF Group Support for NPP/NPOESS/JPSS
HDF Group Support for NPP/NPOESS/JPSSHDF Group Support for NPP/NPOESS/JPSS
HDF Group Support for NPP/NPOESS/JPSS
 
NEON HDF5
NEON HDF5NEON HDF5
NEON HDF5
 
A Survey on Different File Handling Mechanisms in HDFS
A Survey on Different File Handling Mechanisms in HDFSA Survey on Different File Handling Mechanisms in HDFS
A Survey on Different File Handling Mechanisms in HDFS
 
HDF & HDF-EOS Data & Support at NSIDC
HDF & HDF-EOS Data & Support at NSIDCHDF & HDF-EOS Data & Support at NSIDC
HDF & HDF-EOS Data & Support at NSIDC
 
HDF Product Designer
HDF Product DesignerHDF Product Designer
HDF Product Designer
 
The HDF Group: Community models and outreach
The HDF Group: Community models and outreachThe HDF Group: Community models and outreach
The HDF Group: Community models and outreach
 
HDFS Erasure Coding in Action
HDFS Erasure Coding in Action HDFS Erasure Coding in Action
HDFS Erasure Coding in Action
 
ICESat-2 Metadata and Status
ICESat-2 Metadata and StatusICESat-2 Metadata and Status
ICESat-2 Metadata and Status
 
Sap technical deep dive in a column oriented in memory database
Sap technical deep dive in a column oriented in memory databaseSap technical deep dive in a column oriented in memory database
Sap technical deep dive in a column oriented in memory database
 
Data Eng Conf NY Nov 2016 Parquet Arrow
Data Eng Conf NY Nov 2016 Parquet ArrowData Eng Conf NY Nov 2016 Parquet Arrow
Data Eng Conf NY Nov 2016 Parquet Arrow
 
Hadoop
HadoopHadoop
Hadoop
 
HDF5 Roadmap 2019-2020
HDF5 Roadmap 2019-2020HDF5 Roadmap 2019-2020
HDF5 Roadmap 2019-2020
 

Similar to Hdf5

DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization Denodo
 
Grid Asia2008 Low Latency Data Grid
Grid Asia2008 Low Latency Data GridGrid Asia2008 Low Latency Data Grid
Grid Asia2008 Low Latency Data GridJags Ramnarayan
 
th1330-1410effectenbeurszaal4-3v2-140424180955-phpapp01 (1).pdf
th1330-1410effectenbeurszaal4-3v2-140424180955-phpapp01 (1).pdfth1330-1410effectenbeurszaal4-3v2-140424180955-phpapp01 (1).pdf
th1330-1410effectenbeurszaal4-3v2-140424180955-phpapp01 (1).pdfTarekHassan840678
 
IRJET- Generate Distributed Metadata using Blockchain Technology within HDFS ...
IRJET- Generate Distributed Metadata using Blockchain Technology within HDFS ...IRJET- Generate Distributed Metadata using Blockchain Technology within HDFS ...
IRJET- Generate Distributed Metadata using Blockchain Technology within HDFS ...IRJET Journal
 
Monitizing Big Data at Telecom Service Providers
Monitizing Big Data at Telecom Service ProvidersMonitizing Big Data at Telecom Service Providers
Monitizing Big Data at Telecom Service ProvidersDataWorks Summit
 
Monetizing Big Data at Telecom Service Providers
Monetizing Big Data at Telecom Service ProvidersMonetizing Big Data at Telecom Service Providers
Monetizing Big Data at Telecom Service ProvidersDataWorks Summit
 
Internet of Things and Hadoop
Internet of Things and HadoopInternet of Things and Hadoop
Internet of Things and Hadoopaziksa
 
IBM Data Centric Systems & OpenPOWER
IBM Data Centric Systems & OpenPOWERIBM Data Centric Systems & OpenPOWER
IBM Data Centric Systems & OpenPOWERinside-BigData.com
 
Presentation architecting virtualized infrastructure for big data
Presentation   architecting virtualized infrastructure for big dataPresentation   architecting virtualized infrastructure for big data
Presentation architecting virtualized infrastructure for big datasolarisyourep
 
Presentation architecting virtualized infrastructure for big data
Presentation   architecting virtualized infrastructure for big dataPresentation   architecting virtualized infrastructure for big data
Presentation architecting virtualized infrastructure for big dataxKinAnx
 
Solving the Really Big Tech Problems with IoT
 Solving the Really Big Tech Problems with IoT Solving the Really Big Tech Problems with IoT
Solving the Really Big Tech Problems with IoTEric Kavanagh
 
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...Kevin Mao
 
Ogf2008 Grid Data Caching
Ogf2008 Grid Data CachingOgf2008 Grid Data Caching
Ogf2008 Grid Data CachingJags Ramnarayan
 
HPE Solutions for Challenges in AI and Big Data
HPE Solutions for Challenges in AI and Big DataHPE Solutions for Challenges in AI and Big Data
HPE Solutions for Challenges in AI and Big DataLviv Startup Club
 
Saviak lviv ai-2019-e-mail (1)
Saviak lviv ai-2019-e-mail (1)Saviak lviv ai-2019-e-mail (1)
Saviak lviv ai-2019-e-mail (1)Lviv Startup Club
 
A Comprehensive Study on Big Data Applications and Challenges
A Comprehensive Study on Big Data Applications and ChallengesA Comprehensive Study on Big Data Applications and Challenges
A Comprehensive Study on Big Data Applications and Challengesijcisjournal
 
From Single Purpose to Multi Purpose Data Lakes - Broadening End Users
From Single Purpose to Multi Purpose Data Lakes - Broadening End UsersFrom Single Purpose to Multi Purpose Data Lakes - Broadening End Users
From Single Purpose to Multi Purpose Data Lakes - Broadening End UsersDenodo
 
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)Denodo
 

Similar to Hdf5 (20)

BIG DATA
BIG DATABIG DATA
BIG DATA
 
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
 
Grid Asia2008 Low Latency Data Grid
Grid Asia2008 Low Latency Data GridGrid Asia2008 Low Latency Data Grid
Grid Asia2008 Low Latency Data Grid
 
th1330-1410effectenbeurszaal4-3v2-140424180955-phpapp01 (1).pdf
th1330-1410effectenbeurszaal4-3v2-140424180955-phpapp01 (1).pdfth1330-1410effectenbeurszaal4-3v2-140424180955-phpapp01 (1).pdf
th1330-1410effectenbeurszaal4-3v2-140424180955-phpapp01 (1).pdf
 
IRJET- Generate Distributed Metadata using Blockchain Technology within HDFS ...
IRJET- Generate Distributed Metadata using Blockchain Technology within HDFS ...IRJET- Generate Distributed Metadata using Blockchain Technology within HDFS ...
IRJET- Generate Distributed Metadata using Blockchain Technology within HDFS ...
 
Monitizing Big Data at Telecom Service Providers
Monitizing Big Data at Telecom Service ProvidersMonitizing Big Data at Telecom Service Providers
Monitizing Big Data at Telecom Service Providers
 
Monetizing Big Data at Telecom Service Providers
Monetizing Big Data at Telecom Service ProvidersMonetizing Big Data at Telecom Service Providers
Monetizing Big Data at Telecom Service Providers
 
Internet of Things and Hadoop
Internet of Things and HadoopInternet of Things and Hadoop
Internet of Things and Hadoop
 
Bigdata
BigdataBigdata
Bigdata
 
IBM Data Centric Systems & OpenPOWER
IBM Data Centric Systems & OpenPOWERIBM Data Centric Systems & OpenPOWER
IBM Data Centric Systems & OpenPOWER
 
Presentation architecting virtualized infrastructure for big data
Presentation   architecting virtualized infrastructure for big dataPresentation   architecting virtualized infrastructure for big data
Presentation architecting virtualized infrastructure for big data
 
Presentation architecting virtualized infrastructure for big data
Presentation   architecting virtualized infrastructure for big dataPresentation   architecting virtualized infrastructure for big data
Presentation architecting virtualized infrastructure for big data
 
Solving the Really Big Tech Problems with IoT
 Solving the Really Big Tech Problems with IoT Solving the Really Big Tech Problems with IoT
Solving the Really Big Tech Problems with IoT
 
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...
 
Ogf2008 Grid Data Caching
Ogf2008 Grid Data CachingOgf2008 Grid Data Caching
Ogf2008 Grid Data Caching
 
HPE Solutions for Challenges in AI and Big Data
HPE Solutions for Challenges in AI and Big DataHPE Solutions for Challenges in AI and Big Data
HPE Solutions for Challenges in AI and Big Data
 
Saviak lviv ai-2019-e-mail (1)
Saviak lviv ai-2019-e-mail (1)Saviak lviv ai-2019-e-mail (1)
Saviak lviv ai-2019-e-mail (1)
 
A Comprehensive Study on Big Data Applications and Challenges
A Comprehensive Study on Big Data Applications and ChallengesA Comprehensive Study on Big Data Applications and Challenges
A Comprehensive Study on Big Data Applications and Challenges
 
From Single Purpose to Multi Purpose Data Lakes - Broadening End Users
From Single Purpose to Multi Purpose Data Lakes - Broadening End UsersFrom Single Purpose to Multi Purpose Data Lakes - Broadening End Users
From Single Purpose to Multi Purpose Data Lakes - Broadening End Users
 
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)
 

More from Smith Kim

국토․도시 및 개발 관련법령에서의 권한배분에 따른 현황고찰
국토․도시 및 개발 관련법령에서의 권한배분에 따른 현황고찰국토․도시 및 개발 관련법령에서의 권한배분에 따른 현황고찰
국토․도시 및 개발 관련법령에서의 권한배분에 따른 현황고찰Smith Kim
 
도시계획 권한
도시계획 권한도시계획 권한
도시계획 권한Smith Kim
 
도시계획 권한
도시계획 권한도시계획 권한
도시계획 권한Smith Kim
 
심용옥후보공보물(제출본)
심용옥후보공보물(제출본)심용옥후보공보물(제출본)
심용옥후보공보물(제출본)Smith Kim
 
알고리즘트레이딩 전략교육 커리큘럼 v4.0
알고리즘트레이딩 전략교육 커리큘럼 v4.0알고리즘트레이딩 전략교육 커리큘럼 v4.0
알고리즘트레이딩 전략교육 커리큘럼 v4.0Smith Kim
 
알고리즘거래 종합관리방안 회원설명회(20130325)
알고리즘거래 종합관리방안 회원설명회(20130325)알고리즘거래 종합관리방안 회원설명회(20130325)
알고리즘거래 종합관리방안 회원설명회(20130325)Smith Kim
 
2011노3527대신증권(항소심)
2011노3527대신증권(항소심)2011노3527대신증권(항소심)
2011노3527대신증권(항소심)Smith Kim
 
국내자본시장 내에서의 페어트레이딩전략의 효율성
국내자본시장 내에서의 페어트레이딩전략의 효율성국내자본시장 내에서의 페어트레이딩전략의 효율성
국내자본시장 내에서의 페어트레이딩전략의 효율성Smith Kim
 
Accurate time for linux applications
Accurate time for linux applicationsAccurate time for linux applications
Accurate time for linux applicationsSmith Kim
 
주식투자인구
주식투자인구주식투자인구
주식투자인구Smith Kim
 
Spotlight on hft
Spotlight on hftSpotlight on hft
Spotlight on hftSmith Kim
 
금융투자업규정시행세칙(의안)
Hdf5

  • 1. Financial Data Infrastructure with HDF5 Graeme Burnett Head of Analytics Architecture, Deutsche Bank Sep 2004
  • 2. HDF5 - The Big Idea For IT Management Infrastructure Independence Parallel Data Delivery Configurable To Individual Data Set Granularity Limitless Data Storage Optimised Data Storage Szip compression minimises disk usage/maximises revenue Suited to heterogeneous environment Virtual File Layer (VFL) ported to many platforms A Solution to the ever growing “Data Storage” Issue For Analytics/Market Data Developers Potential to capture limitless market depth and generate limitless analytical models Arbitrary precision, multidimensional and user defined data sets Toolkits in many flavours: C, Java, Perl High performance data access Statistical analysis, 3D visualisation and pattern recognition become a reality
  • 3. HDF5 - Virtual File Layer Storage Mechanism is Under Application Control e.g. network, memory, remote file systems, different file systems on the same machine, or special-purpose I/O mechanisms such as streaming I/O, MPI I/O, and buffered I/O. Public APIs so that application developers can write new drivers
  • 4. HDF5 - Data Manipulation Handling Big Data sets Unix file system limit (32-bit) is 2GB - you have to write a program to shift data HDF5 analogous to albumwrap De-aggregation/Re-aggregation of Data Sets Data set can be composed of many discrete sets Easier manipulation of data Sets can be aggregated to form a production set Hyperslabs Data subset selection Transformations during I/O 2D spatial to 3D spatial
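The hyperslab idea on this slide (selecting a regular subset of a data set by start, stride and count) can be sketched in plain C. This is only an illustration of the selection arithmetic over an in-memory, row-major 2D array; the names `slab2d` and `slab2d_read` are hypothetical and are not part of the HDF5 library, which performs the equivalent selection during I/O.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of a 2D selection: not an HDF5 type. */
typedef struct {
    size_t start[2];   /* first selected row and column */
    size_t stride[2];  /* step between selected rows and columns */
    size_t count[2];   /* number of rows and columns selected */
} slab2d;

/*
 * Copy the selected elements of src (rows x cols, row-major) into dst,
 * in row-major order of the selection. Returns the number of elements copied.
 * dst must have room for count[0] * count[1] elements.
 */
size_t slab2d_read(const int *src, size_t rows, size_t cols,
                   const slab2d *sel, int *dst)
{
    size_t n = 0;

    assert(sel->stride[0] > 0 && sel->stride[1] > 0);
    for (size_t i = 0; i < sel->count[0]; i++) {
        size_t r = sel->start[0] + i * sel->stride[0];
        assert(r < rows);                 /* selection must stay in bounds */
        for (size_t j = 0; j < sel->count[1]; j++) {
            size_t c = sel->start[1] + j * sel->stride[1];
            assert(c < cols);
            dst[n++] = src[r * cols + c];
        }
    }
    return n;
}
```

For a 4 x 6 array, a selection with start (1, 2), stride (2, 3) and count (2, 2) picks rows 1 and 3 and columns 2 and 5, so only four of the 24 elements are touched.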
  • 5. HDF5 - File Formats Chunking efficient parallel data access Compression Pluggable compression mechanism (Szip) Extensible Allow for expansion - data can be written later Extensible along any dimension Raw Efficient local I/O bypassing the buffer cache External Taking advantage of different file systems (e.g. JBOD)
  • 6. The Changing Business and Technology Environment The Move Towards Program Trading Prediction of 70% of current activity will be automated High Frequency Finance Models require high frequency data Internet Technologies The Semantic Web: *ML, RDF, RSS Encryption, Digital Value, Electronic Contracts, Business Intelligence, Pseudonymity Emerging marketplaces, virtual economies and communities Defence Technology Commercialisation Traffic Analysis Operational Research Open Source Intelligence Better, Faster Infrastructure and Software Optical Networks LAMP Architecture COTS Hardware: Opteron, 1GE, SATA
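Chunking splits a data set into fixed-size tiles so that independent readers can fetch different tiles in parallel. A minimal sketch of the addressing arithmetic, assuming a 2D data set and row-major layout inside each chunk; `chunk_loc` and `chunk_locate` are hypothetical names for illustration and are not HDF5 library calls.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type: which chunk holds an element, and where. */
typedef struct {
    size_t chunk_row;  /* chunk index along the row dimension */
    size_t chunk_col;  /* chunk index along the column dimension */
    size_t offset;     /* element offset inside the chunk (row-major) */
} chunk_loc;

/* Map element (row, col) to its chunk for chunks of chunk_rows x chunk_cols. */
chunk_loc chunk_locate(size_t row, size_t col,
                       size_t chunk_rows, size_t chunk_cols)
{
    chunk_loc loc;

    assert(chunk_rows > 0 && chunk_cols > 0);
    loc.chunk_row = row / chunk_rows;
    loc.chunk_col = col / chunk_cols;
    loc.offset = (row % chunk_rows) * chunk_cols + (col % chunk_cols);
    return loc;
}
```

With 2 x 3 chunks, element (3, 5) falls in chunk (1, 1) at offset 5, so a reader interested in that element needs only that one chunk rather than the whole data set.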
  • 7. HDF5 Feature Overview HDF5 File Format Public Domain, pioneered by the nuclear science community Robust, mature, standards driven Scalable Data Delivery, Efficient Storage, Data Transformation Virtual file Layer supports “chunked” data sets Raw, Standard, Parallel and Networked I/O Bandwidth configurable per data set Data type and spatial transforms of data or subsets during I/O Szip - high performance compression/decompression Infrastructure agnostic Metadata approach No specialised hardware required Suitable for distributed/lightweight architectures: Grid, COTS
  • 8. Today’s Infrastructure Architecture in the City Record orientated approach to data Flat delimited files, ftp and databases Time series - a black art Poor communication between architecture, infrastructure and developers Centralised/Cross-business infrastructure failures Hardware/Manufacturer Specific Solutions Generic, complex, high-cost, manufacturer lock-in, technological mediocrity Designed to suit Manufacturer’s capability model Lack of Technology Knowledge Management Architecture is vendor and developer driven, should be experience and business driven. No framework for conversion of tacit to explicit knowledge results in loss of operational expertise. Solution: Development portals (CM, bug tracking etc.), TWikis, blogging
  • 9. Grid Infrastructure Centralised Data Centres Centralised data means centralised risk. Extreme risk events would render business continuity planning ineffective. Huge energy requirements (15MW, 5MW of which is cooling) Data is Mobile - Not All Data Needs Enterprise Class Persistence HDF5 makes it easy to forward cache static/reference data calved from master data sets Regional/Divisional/Departmental/Workgroup Real-time computational derivation using FPGAs and/or calculation farms Reduced cost whilst maintaining regulatory compliance Micro-hosting - Crisis Resilient Grid Architecture Software chooses the most appropriate execution environment and marshals data accordingly Each operational site has 20-30 low cost COTS nodes, minimal cooling, energy footprint up to 15KW, multiple network connections. KNURR Secure 10/20KW Water cooled cabinets located across infrastructure [3]
  • 10. HDF5 - Enabling High-Frequency Finance HFF Drivers Advances in statistical and Operational Research techniques Improved access to data sets by academia. Advances in computer infrastructure, reduction in storage costs The need to find a trading edge. Quantitative approach augmented by statistical analysis HDF5 Global Data Repository For Raptor [2] Project under way at Deutsche Bank, part of the Raptor infrastructure Raptor requires massive data sets Providing “early warning” predictive analysis Pre transaction validation of complex derivatives transactions Synthesised Data: yield curves, volatility surfaces (time/strike), vwap News capture Sentiment Analysis Traffic Analysis
  • 11. Statistical Data Modelling Massive Data Sets Required to determine the statistical significance of outliers in data Visualisation Point cloud identification of immediate outliers relationships by parallel co-ordinate plotting
  • 12. Statistical Data Modelling Cluster Analysis Useful to focus further analysis, and data cleansing. Focus on regions in data space where objects are dense Proprietary Filtering Algorithms to determine number and significance of clusters Density Estimation Visualisation of stochastics Optimal means to cluster But - real-time computation is difficult MATLAB particularly rich in methods for this HDF5 enabled
  • 13. Time Series Data Relationships Between Time Series Data Important and potentially profitable in High Frequency Finance Used to determine ideal hedge etc. Exploratory Analysis by Parallel Co-ordinate Plots followed by Econometric methods to determine significance.
  • 14. Point Clouds Visualisation Technique for Large Data Sets a slice of trades from a multidimensional trade data set. Approximately 29K trades being displayed = 2 weeks (VOD).
  • 15. Data Depth Perspective Figures Are Humanly Incomprehensible The World Produced in 1999 [1]: 1.5 exabytes (an exabyte is about 2^60 bytes) of storable content - 1.5 billion gigabytes 250 megabytes for every man, woman, and child on earth. Printed documents of all kinds make up only .003 percent of the total. Magnetic storage is by far the largest medium for storing information and is the most rapidly growing Shipped hard-drive capacity doubling every year. Amount of human generated content - 5TB Financial Market Data LSE Basic set is 14GB for 2 Years (stock, shares, price, bid, ask, flags) Market Depth + News + Traffic Analysis VWAP + Volatilities Many Terabytes required
  • 16. HFF/HDF5 Infrastructure Lightweight Infrastructure Using COTS Components Pattern Recognition/DSS Node - ~$25,000 for 24TB JBOD Node Sparse Data Analysis Analytics/HDF5 node $2700 Data delivery, computation “Throwaway nodes” - reduced hosting costs
  • 18. HDF5 Data Distribution Architecture Tier 1 - Master Data Sets Enterprise grade persistence Satisfy Data Retention Regulations: Gramm-Leach-Bliley - security and confidentiality Sarbanes-Oxley - the need for data retention I/O profile per data set to suit predefined SLA Tier 2 - Derived Regional Data sets Geo-legislative Data Partitioning Tier 3 - Divisional/Departmental Data Sets Reduced infrastructure requirements Forward Caching - data near point of consumption Dataset Enrichment - pattern recognition, aggregation, data set generation Tier 4 - Workstation Spare-cycle computing Specialist enrichment
  • 20. HDF5 Tier Functionality Tier 1 - Master Data Sets Contiguous Tick Data Sets per exchange Data wraps - aggregated small sets Tier 2 - Derived Regional Data sets Pre-partitioned by stock/year/month/week Delivered to lightweight HDF5 near-network server nodes Tick Data collection, Location-based analysis, aggregation and enhancement Data set discovery using Directory Services Tier 3 - Divisional/Departmental Data Sets Forward Caching - data near point of consumption Data set Enrichment - pattern recognition, aggregation, data set generation DSS/Pattern Recognition nodes - JBOD Consumption - high performance I/O HDF5 delivery to T4 Tier 4 - Workstation/Desktop Supercomputer Spare-cycle computing Data enrichment using specialist hardware/architecture (FPGA)
  • 21. Data Discovery Ontologies An ontology is a conceptual model about some domain Relationships that hold between them Characteristics of data Data set Description using Protégé and OWL XML/RDF Metadata Can forward generate Database and XML Schemas Data Classification WEKA - data classification suite written in Java Pattern Recognition News analysis Envelope/Outlier analysis
  • 23. Security Framework Basic Access Control Primitives Advisory security mechanism Read-only, read-write etc - stored in file meta data Host based access control Global/Federated Operation Requires Third-party Access Control Manager Xboost from www.ivis.com Integration to enterprise directories - both .Net and LDAP Service Orientated Architecture:
  • 24. HDF5 Data Format HDF5 File is a container for storing a variety of scientific data is composed of two primary types of objects: groups and data sets. HDF5 Group a grouping structure containing zero or more HDF5 objects, together with supporting metadata HDF5 Data Set a multidimensional array of data elements, together with supporting metadata Similar to working with directories and files in UNIX: an HDF5 object in an HDF5 file is often referred to by its full path name (also called an absolute path name). / signifies the root group. /foo signifies a member of the root group called foo. /foo/zoo signifies a member of the group foo, which in turn is a member of the root group. HDF5 Attribute List Any HDF5 group or data set may have an associated attribute list. An HDF5 attribute is a user- defined HDF5 structure that provides extra information about an HDF5 object.
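Because HDF5 objects are addressed by UNIX-style absolute path names, the "pre-partitioned by stock/year/month/week" layout mentioned for the Tier 2 data sets maps naturally onto a group hierarchy. The sketch below builds such a path in plain C. The specific layout (`/STOCK/YYYY/MM/wN`) and the function name `tick_path` are assumptions made for illustration; the slides do not specify a naming scheme.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build an absolute object path of the (assumed) form /STOCK/YYYY/MM/wN,
 * e.g. /VOD/2004/09/w2. Returns 0 on success, -1 if buf is too small.
 */
int tick_path(char *buf, size_t len, const char *stock,
              int year, int month, int week)
{
    int n;

    assert(buf != NULL && stock != NULL && strlen(stock) > 0);
    assert(month >= 1 && month <= 12);
    n = snprintf(buf, len, "/%s/%04d/%02d/w%d", stock, year, month, week);
    /* snprintf reports the length it wanted; a larger value means truncation. */
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

A path built this way could then be passed to a dataset-open call such as the `H5Dopen(file_id, "/dset")` shown on the later slide, with each path component corresponding to a group.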
  • 25. HDF5 Library
      The HDF5 library provides several interfaces, or APIs. These APIs provide routines for creating, accessing, and manipulating HDF5 files and objects.
      The library itself is implemented in C. HDF5 function wrappers have been developed in Java and Fortran90.
      All C routines in the HDF5 library begin with a prefix of the form H5*, where * is one or two uppercase letters indicating the type of object on which the function operates.
      API
        H5  - Library Functions: general-purpose H5 functions
        H5A - Annotation Interface: attribute access and manipulation routines
        H5D - Data set Interface: data set access and manipulation routines
        H5E - Error Interface: error handling routines
        H5F - File Interface: file access routines
        H5G - Group Interface: group creation and operation routines
        H5I - Identifier Interface: identifier routines
        H5P - Property List Interface: object property list manipulation routines
        H5R - Reference Interface: reference routines
        H5S - Data space Interface: data space definition and access routines
        H5T - Data type Interface: data type creation and manipulation routines
        H5Z - Compression Interface: compression routine(s)
  • 26. HDF5 File Creation

      #include "hdf5.h"
      #define FILE "dset.h5"

      int main(void) {
         hid_t   file_id, dataset_id, dataspace_id;  /* identifiers */
         hsize_t dims[2];
         herr_t  status;

         /* Create a new file using default properties. */
         file_id = H5Fcreate(FILE, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

         /* Create the data space for the dataset. */
         dims[0] = 4;
         dims[1] = 6;
         dataspace_id = H5Screate_simple(2, dims, NULL);

         /* Create the dataset (HDF5 1.6-style call; later releases take
            additional property-list arguments). */
         dataset_id = H5Dcreate(file_id, "/dset", H5T_STD_I32BE, dataspace_id, H5P_DEFAULT);

         /* End access to the dataset and release resources used by it. */
         status = H5Dclose(dataset_id);

         /* Terminate access to the data space. */
         status = H5Sclose(dataspace_id);

         /* Close the file. */
         status = H5Fclose(file_id);
         return 0;
      }
  • 27. HDF5 Reading and Writing Existing Datasets

      /* Writing and reading an existing dataset. */
      #include "hdf5.h"
      #define FILE "dset.h5"

      int main(void) {
         hid_t  file_id, dataset_id;  /* identifiers */
         herr_t status;
         int    i, j, dset_data[4][6];

         /* Initialise the dataset. */
         for (i = 0; i < 4; i++)
            for (j = 0; j < 6; j++)
               dset_data[i][j] = i * 6 + j + 1;

         /* Open an existing file. */
         file_id = H5Fopen(FILE, H5F_ACC_RDWR, H5P_DEFAULT);

         /* Open an existing dataset (HDF5 1.6-style call). */
         dataset_id = H5Dopen(file_id, "/dset");

         /* Write the dataset. */
         status = H5Dwrite(dataset_id, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, dset_data);

         /* Read the dataset back. */
         status = H5Dread(dataset_id, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, dset_data);

         /* Close the dataset. */
         status = H5Dclose(dataset_id);

         /* Close the file. */
         status = H5Fclose(file_id);
         return 0;
      }
  • 28. To Probe Further
      HDF5 Home Page: http://hdf.ncsa.uiuc.edu/HDF5/
      HDF5 Tutorial: http://www.physics.ohio-state.edu/~wilkins/computing/HDF/hdf5tutorial/index.html
      Enhyper Knowledgebase: http://www.enhyper.com/lib (many grid, finance and operational research related resources)
  • 29. References
      [1] How Much Storage is Enough? http://www.acmqueue.org/modules.php?name=Content&pa=showpage&pid=45
      [2] Different Shades of Meaning in the Stock Market http://www.enhyper.com/content/kerrraptor.jpg
      [3] Knurr 10/20KW Water-cooled Environments http://www.water-cooled-server-rack.com/