GEN: A Database Interface Generator for HPC Programs
Quan Pham, Tanu Malik
Computation Institute
University of Chicago and Argonne National Laboratory
tanum@uchicago.edu
Monday, June 29, 15
Computational Science Challenges
• Science is iterative and data-intensive
• Simulate-Analyze-Simulate
• Supercomputing time is precious
• So reduce time for data analysis
• Challenges
• Slow IO and network bandwidth
• How to remove redundant and IO-intensive steps?
Computational Cosmology (current state of the art)
• Data sizes shrink as analysis proceeds
• Simulation output is a few petabytes
• Analysis data ranges from TBs down to tens of GBs
[Figure: Simulate-Analyze workflow. Labels: Simulations, PVFS, FS, Scripts (Halo/sub-halo Identification, Halo Merger Trees, Identifying Galaxies), Database, Visualization.]
Madduri, Malik, Habib, et al. PDACS: A Portal for Data Analysis Services for Cosmological Simulations. In: ACM Computing in Science and Engineering, 2015
Computational Cosmology (envisioned state of the art)
• Database-friendly analysis within the DB
• Mechanisms for direct IO to databases
[Figure: envisioned Simulate-Analyze workflow. Labels: Simulations, PVFS, Scripts (Halo/sub-halo Identification, Halo Merger Trees, Identifying Galaxies), Database, Visualization.]
Related Work
• The SciHadoop System
• Move data between file-systems (from PVFS to HDFS)
• Simple sub-selection and aggregation queries
• Querying using HPC libraries
• Transform sub-selection and aggregation SQL queries
into parallel IO operations using parallel-NetCDF
• Requires user to perform parallel data management
• Database-as-a-service
• RESTful service approach works well only for small amounts of data
GEN: A Database Interface Generator
• Takes user-supplied C declarations and provides an interface to load data into, and access data from, common scientific array databases
GEN Overview
MPI_Init ( &argc, &argv );
MPI_Comm_rank ( MPI_COMM_WORLD, &id );
MPI_Comm_size ( MPI_COMM_WORLD, &p );
/* Only rank 0 writes the m-by-n table, one row per line. */
if ( id == 0 )
{
  x_file = fopen ( "x_data.txt", "w" );
  for ( j = 0; j < n; j++ )
  {
    for ( i = 0; i < m; i++ )
      fprintf ( x_file, " %24.16g", table[i+j*m] );
    fprintf ( x_file, "\n" );
  }
  fclose ( x_file );
}
Figure 2: A sample program I/O
[...] (ld.so) to load the specified shared libraries. Symbol definitions in LD_PRELOADed libraries will be found before definitions in non-LD_PRELOADed libraries, so if an LD_PRELOADed library contains a definition of, say, fopen, then that code will be executed when the process references fopen. In particular, this means that we can package our wrapper functions into a shared library (.so file) and use LD_PRELOAD to cause them to be loaded.
The LD_PRELOAD directive helps to direct system calls to GEN's wrapper functions, but at run-time GEN's wrapper functions still do not have the handle to the original system call functions, the arguments of which are necessary to create corresponding database DDL statements in the wrapper function and then execute them.
3.2 Creating Database DDL statements
To create database DDL statements we need the handle to the original system call function, and we use the parameters of the original system call function to create appropriate database DDL statements. On Unix platforms, the dlsym function dlsym(void* handle, const char* symbol) returns the address of the symbol given as its second argument, searching the dynamic or shared library accessible through the handle given as its first argument. Thus, assuming the only definitions of fopen are in our shared library and in the system C library, [...]
MPI_Init ( &argc, &argv );
MPI_Comm_rank ( MPI_COMM_WORLD, &id );
MPI_Comm_size ( MPI_COMM_WORLD, &p );
if ( id == 0 )
{
  CREATE TABLE "x_data.txt"
  for ( j = 0; j < n; j++ )
  {
    for ( i = 0; i < m; i++ )
      INSERT into "x_data.txt" VALUES (table[i+j*m])
    fprintf ( x_file, "\n" );
  }
  fclose ( x_file );
}
Figure 3: Incorrect Database Interface of Sample Program
To generate the correct interface, GEN relies on (a) run-time wrapping of POSIX I/O calls, (b) run-time bookkeeping information for creation of database DDL statements, [...]
Issues
• IO patterns
• N-1 non-strided or N-1 strided or N-N
• Scattered IO calls
• No direct mapping between POSIX calls and
DB statements
• Direct C POSIX calls to DB statements
• Constraints:
• No source code changes
GEN
• Assumptions
• Serial IO for now
• IO done contiguously within the
context of a single function
GEN-I
• Map system calls to DDL statements without changing source code
• LD_PRELOAD: load a shared library before all others
• Use LD_PRELOAD to force-load a library that has GEN's implementation of fopen(), fwrite(), fread()
fopen("x_data.txt", "w")
Create <Database Object> x_data.txt.tmp <structure>
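The interception step can be sketched as a small wrapper library. This is a minimal sketch, not GEN's actual source: the helper gen_tmp_name, the 256-byte name buffer, and the log message are assumptions made for illustration, and the database call is replaced by a message on stderr. It assumes a glibc-style dynamic linker that supports RTLD_NEXT.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE              /* exposes RTLD_NEXT on glibc */
#endif
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical helper: derive the temporary object name,
   e.g. "x_data.txt" -> "x_data.txt.tmp".
   Returns 0 on success, -1 if the name does not fit in out. */
int gen_tmp_name(const char *path, char *out, size_t out_len)
{
    int n = snprintf(out, out_len, "%s.tmp", path);
    return (n < 0 || (size_t)n >= out_len) ? -1 : 0;
}

/* Same signature as the C library's fopen. When this file is built as a
   shared library and named in LD_PRELOAD, the dynamic linker resolves the
   program's fopen calls to this definition first. */
FILE *fopen(const char *path, const char *mode)
{
    /* Find the next definition of fopen in load order (the C library). */
    static FILE *(*real_fopen)(const char *, const char *);
    if (real_fopen == NULL)
        real_fopen = (FILE *(*)(const char *, const char *))dlsym(RTLD_NEXT, "fopen");
    if (real_fopen == NULL)
        return NULL;

    /* Placeholder for the database side: a real wrapper would issue the
       CREATE statement here; this sketch only logs the object name. */
    char name[256];
    if (mode != NULL && mode[0] == 'w' && gen_tmp_name(path, name, sizeof name) == 0)
        fprintf(stderr, "GEN sketch: would create database object %s\n", name);

    /* Forward to the original function so the program behaves as before. */
    return real_fopen(path, mode);
}
```

With hypothetical file names, the library could be built with gcc -shared -fPIC gen_wrap.c -o libgen_wrap.so -ldl and activated with LD_PRELOAD=./libgen_wrap.so ./program, leaving the program's source and binary untouched.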
GEN-II
• Schema-later: no structure initially, reorganize later
fopen("x_data.txt", "w")
Create Table x_data.txt.tmp <i int; x double>;
Create Array x_data.txt.tmp <x:double> [i=0:*]
• OS-DB impedance mismatch
• Not every fwrite() translates to an INSERT statement
• fwrite() calls are buffered before being sent to the filesystem
• A single statement is issued when the data is flushed
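The buffering idea can be illustrated with a short sketch. It assumes the temporary table has the (i, x) layout shown above. The names gen_buffer, gen_buffer_add, gen_buffer_flush and the fixed capacity are hypothetical, and the statement is only rendered as text rather than sent to a database.

```c
#include <stdio.h>

#define GEN_MAX_ROWS 1024        /* hypothetical buffer capacity */

/* Hypothetical bookkeeping for one open file: values captured from
   intercepted writes, held until the next flush. */
typedef struct {
    double rows[GEN_MAX_ROWS];
    size_t count;
} gen_buffer;

/* Record one value instead of issuing an INSERT per write.
   Returns 0 on success, -1 if the buffer is full. */
int gen_buffer_add(gen_buffer *b, double x)
{
    if (b->count >= GEN_MAX_ROWS)
        return -1;
    b->rows[b->count++] = x;
    return 0;
}

/* On flush, render all buffered values as one multi-row INSERT into the
   temporary table and empty the buffer. Returns the statement length,
   0 if there was nothing to flush, or -1 if out is too small. */
int gen_buffer_flush(gen_buffer *b, const char *table, char *out, size_t out_len)
{
    if (out_len == 0)
        return -1;
    out[0] = '\0';
    if (b->count == 0)
        return 0;

    int n = snprintf(out, out_len, "INSERT INTO \"%s\" VALUES ", table);
    if (n < 0 || (size_t)n >= out_len)
        return -1;
    size_t used = (size_t)n;

    for (size_t i = 0; i < b->count; i++) {
        /* Each row is (index, value), separated by commas. */
        n = snprintf(out + used, out_len - used, "%s(%zu, %g)",
                     i ? ", " : "", i, b->rows[i]);
        if (n < 0 || (size_t)n >= out_len - used)
            return -1;
        used += (size_t)n;
    }
    b->count = 0;
    return (int)used;
}
```

Batching at flush time mirrors what the C library already does for file writes, so the number of database statements follows the number of flushes rather than the number of fwrite() calls.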
Knowing the Structure
• User-defined
• Header files
• Allow some system calls to create sample
files
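One way to picture the header-file route is a table of field descriptors turned into a CREATE TABLE statement. The sketch below is an assumption about how such a mapping might look, not GEN's implementation: gen_field, gen_sql_type, and gen_create_table are hypothetical names, only int and double are handled, and the column types follow PostgreSQL naming.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical description of one field of a user-declared C struct. */
typedef struct {
    const char *name;
    const char *c_type;          /* "int" or "double" in this sketch */
} gen_field;

/* Map a C type name to a SQL column type; NULL if unsupported. */
const char *gen_sql_type(const char *c_type)
{
    if (strcmp(c_type, "int") == 0)    return "integer";
    if (strcmp(c_type, "double") == 0) return "double precision";
    return NULL;
}

/* Render CREATE TABLE "<table>" (<name> <type>, ...) into out.
   Returns the statement length, or -1 on an unsupported type
   or a buffer that is too small. */
int gen_create_table(const char *table, const gen_field *f, size_t nf,
                     char *out, size_t out_len)
{
    int n = snprintf(out, out_len, "CREATE TABLE \"%s\" (", table);
    if (n < 0 || (size_t)n >= out_len)
        return -1;
    size_t used = (size_t)n;

    for (size_t i = 0; i < nf; i++) {
        const char *sql_type = gen_sql_type(f[i].c_type);
        if (sql_type == NULL)
            return -1;
        n = snprintf(out + used, out_len - used, "%s%s %s",
                     i ? ", " : "", f[i].name, sql_type);
        if (n < 0 || (size_t)n >= out_len - used)
            return -1;
        used += (size_t)n;
    }

    /* Close the column list. */
    n = snprintf(out + used, out_len - used, ")");
    if (n < 0 || (size_t)n >= out_len - used)
        return -1;
    return (int)(used + (size_t)n);
}
```

For the two-field layout used earlier, fields {i, int} and {x, double} yield CREATE TABLE "x_data.txt.tmp" (i integer, x double precision).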
Experiments on Merger Trees
• Building merger trees in a DB system
• Built from halo particles, i.e. particles in a given halo
• Several time-steps
• Perform a particle merge-join between halos of timesteps t_{i-1} and t_i
• Lineage queries to answer ancestors and progenitors
• MPI program
• Streams data to a single node, which streams to GEN
• Postgres and SciDB interfaces
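In the experiments the particle merge-join runs inside the database; the slides do not show its implementation. As an illustration of the technique only, the sketch below counts the particles shared by two halos, assuming each halo's particle IDs are available as a sorted, duplicate-free array. The function name shared_particles is hypothetical.

```c
#include <stddef.h>

/* Merge-join over two sorted, duplicate-free arrays of particle IDs:
   one halo at timestep t_{i-1} and one at timestep t_i.
   Returns how many particle IDs appear in both. */
size_t shared_particles(const long *prev, size_t n_prev,
                        const long *curr, size_t n_curr)
{
    size_t i = 0, j = 0, shared = 0;
    while (i < n_prev && j < n_curr) {
        if (prev[i] < curr[j]) {
            i++;                 /* ID only in the earlier halo */
        } else if (prev[i] > curr[j]) {
            j++;                 /* ID only in the later halo */
        } else {
            shared++;            /* same particle in both halos */
            i++;
            j++;
        }
    }
    return shared;
}
```

A halo at t_i that shares many particles with a halo at t_{i-1} is a candidate link in the merger tree; the threshold for such a link is not given in the slides.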
Experimental Setup
• Time for reorganizing the data
• Reorganizing is cheaper in the DB than
outside
[Paper excerpt, left edge cropped on the slide. Legible points: structure declarations specified in header files supply attributes and type information; nested structures are flattened out as attributes and joined by a foreign key; because of the operating system-database mismatch, a fwrite() in a C program does not correspond directly to an INSERT/LOAD statement, since system libraries buffer streaming writes before sending them to the OS kernel.]
[...] variation of 0.02. The redirection time is not influenced by increasing the number of elements to write in the program, since the libraries are already loaded in memory. The variation arises when other programs execute, which can cause the library to be offloaded from memory so that it needs to be reloaded.
In the second experiment, we measured the space vs. time overhead of reorganizing the data. GEN does not save any space. Without GEN, the binary data is written out to files and then again within the database. With GEN, the data is written out to temporary tables within the database and then reorganized within the database. However, as Table 1 shows, there are significant savings in reorganizing the data within the database versus reorganizing it prior to loading the database.
Table 1 shows the time it takes to load increasingly large datasets from files into a relational and an array database, versus the time it takes to load the same data from a temporary table into a formatted table plus the time of loading the initial binary data into a temporary table.
Table 1: Reorganization times
Data Size   Postgres                    SciDB
(in GB)     Load from    Reorganize     Load from    Reorganize
            Files        within DB      Files        within DB
1           2.7          1.2            8.4          3.2
5           6.8          2.5            14.3         7.6
10          10.56        4.7            22.6         8.9
• Overhead of redirection
• Dynamically-linked library may get offloaded
• 0.04 ms ± 0.02 ms
Conclusions and Current Work
• A database interface generator library for data-intensive scientific applications that requires no source code changes; data is loaded first and reorganized later.
• Relax the assumptions
• Loading is still an issue
• Source-code release
• Thank you!
• Acknowledgements
• Salman Habib, Katrin Heitmann, Steve
Rangel, Hal Finkel, Ravi Madduri
A vision
[Figure: an HPC cluster connected to database servers over a high-bandwidth network.]
