SlideShare a Scribd company logo
GEN:A Database
Interface Generator
for
HPC Programs
Quan Pham,Tanu Malik
Computation Institute
University of Chicago and Argonne National Laboratory
tanum@uchicago.edu
Monday, June 29, 15
Computational Science
Challenges
• Science is iterative and data-intensive
• Simulate-Analyze-Simulate
• Supercomputing time is precious
• So reduce time for data analysis
• Challenges
• Slow IO and network bandwidth
• How to remove redundant and IO-
intensive steps?
Monday, June 29, 15
Computational Cosmology
(current state-of-art)
• Data sizes reduce as analysis proceeds
• Simulation size is few Petabytes
• Analysis data may range from TBs to 10s of GBs
Simulations
Scripts
Visualization
Database
Analyze
Halo/sub-halo
Identification
Halo Merger
Trees
Identifying
GalaxiesPVFS
FS
Simulate
Madduri, Malik, Habib, et. al. PDACS: A Portal for Data Analysis Services for Cosmological Simulations. In: ACM
Computing In Science and Engineering, 2015
Monday, June 29, 15
Computational Cosmology
(envisioned state-of-art)
• Database-friendly analysis within the DB
• Mechanisms for direct IO to databases
Simulations
Scripts
Visualization
Database
Analyze
Halo/sub-halo
Identification
Halo Merger
Trees
Identifying
GalaxiesPVFS
Simulate
Monday, June 29, 15
Related Work
• The SciHadoop System
• Move data between file-systems (from PVFS to HDFS)
• Simple sub-selection and aggregation queries
• Querying using HPC libraries
• Transform sub-selection and aggregation SQL queries
into parallel IO operations using parallel-NetCDF
• Requires user to perform parallel data management
• Database-as-a-service
• RESTful service approach good for small amount of
data
Monday, June 29, 15
GEN:A Database
Interface Generator
• Takes user-supplied C declarations and
provides an interface to load into and
access data from common scientific array
databases
Monday, June 29, 15
GEN Overview
MPI_Init ( &argc, &argv );
MPI_Comm_rank ( MPI_COMM_WORLD, &id );
MPI_Comm_size ( MPI_COMM_WORLD, &p );
if ( rank == 0 )
{
x_file = fopen ( "x_data.txt", "w" );
for ( j = 0; j < n; j++ )
{
for ( i = 0; i < m; i++ )
fprintf ( output, " %24.16g", table[i+j*m] );
fprintf ( output, "n" );
}
fclose ( x_file );
}
Figure 2: A sample program I/O
MPI_Init ( &argc, &argv );
MPI_Comm_rank ( MPI_COMM_WORLD, &id );
MPI_Comm_size ( MPI_COMM_WORLD, &p );
if ( rank == 0 )
(ld.so) to load the specified shared libraries. Symbol defin
tions in LD_PRELOADed libraries will be found before defin
tions in non-LD_PRELOADed libraries, so if an LD_PRELOADe
library contains a definition of, say, fopen, then that cod
will be executed when the process references fopen. In pa
ticular, this means that we can package our wrapper fun
tions into a shared library (.so file) and use LD_PRELOAD t
cause them to be loaded.
The LD_PRELOAD directive helps to direct system calls t
GEN’s wrapper functions, but at run-time GEN’s wrapper fun
tions still does not have the handle to the original syste
call functions, the arguments of which are necessary to cr
ate corresponding database DDL statements in the wrapp
function and then execute them.
3.2 Creating Database DDL statements
To create database DDL statements we need the handle t
the original system call function and use the parameters
the original system call function create appropriate databa
DDL statements. On Unix platforms, the dlsym functio
dlsym(void* handle, char* symbol) returns the addre
of the symbol given as its second argument, searching the dy
namic or shared library accessible through the handle give
as its first argument. Thus, assuming the only definitions
fopen are in our shared library and in the system C librar
fprintf ( output, " %24.16g", table[i+j*m] );
fprintf ( output, "n" );
}
fclose ( x_file );
}
Figure 2: A sample program I/O
MPI_Init ( &argc, &argv );
MPI_Comm_rank ( MPI_COMM_WORLD, &id );
MPI_Comm_size ( MPI_COMM_WORLD, &p );
if ( rank == 0 )
{
CREATE TABLE "x_data.txt"
for ( j = 0; j < n; j++ )
{
for ( i = 0; i < m; i++ )
INSERT into "x_data.txt" VALUES (table[i+j*m])
fprintf ( output, "n" );
}
fclose ( x_file );
}
Figure 3: Incorrect Database Interface of Sample
Program
To generate the correct interface, GEN relies on (a) run-
time wrapping of POSIX I/O calls, (b) run-time bookkeep-
ing information for creation of database DDL statements,Monday, June 29, 15
Issues
• IO patterns
• N-1 non-strided or N-1 strided or N-N
• Scattered IO calls
• No direct mapping between POSIX calls and
DB statements
• Direct C POSIX calls to DB statements
• Constraints:
• No source code changes
Monday, June 29, 15
GEN
• Assumptions
• Serial IO for now
• IO done contiguously within the
context of a single function
Monday, June 29, 15
GEN-1
• Map system calls to DDL statements without
changing source code
• LD_PRELOAD: load a shared library
• Use LD_PRELOAD to force load a library
that has GEN’s implementation of fopen(),
fwrite(), fread()
fopen("x_data.txt", w")
Create <Database Object> x_data.txt.tmp <structure>
Monday, June 29, 15
GEN-1I
• Schema-later; no structure initially, reorganize later
fopen("x_data.txt", w")
Create Table x_data.txt.tmp <i int; x double>;
Create Array x_data.txt.tmp <x:double> [i=0:*]
• OS-DB impedance mismatch
• Not every fwrite() translates to INSERT
statement
• fwrites are buffered before being sent to
filesystem
• A single statement when data is being flushed
Monday, June 29, 15
Knowing the Structure
• User-defined
• Header files
• Allow some system calls to create sample
files
Monday, June 29, 15
Experiments on Merger Trees
• Building merger-
trees in DB system
• Built from halo particles,
i.e. particles in a given
halo
• Several time-steps
• Perform a particle
merge-join between
halos of timestep ti-1 and
ti
• Lineage queries to
answer ancestors and
progenitors
• MPI program
• Streams data to a single node,
which streams to GEN
• Postgres and SciDB interfaces
Monday, June 29, 15
Experimental Setup
• Time for reorganizing the data
• Reorganizing is cheaper in the DB than
outside
n, such as
e), 1, fout);
re to which data belongs to,
aration specified in header
tributes and type informa-
e flattened out as attributes
ase, and as attributes of a
ray database. Thus nested
ined by a foreign key. Using
bookkeeping we can create
ntf statements and sprintf
ctly From Program to
responding INSERT/LOAD
he operating system-database
ere in a fwrite() in a C pro-
INSERT/LOAD statement
of system libraries in which
ng to the OS kernel and the
treaming writes to improve
in this performance advan-
atement for the fwrite, but
nto the temporary table of
to write. Thus for the pre-
variation of 0.02. The redirection time is not influenced by
increasing the number of elements to write in the program,
since the libraries are already loaded in memory. The varia-
tion comes if we execute other programs, which can o✏oad
the library from memory and it needs to reloaded again.
In the second experiment, we measured the space vs time
overhead of reorganizing the data. In GEN we have not saved
any space. Without GEN the binary data is written out on
files and then within the database. With GEN the data is
written out to temporary tables within the database and
then reorganized within the database. However, as Table 1
shows that there is significant savings in terms of reorganiz-
ing the data within the database versus reorganizing prior
to loading the database.
Table shows the time it takes to load increasingly larger
sized datasets into a relational and an array database vs how
much time it takes to load the same data from a temporary
table into a formatted table plus the time of loading the
initial binary data into a temporary table.
Table 1: Reorganization times
Data Size Postgres SciDB
(in GB)
Load from
Files
Reorganize
within DB
Load from
Files
Reorganize
within DB
1 2.7 1.2 8.4 3.2
5 6.8 2.5 14.3 7.6
10 10.56 4.7 22.6 8.9
• Overhead of redirection
• Dynamically-linked library may get off-
loaded
• 0.04 ms + 0.02ms-
Monday, June 29, 15
Conclusions and
Current Work
• A database generator library for data-
intensive scientific applications that
requires no source code changes, which
loads to be reorganized later.
• Relax the assumptions
• Loading is still an issue
• Source-code release
Monday, June 29, 15
• ThankYou!
• Acknowledgements
• Salman Habib, Katrin Heitmann, Steve
Rangel, Hal Finkel, Ravi Madduri
Monday, June 29, 15
Monday, June 29, 15
A vision
HPC Cluster
High-bandwidth
Network
Database
Servers
Monday, June 29, 15

More Related Content

What's hot

Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for Science
Ian Foster
 
Data Tribology: Overcoming Data Friction with Cloud Automation
Data Tribology: Overcoming Data Friction with Cloud AutomationData Tribology: Overcoming Data Friction with Cloud Automation
Data Tribology: Overcoming Data Friction with Cloud Automation
Ian Foster
 
Scaling collaborative data science with Globus and Jupyter
Scaling collaborative data science with Globus and JupyterScaling collaborative data science with Globus and Jupyter
Scaling collaborative data science with Globus and Jupyter
Ian Foster
 
Big linked geospatial data tools in ExtremeEarth-phiweek19
Big linked geospatial data tools in ExtremeEarth-phiweek19Big linked geospatial data tools in ExtremeEarth-phiweek19
Big linked geospatial data tools in ExtremeEarth-phiweek19
ExtremeEarth
 
The Materials Project - Combining Science and Informatics to Accelerate Mater...
The Materials Project - Combining Science and Informatics to Accelerate Mater...The Materials Project - Combining Science and Informatics to Accelerate Mater...
The Materials Project - Combining Science and Informatics to Accelerate Mater...
University of California, San Diego
 
Many Task Applications for Grids and Supercomputers
Many Task Applications for Grids and SupercomputersMany Task Applications for Grids and Supercomputers
Many Task Applications for Grids and Supercomputers
Ian Foster
 
Materials Data Facility: Streamlined and automated data sharing, discovery, ...
Materials Data Facility: Streamlined and automated data sharing,  discovery, ...Materials Data Facility: Streamlined and automated data sharing,  discovery, ...
Materials Data Facility: Streamlined and automated data sharing, discovery, ...
Ian Foster
 
Atomate: a high-level interface to generate, execute, and analyze computation...
Atomate: a high-level interface to generate, execute, and analyze computation...Atomate: a high-level interface to generate, execute, and analyze computation...
Atomate: a high-level interface to generate, execute, and analyze computation...
Anubhav Jain
 
MAVRL Workshop 2014 - Python Materials Genomics (pymatgen)
MAVRL Workshop 2014 - Python Materials Genomics (pymatgen)MAVRL Workshop 2014 - Python Materials Genomics (pymatgen)
MAVRL Workshop 2014 - Python Materials Genomics (pymatgen)
University of California, San Diego
 
Big Data Science with H2O in R
Big Data Science with H2O in RBig Data Science with H2O in R
Big Data Science with H2O in R
Anqi Fu
 
Bioclouds CAMDA (Robert Grossman) 09-v9p
Bioclouds CAMDA (Robert Grossman) 09-v9pBioclouds CAMDA (Robert Grossman) 09-v9p
Bioclouds CAMDA (Robert Grossman) 09-v9p
Robert Grossman
 
Project Matsu: Elastic Clouds for Disaster Relief
Project Matsu: Elastic Clouds for Disaster ReliefProject Matsu: Elastic Clouds for Disaster Relief
Project Matsu: Elastic Clouds for Disaster Relief
Robert Grossman
 
The Materials Project Ecosystem - A Complete Software and Data Platform for M...
The Materials Project Ecosystem - A Complete Software and Data Platform for M...The Materials Project Ecosystem - A Complete Software and Data Platform for M...
The Materials Project Ecosystem - A Complete Software and Data Platform for M...
University of California, San Diego
 
The DuraMat Data Hub and Analytics Capability: A Resource for Solar PV Data
The DuraMat Data Hub and Analytics Capability: A Resource for Solar PV DataThe DuraMat Data Hub and Analytics Capability: A Resource for Solar PV Data
The DuraMat Data Hub and Analytics Capability: A Resource for Solar PV Data
Anubhav Jain
 
Research Automation for Data-Driven Discovery
Research Automationfor Data-Driven DiscoveryResearch Automationfor Data-Driven Discovery
Research Automation for Data-Driven Discovery
Globus
 
04 open source_tools
04 open source_tools04 open source_tools
04 open source_tools
Marco Quartulli
 
The Galaxy bioinformatics workflow environment
The Galaxy bioinformatics workflow environmentThe Galaxy bioinformatics workflow environment
The Galaxy bioinformatics workflow environment
Rutger Vos
 
Reproducible Workflow with Cytoscape and Jupyter Notebook
Reproducible Workflow with Cytoscape and Jupyter NotebookReproducible Workflow with Cytoscape and Jupyter Notebook
Reproducible Workflow with Cytoscape and Jupyter Notebook
Keiichiro Ono
 
Building Reproducible Network Data Analysis / Visualization Workflows
Building Reproducible Network Data Analysis / Visualization WorkflowsBuilding Reproducible Network Data Analysis / Visualization Workflows
Building Reproducible Network Data Analysis / Visualization Workflows
Keiichiro Ono
 
Automating materials science workflows with pymatgen, FireWorks, and atomate
Automating materials science workflows with pymatgen, FireWorks, and atomateAutomating materials science workflows with pymatgen, FireWorks, and atomate
Automating materials science workflows with pymatgen, FireWorks, and atomate
Anubhav Jain
 

What's hot (20)

Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for Science
 
Data Tribology: Overcoming Data Friction with Cloud Automation
Data Tribology: Overcoming Data Friction with Cloud AutomationData Tribology: Overcoming Data Friction with Cloud Automation
Data Tribology: Overcoming Data Friction with Cloud Automation
 
Scaling collaborative data science with Globus and Jupyter
Scaling collaborative data science with Globus and JupyterScaling collaborative data science with Globus and Jupyter
Scaling collaborative data science with Globus and Jupyter
 
Big linked geospatial data tools in ExtremeEarth-phiweek19
Big linked geospatial data tools in ExtremeEarth-phiweek19Big linked geospatial data tools in ExtremeEarth-phiweek19
Big linked geospatial data tools in ExtremeEarth-phiweek19
 
The Materials Project - Combining Science and Informatics to Accelerate Mater...
The Materials Project - Combining Science and Informatics to Accelerate Mater...The Materials Project - Combining Science and Informatics to Accelerate Mater...
The Materials Project - Combining Science and Informatics to Accelerate Mater...
 
Many Task Applications for Grids and Supercomputers
Many Task Applications for Grids and SupercomputersMany Task Applications for Grids and Supercomputers
Many Task Applications for Grids and Supercomputers
 
Materials Data Facility: Streamlined and automated data sharing, discovery, ...
Materials Data Facility: Streamlined and automated data sharing,  discovery, ...Materials Data Facility: Streamlined and automated data sharing,  discovery, ...
Materials Data Facility: Streamlined and automated data sharing, discovery, ...
 
Atomate: a high-level interface to generate, execute, and analyze computation...
Atomate: a high-level interface to generate, execute, and analyze computation...Atomate: a high-level interface to generate, execute, and analyze computation...
Atomate: a high-level interface to generate, execute, and analyze computation...
 
MAVRL Workshop 2014 - Python Materials Genomics (pymatgen)
MAVRL Workshop 2014 - Python Materials Genomics (pymatgen)MAVRL Workshop 2014 - Python Materials Genomics (pymatgen)
MAVRL Workshop 2014 - Python Materials Genomics (pymatgen)
 
Big Data Science with H2O in R
Big Data Science with H2O in RBig Data Science with H2O in R
Big Data Science with H2O in R
 
Bioclouds CAMDA (Robert Grossman) 09-v9p
Bioclouds CAMDA (Robert Grossman) 09-v9pBioclouds CAMDA (Robert Grossman) 09-v9p
Bioclouds CAMDA (Robert Grossman) 09-v9p
 
Project Matsu: Elastic Clouds for Disaster Relief
Project Matsu: Elastic Clouds for Disaster ReliefProject Matsu: Elastic Clouds for Disaster Relief
Project Matsu: Elastic Clouds for Disaster Relief
 
The Materials Project Ecosystem - A Complete Software and Data Platform for M...
The Materials Project Ecosystem - A Complete Software and Data Platform for M...The Materials Project Ecosystem - A Complete Software and Data Platform for M...
The Materials Project Ecosystem - A Complete Software and Data Platform for M...
 
The DuraMat Data Hub and Analytics Capability: A Resource for Solar PV Data
The DuraMat Data Hub and Analytics Capability: A Resource for Solar PV DataThe DuraMat Data Hub and Analytics Capability: A Resource for Solar PV Data
The DuraMat Data Hub and Analytics Capability: A Resource for Solar PV Data
 
Research Automation for Data-Driven Discovery
Research Automationfor Data-Driven DiscoveryResearch Automationfor Data-Driven Discovery
Research Automation for Data-Driven Discovery
 
04 open source_tools
04 open source_tools04 open source_tools
04 open source_tools
 
The Galaxy bioinformatics workflow environment
The Galaxy bioinformatics workflow environmentThe Galaxy bioinformatics workflow environment
The Galaxy bioinformatics workflow environment
 
Reproducible Workflow with Cytoscape and Jupyter Notebook
Reproducible Workflow with Cytoscape and Jupyter NotebookReproducible Workflow with Cytoscape and Jupyter Notebook
Reproducible Workflow with Cytoscape and Jupyter Notebook
 
Building Reproducible Network Data Analysis / Visualization Workflows
Building Reproducible Network Data Analysis / Visualization WorkflowsBuilding Reproducible Network Data Analysis / Visualization Workflows
Building Reproducible Network Data Analysis / Visualization Workflows
 
Automating materials science workflows with pymatgen, FireWorks, and atomate
Automating materials science workflows with pymatgen, FireWorks, and atomateAutomating materials science workflows with pymatgen, FireWorks, and atomate
Automating materials science workflows with pymatgen, FireWorks, and atomate
 

Viewers also liked

Scientometrics
ScientometricsScientometrics
Scientometrics
Tanu Malik
 
Cara membuat Tahu Teote Kelompok Mawar
Cara membuat Tahu Teote Kelompok MawarCara membuat Tahu Teote Kelompok Mawar
Cara membuat Tahu Teote Kelompok Mawar
Hanna Chan
 
Masetero haz profundo
Masetero haz profundoMasetero haz profundo
Masetero haz profundo
SE Bas
 
Case RH e ERP Grupo Morena Rosa - Vetorh® e Sapiens®
Case RH e ERP Grupo Morena Rosa - Vetorh® e Sapiens®Case RH e ERP Grupo Morena Rosa - Vetorh® e Sapiens®
Case RH e ERP Grupo Morena Rosa - Vetorh® e Sapiens®
Senior Sistemas
 
Paradigmas emergentes
Paradigmas emergentesParadigmas emergentes
Paradigmas emergentes
Oscar Serrano
 
Las palmas
Las palmasLas palmas
Las palmas
denissestfy
 
Tamer Samir Abbas 2-3-2016
Tamer Samir Abbas 2-3-2016Tamer Samir Abbas 2-3-2016
Tamer Samir Abbas 2-3-2016
Tamer Samir
 
ACE
ACEACE
Blog
BlogBlog
An Economic Situation Update with Bruce Yandle
An Economic Situation Update with Bruce YandleAn Economic Situation Update with Bruce Yandle
An Economic Situation Update with Bruce Yandle
Mercatus Center
 
Case RH e SEGURANÇA Fiat - Vetorh® Ronda® Senior
Case RH e SEGURANÇA Fiat - Vetorh® Ronda® SeniorCase RH e SEGURANÇA Fiat - Vetorh® Ronda® Senior
Case RH e SEGURANÇA Fiat - Vetorh® Ronda® Senior
Senior Sistemas
 
Social entrepreneurship : health
Social entrepreneurship : healthSocial entrepreneurship : health
Social entrepreneurship : health
Melvin Tay Jun Wei
 
Auditing and Maintaining Provenance in Software Packages
Auditing and Maintaining Provenance in Software PackagesAuditing and Maintaining Provenance in Software Packages
Auditing and Maintaining Provenance in Software Packages
Tanu Malik
 
Designing JEE Application Structure
Designing JEE Application StructureDesigning JEE Application Structure
Designing JEE Application Structure
odedns
 
Future Technology Trends and Predictions
Future Technology Trends and PredictionsFuture Technology Trends and Predictions
Future Technology Trends and Predictions
Eric Hansen, ISA
 
Metabolismo bacteriano
Metabolismo bacteriano Metabolismo bacteriano
Metabolismo bacteriano
Altagracia Diaz
 

Viewers also liked (16)

Scientometrics
ScientometricsScientometrics
Scientometrics
 
Cara membuat Tahu Teote Kelompok Mawar
Cara membuat Tahu Teote Kelompok MawarCara membuat Tahu Teote Kelompok Mawar
Cara membuat Tahu Teote Kelompok Mawar
 
Masetero haz profundo
Masetero haz profundoMasetero haz profundo
Masetero haz profundo
 
Case RH e ERP Grupo Morena Rosa - Vetorh® e Sapiens®
Case RH e ERP Grupo Morena Rosa - Vetorh® e Sapiens®Case RH e ERP Grupo Morena Rosa - Vetorh® e Sapiens®
Case RH e ERP Grupo Morena Rosa - Vetorh® e Sapiens®
 
Paradigmas emergentes
Paradigmas emergentesParadigmas emergentes
Paradigmas emergentes
 
Las palmas
Las palmasLas palmas
Las palmas
 
Tamer Samir Abbas 2-3-2016
Tamer Samir Abbas 2-3-2016Tamer Samir Abbas 2-3-2016
Tamer Samir Abbas 2-3-2016
 
ACE
ACEACE
ACE
 
Blog
BlogBlog
Blog
 
An Economic Situation Update with Bruce Yandle
An Economic Situation Update with Bruce YandleAn Economic Situation Update with Bruce Yandle
An Economic Situation Update with Bruce Yandle
 
Case RH e SEGURANÇA Fiat - Vetorh® Ronda® Senior
Case RH e SEGURANÇA Fiat - Vetorh® Ronda® SeniorCase RH e SEGURANÇA Fiat - Vetorh® Ronda® Senior
Case RH e SEGURANÇA Fiat - Vetorh® Ronda® Senior
 
Social entrepreneurship : health
Social entrepreneurship : healthSocial entrepreneurship : health
Social entrepreneurship : health
 
Auditing and Maintaining Provenance in Software Packages
Auditing and Maintaining Provenance in Software PackagesAuditing and Maintaining Provenance in Software Packages
Auditing and Maintaining Provenance in Software Packages
 
Designing JEE Application Structure
Designing JEE Application StructureDesigning JEE Application Structure
Designing JEE Application Structure
 
Future Technology Trends and Predictions
Future Technology Trends and PredictionsFuture Technology Trends and Predictions
Future Technology Trends and Predictions
 
Metabolismo bacteriano
Metabolismo bacteriano Metabolismo bacteriano
Metabolismo bacteriano
 

Similar to GEN: A Database Interface Generator for HPC Programs

Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...
Rusif Eyvazli
 
Map reduce advantages over parallel databases report
Map reduce advantages over parallel databases reportMap reduce advantages over parallel databases report
Map reduce advantages over parallel databases report
Ahmad El Tawil
 
Software tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningSoftware tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data mining
Anubhav Jain
 
123448572 all-in-one-informatica
123448572 all-in-one-informatica123448572 all-in-one-informatica
123448572 all-in-one-informatica
homeworkping9
 
Hadoop
HadoopHadoop
Handout3o
Handout3oHandout3o
Handout3o
Shahbaz Sidhu
 
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of UsPossible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us
Bertram Ludäscher
 
IRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET- Review of Existing Methods in K-Means Clustering AlgorithmIRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET Journal
 
Clustering_Algorithm_DR
Clustering_Algorithm_DRClustering_Algorithm_DR
Clustering_Algorithm_DR
Nguyen Tran
 
ICDE 2015 - LDV: Light-weight Database Virtualization
ICDE 2015 - LDV: Light-weight Database VirtualizationICDE 2015 - LDV: Light-weight Database Virtualization
ICDE 2015 - LDV: Light-weight Database Virtualization
Boris Glavic
 
Poster (1)
Poster (1)Poster (1)
Poster (1)
Daniel Osei
 
Web Oriented FIM for large scale dataset using Hadoop
Web Oriented FIM for large scale dataset using HadoopWeb Oriented FIM for large scale dataset using Hadoop
Web Oriented FIM for large scale dataset using Hadoop
dbpublications
 
Hadoop training-in-hyderabad
Hadoop training-in-hyderabadHadoop training-in-hyderabad
Hadoop training-in-hyderabad
sreehari orienit
 
Getting Started on Hadoop
Getting Started on HadoopGetting Started on Hadoop
Getting Started on Hadoop
Paco Nathan
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pig
Ricardo Varela
 
Open-source from/in the enterprise: the RDKit
Open-source from/in the enterprise: the RDKitOpen-source from/in the enterprise: the RDKit
Open-source from/in the enterprise: the RDKit
Greg Landrum
 
Model checking
Model checkingModel checking
Model checking
Richard Ashworth
 
[IJET V2I2P18] Authors: Roopa G Yeklaspur, Dr.Yerriswamy.T
[IJET V2I2P18] Authors: Roopa G Yeklaspur, Dr.Yerriswamy.T[IJET V2I2P18] Authors: Roopa G Yeklaspur, Dr.Yerriswamy.T
[IJET V2I2P18] Authors: Roopa G Yeklaspur, Dr.Yerriswamy.T
IJET - International Journal of Engineering and Techniques
 
A Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate DataA Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate Data
Robert Grossman
 
Efficient Log Management using Oozie, Parquet and Hive
Efficient Log Management using Oozie, Parquet and HiveEfficient Log Management using Oozie, Parquet and Hive
Efficient Log Management using Oozie, Parquet and Hive
Gopi Krishnan Nambiar
 

Similar to GEN: A Database Interface Generator for HPC Programs (20)

Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...
 
Map reduce advantages over parallel databases report
Map reduce advantages over parallel databases reportMap reduce advantages over parallel databases report
Map reduce advantages over parallel databases report
 
Software tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningSoftware tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data mining
 
123448572 all-in-one-informatica
123448572 all-in-one-informatica123448572 all-in-one-informatica
123448572 all-in-one-informatica
 
Hadoop
HadoopHadoop
Hadoop
 
Handout3o
Handout3oHandout3o
Handout3o
 
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of UsPossible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us
 
IRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET- Review of Existing Methods in K-Means Clustering AlgorithmIRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET- Review of Existing Methods in K-Means Clustering Algorithm
 
Clustering_Algorithm_DR
Clustering_Algorithm_DRClustering_Algorithm_DR
Clustering_Algorithm_DR
 
ICDE 2015 - LDV: Light-weight Database Virtualization
ICDE 2015 - LDV: Light-weight Database VirtualizationICDE 2015 - LDV: Light-weight Database Virtualization
ICDE 2015 - LDV: Light-weight Database Virtualization
 
Poster (1)
Poster (1)Poster (1)
Poster (1)
 
Web Oriented FIM for large scale dataset using Hadoop
Web Oriented FIM for large scale dataset using HadoopWeb Oriented FIM for large scale dataset using Hadoop
Web Oriented FIM for large scale dataset using Hadoop
 
Hadoop training-in-hyderabad
Hadoop training-in-hyderabadHadoop training-in-hyderabad
Hadoop training-in-hyderabad
 
Getting Started on Hadoop
Getting Started on HadoopGetting Started on Hadoop
Getting Started on Hadoop
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pig
 
Open-source from/in the enterprise: the RDKit
Open-source from/in the enterprise: the RDKitOpen-source from/in the enterprise: the RDKit
Open-source from/in the enterprise: the RDKit
 
Model checking
Model checkingModel checking
Model checking
 
[IJET V2I2P18] Authors: Roopa G Yeklaspur, Dr.Yerriswamy.T
[IJET V2I2P18] Authors: Roopa G Yeklaspur, Dr.Yerriswamy.T[IJET V2I2P18] Authors: Roopa G Yeklaspur, Dr.Yerriswamy.T
[IJET V2I2P18] Authors: Roopa G Yeklaspur, Dr.Yerriswamy.T
 
A Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate DataA Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate Data
 
Efficient Log Management using Oozie, Parquet and Hive
Efficient Log Management using Oozie, Parquet and HiveEfficient Log Management using Oozie, Parquet and Hive
Efficient Log Management using Oozie, Parquet and Hive
 

Recently uploaded

Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
panagenda
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
名前 です男
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
Pixlogix Infotech
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
Claudio Di Ciccio
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
kumardaparthi1024
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
Zilliz
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
Neo4j
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
Neo4j
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Speck&Tech
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
IndexBug
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
shyamraj55
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Safe Software
 

Recently uploaded (20)

Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
 

GEN: A Database Interface Generator for HPC Programs

  • 1. GEN:A Database Interface Generator for HPC Programs Quan Pham,Tanu Malik Computation Institute University of Chicago and Argonne National Laboratory tanum@uchicago.edu Monday, June 29, 15
  • 2. Computational Science Challenges • Science is iterative and data-intensive • Simulate-Analyze-Simulate • Supercomputing time is precious • So reduce time for data analysis • Challenges • Slow IO and network bandwidth • How to remove redundant and IO- intensive steps? Monday, June 29, 15
  • 3. Computational Cosmology (current state-of-art) • Data sizes reduce as analysis proceeds • Simulation size is few Petabytes • Analysis data may range from TBs to 10s of GBs Simulations Scripts Visualization Database Analyze Halo/sub-halo Identification Halo Merger Trees Identifying GalaxiesPVFS FS Simulate Madduri, Malik, Habib, et. al. PDACS: A Portal for Data Analysis Services for Cosmological Simulations. In: ACM Computing In Science and Engineering, 2015 Monday, June 29, 15
  • 4. Computational Cosmology (envisioned state-of-art) • Database-friendly analysis within the DB • Mechanisms for direct IO to databases Simulations Scripts Visualization Database Analyze Halo/sub-halo Identification Halo Merger Trees Identifying GalaxiesPVFS Simulate Monday, June 29, 15
  • 5. Related Work • The SciHadoop System • Move data between file-systems (from PVFS to HDFS) • Simple sub-selection and aggregation queries • Querying using HPC libraries • Transform sub-selection and aggregation SQL queries into parallel IO operations using parallel-NetCDF • Requires user to perform parallel data management • Database-as-a-service • RESTful service approach good for small amount of data Monday, June 29, 15
  • 6. GEN:A Database Interface Generator • Takes user-supplied C declarations and provides an interface to load into and access data from common scientific array databases Monday, June 29, 15
  • 7. GEN Overview MPI_Init ( &argc, &argv ); MPI_Comm_rank ( MPI_COMM_WORLD, &id ); MPI_Comm_size ( MPI_COMM_WORLD, &p ); if ( rank == 0 ) { x_file = fopen ( "x_data.txt", "w" ); for ( j = 0; j < n; j++ ) { for ( i = 0; i < m; i++ ) fprintf ( output, " %24.16g", table[i+j*m] ); fprintf ( output, "n" ); } fclose ( x_file ); } Figure 2: A sample program I/O MPI_Init ( &argc, &argv ); MPI_Comm_rank ( MPI_COMM_WORLD, &id ); MPI_Comm_size ( MPI_COMM_WORLD, &p ); if ( rank == 0 ) (ld.so) to load the specified shared libraries. Symbol defin tions in LD_PRELOADed libraries will be found before defin tions in non-LD_PRELOADed libraries, so if an LD_PRELOADe library contains a definition of, say, fopen, then that cod will be executed when the process references fopen. In pa ticular, this means that we can package our wrapper fun tions into a shared library (.so file) and use LD_PRELOAD t cause them to be loaded. The LD_PRELOAD directive helps to direct system calls t GEN’s wrapper functions, but at run-time GEN’s wrapper fun tions still does not have the handle to the original syste call functions, the arguments of which are necessary to cr ate corresponding database DDL statements in the wrapp function and then execute them. 3.2 Creating Database DDL statements To create database DDL statements we need the handle t the original system call function and use the parameters the original system call function create appropriate databa DDL statements. On Unix platforms, the dlsym functio dlsym(void* handle, char* symbol) returns the addre of the symbol given as its second argument, searching the dy namic or shared library accessible through the handle give as its first argument. Thus, assuming the only definitions fopen are in our shared library and in the system C librar fprintf ( output, " %24.16g", table[i+j*m] ); fprintf ( output, "n" ); } fclose ( x_file ); } Figure 2: A sample program I/O MPI_Init ( &argc, &argv ); MPI_Comm_rank ( MPI_COMM_WORLD, &id ); MPI_Comm_size ( MPI_COMM_WORLD, &p ); if ( rank == 0 ) { CREATE TABLE "x_data.txt" for ( j = 0; j < n; j++ ) { for ( i = 0; i < m; i++ ) INSERT into "x_data.txt" VALUES (table[i+j*m]) fprintf ( output, "n" ); } fclose ( x_file ); } Figure 3: Incorrect Database Interface of Sample Program To generate the correct interface, GEN relies on (a) run- time wrapping of POSIX I/O calls, (b) run-time bookkeep- ing information for creation of database DDL statements,Monday, June 29, 15
  • 8. Issues • IO patterns • N-1 non-strided or N-1 strided or N-N • Scattered IO calls • No direct mapping between POSIX calls and DB statements • Direct C POSIX calls to DB statements • Constraints: • No source code changes Monday, June 29, 15
  • 9. GEN • Assumptions • Serial IO for now • IO done contiguously within the context of a single function Monday, June 29, 15
  • 10. GEN-1 • Map system calls to DDL statements without changing source code • LD_PRELOAD: load a shared library • Use LD_PRELOAD to force load a library that has GEN’s implementation of fopen(), fwrite(), fread() fopen("x_data.txt", w") Create <Database Object> x_data.txt.tmp <structure> Monday, June 29, 15
  • 11. GEN-1I • Schema-later; no structure initially, reorganize later fopen("x_data.txt", w") Create Table x_data.txt.tmp <i int; x double>; Create Array x_data.txt.tmp <x:double> [i=0:*] • OS-DB impedance mismatch • Not every fwrite() translates to INSERT statement • fwrites are buffered before being sent to filesystem • A single statement when data is being flushed Monday, June 29, 15
  • 12. Knowing the Structure • User-defined • Header files • Allow some system calls to create sample files Monday, June 29, 15
  • 13. Experiments on Merger Trees • Building merger- trees in DB system • Built from halo particles, i.e. particles in a given halo • Several time-steps • Perform a particle merge-join between halos of timestep ti-1 and ti • Lineage queries to answer ancestors and progenitors • MPI program • Streams data to a single node, which streams to GEN • Postgres and SciDB interfaces Monday, June 29, 15
  • 14. Experimental Setup • Time for reorganizing the data • Reorganizing is cheaper in the DB than outside n, such as e), 1, fout); re to which data belongs to, aration specified in header tributes and type informa- e flattened out as attributes ase, and as attributes of a ray database. Thus nested ined by a foreign key. Using bookkeeping we can create ntf statements and sprintf ctly From Program to responding INSERT/LOAD he operating system-database ere in a fwrite() in a C pro- INSERT/LOAD statement of system libraries in which ng to the OS kernel and the treaming writes to improve in this performance advan- atement for the fwrite, but nto the temporary table of to write. Thus for the pre- variation of 0.02. The redirection time is not influenced by increasing the number of elements to write in the program, since the libraries are already loaded in memory. The varia- tion comes if we execute other programs, which can o✏oad the library from memory and it needs to reloaded again. In the second experiment, we measured the space vs time overhead of reorganizing the data. In GEN we have not saved any space. Without GEN the binary data is written out on files and then within the database. With GEN the data is written out to temporary tables within the database and then reorganized within the database. However, as Table 1 shows that there is significant savings in terms of reorganiz- ing the data within the database versus reorganizing prior to loading the database. Table shows the time it takes to load increasingly larger sized datasets into a relational and an array database vs how much time it takes to load the same data from a temporary table into a formatted table plus the time of loading the initial binary data into a temporary table. Table 1: Reorganization times Data Size Postgres SciDB (in GB) Load from Files Reorganize within DB Load from Files Reorganize within DB 1 2.7 1.2 8.4 3.2 5 6.8 2.5 14.3 7.6 10 10.56 4.7 22.6 8.9 • Overhead of redirection • Dynamically-linked library may get off- loaded • 0.04 ms + 0.02ms- Monday, June 29, 15
  • 15. Conclusions and Current Work • A database generator library for data- intensive scientific applications that requires no source code changes, which loads to be reorganized later. • Relax the assumptions • Loading is still an issue • Source-code release Monday, June 29, 15
  • 16. • ThankYou! • Acknowledgements • Salman Habib, Katrin Heitmann, Steve Rangel, Hal Finkel, Ravi Madduri Monday, June 29, 15