Importing data into a database table from input data file(s)
Debabrata Mondal
April 30, 2012
Abstract
Background: Biological data sets are often very large; file sizes from a few megabytes to a few gigabytes are normal. Performing biological data analysis on such large data sets, even with Microsoft Excel 2007, often becomes a headache for many researchers in the field of Biotechnology. Typical scenarios such as the machine hanging due to memory overload or the application crashing result in time-wasting repeat work.
Aim: We set out to solve this problem so that our researchers can work comfortably. Our conclusion: why not design an application like Microsoft Excel 2007, enriched with the features our researchers need for biological and statistical analysis of biological data sets? This is clearly not an impossible task, but how should the data be managed for analysis while avoiding memory overload and application crashes? The idea is to store the data in a database and let the application fetch the required subset and serve it to the user at run time, like an on-demand service. That is easy to say, but it demands an efficient and stable application design! We also need to remember that biological raw data is generated in different specified formats in different wet labs, using different advanced, technology-driven machines. So let us consider how such a design is possible and implement application software along the lines of Microsoft Excel 2007.
Method: The Graphical User Interface for the application can be designed and implemented in the usual way on any open-source platform. The key problem is data handling. First, to convert raw data to a common standard format, we keep a module that dynamically configures the input raw data format. Next, the application needs a set of well-customized modules that accept input data files, pass them through a pre-processing phase to suppress any unwanted information in the data file, and make the data ready to import into the database.
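As an illustration, a minimal sketch of such a configurable pre-processing step is shown below; the structure, field names and default values (tab delimiter, "#" comment prefix) are assumptions made for the sketch, not the tool's actual format definitions.

```cpp
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical description of one raw input format; real formats would be
// configured dynamically through the format-configuration module.
struct FormatConfig {
    char delimiter = '\t';           // field separator in the raw file
    std::string commentPrefix = "#"; // lines starting with this are dropped
};

// Read one raw data file, suppress unwanted lines, and split the rest into
// fields so the data is ready to import into the database.
std::vector<std::vector<std::string>>
preprocess(const std::string& path, const FormatConfig& cfg)
{
    std::vector<std::vector<std::string>> rows;
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty() || line.rfind(cfg.commentPrefix, 0) == 0)
            continue;                            // unwanted information
        std::vector<std::string> fields;
        std::stringstream ss(line);
        std::string field;
        while (std::getline(ss, field, cfg.delimiter))
            fields.push_back(field);
        rows.push_back(fields);
    }
    return rows;
}
```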
The data importing mechanism can be described in a simple logical manner: consider a small or large data set as a rectangular sheet of paper that should logically be stored in a single database table. First of all, this data set may come in a single data file or in multiple data files, which is like cutting the rectangular sheet into rectangular pieces, and the files may not all be supplied at the same time! The application must have a special engine that can import data supplied in either way. Most importantly, suppose a data file is supplied whose fields (columns) are in the sequence 1093, 1094, 1095, 1096, 1097, 1098, 1099 and must be imported into the (logically single) table, and these columns are new. Next time, a data file is supplied whose columns are in the sequence 1096, 1093, 1099, 1094, 1097 but whose rows are new, and this data must be imported into the same table. The engine must rearrange the data column-wise before importing it, for example as sketched below. So far we have dealt with the input data file(s).
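A sketch of how this column rearrangement could work; the function and parameter names are hypothetical and not taken from DBSERVER itself.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Reorder the fields of an incoming row so they follow the column order of
// the logical table. 'fileHeader' is the column sequence found in the data
// file (e.g. 1096, 1093, 1099, ...); 'tableHeader' is the order already used
// by the table. Columns missing from the file are filled with an empty cell.
std::vector<std::string>
alignToTable(const std::vector<std::string>& fileHeader,
             const std::vector<std::string>& tableHeader,
             const std::vector<std::string>& fileRow)
{
    std::unordered_map<std::string, std::size_t> pos;
    for (std::size_t i = 0; i < fileHeader.size(); ++i)
        pos[fileHeader[i]] = i;

    std::vector<std::string> aligned;
    aligned.reserve(tableHeader.size());
    for (const std::string& col : tableHeader) {
        auto it = pos.find(col);
        aligned.push_back(it != pos.end() ? fileRow[it->second]
                                          : std::string());
    }
    return aligned;   // ready to append to the logical table's rows
}
```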
Now it is time for the logically single table in the database. Any DBMS has a maximum number of columns it can store per table; there is no need to worry about a row limit. So the engine must have a module for automatic run-time data clustering, allowing the logically single table to be stored in the database as multiple tables, each bounded by the maximum column limit. In the reverse case, when this data is fetched (queries can be classified as row selection, wildcard search, field selection, comparison search, aggregate search, list search, etc.), the fetch operation must still treat the data set as a single logical table, and the same must hold for all data manipulation operations on the table. At this stage the design lets the user view and edit data more or less as in Microsoft Excel 2007.
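A sketch of the column-clustering idea, assuming a fixed per-table column limit and a simple part-naming scheme; both are illustrative choices, not the actual DBMS limit or DBSERVER's naming.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Assumed per-table column limit; the real value depends on the DBMS.
constexpr std::size_t kMaxColumnsPerTable = 1000;

// Map a logical column index to (physical table part, column inside that
// part). The fetch layer applies the same mapping in reverse, so queries and
// edits still see one logical table.
std::pair<std::size_t, std::size_t> locate(std::size_t logicalColumn)
{
    return { logicalColumn / kMaxColumnsPerTable,
             logicalColumn % kMaxColumnsPerTable };
}

// Name of the physical table holding one part, e.g. "snp_data_part3"
// (the naming scheme is only illustrative).
std::string physicalTableName(const std::string& logicalTable,
                              std::size_t part)
{
    return logicalTable + "_part" + std::to_string(part);
}
```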
Some more generic features can be added: (A) implementation of almost all required data definition queries, enabling columns to be added to or removed from a table within the column limit; this is required for storing result data generated by applying biological and statistical analysis to the experimental data. (B) Transposition of very large tabular data files in a reasonably short time. (C) Unlimited undo/redo for update and delete operations on the table data. (D) Some standard analyses such as genotype count, allele frequency, HWE and cryptic relationship can be run on the queried data, and the results stored in the database or exported to disk; a minimal sketch of such a per-marker summary follows.
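The sketch below computes genotype counts and allele frequencies for one marker, assuming biallelic genotypes coded as two-character strings such as "AA", "AG", "GG"; this coding and the function name are assumptions, not the tool's actual analysis routine.

```cpp
#include <map>
#include <string>
#include <vector>

// Genotype counts and allele frequencies for one biallelic marker.
void markerSummary(const std::vector<std::string>& genotypes,
                   std::map<std::string, int>& genotypeCount,
                   std::map<char, double>& alleleFrequency)
{
    std::map<char, int> alleleCount;
    for (const std::string& g : genotypes) {
        if (g.size() != 2)
            continue;                    // skip missing or malformed calls
        ++genotypeCount[g];
        ++alleleCount[g[0]];
        ++alleleCount[g[1]];
    }
    int total = 0;
    for (const auto& a : alleleCount)
        total += a.second;
    for (const auto& a : alleleCount)
        alleleFrequency[a.first] =
            total ? static_cast<double>(a.second) / total : 0.0;
}
```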
Results & Conclusion: We have already developed an application, the "Database migration and management tool" (short name DBSERVER), version 1.0, in Qt 3.3.5 on Fedora Core 5, which has attained almost all of these targets, and our researchers are using it. There is scope to add and modify features in this application and to make it more robust and user-friendly.
