The document describes user defined functions for performing interpolation on environmental sensor data. It discusses several approaches to implementing the interpolation functions: cursor-based, data-structure-based, writing to a file, table valued functions, and chunk processing. It also describes how SQL queries referencing the grid are transformed by parsing the queries and replacing the grid table with a call to a user defined interpolation function. Performance tests on the different approaches show that the non-CLR and file-based methods perform best for large datasets.
1. User defined functions for performing interpolation
1. Introduction
Interpolation, or kriging, is the process of estimating sensor data values at all points of a grid. Environmental scientists record data values for each timestamp at the points where the sensors are located. Using these values, we estimate the data values at all points of a specified grid. A grid consists of a two-dimensional set of points at equal intervals on a map. Such a set of points helps capture the environmental state recorded in the database at a particular timestamp.
A grid is a set of equally spaced two-dimensional points; the distance between consecutive points is determined by the granularity of the grid. It serves as a two-dimensional evaluation space: the value of a particular measurement is calculated at each point of the grid, and together these values represent the environmental state at a particular timestamp.
2. System Architecture
2.1. Data smoothing
Smoothing is done as SQL queries inside the database.
The data first needs to be smoothed. The smoothing is performed as follows:
1) A reference timestamp is chosen. The difference between each data timestamp and the reference timestamp is computed in minutes and divided by n (supposing the data needs to be smoothed into n-minute intervals); each timestamp is assigned to its interval by taking the floor of that division.
For example, consider a data timestamp of 11:35 PM, 31-08-2007, with smoothing in 30-minute intervals. The time difference between this timestamp and the reference timestamp is calculated in minutes and divided by 30; the floor of the division is multiplied by 30, and the result is added back to the reference timestamp.
This calculation returns the bucket for the timestamp, which is 11:30 PM, 31-08-2007. All timestamps lying between 11:30 PM and 12:00 AM of 31-08-2007 fall into this bucket. The average of the data values of all such rows is taken as the cumulative value at this timestamp (11:30 PM, 31-08-2007).
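This bucket arithmetic reduces to a single floor computation. A minimal sketch in C# (illustrative only; in the actual system the same arithmetic is expressed as a SQL query inside the database):

    using System;

    class Smoothing
    {
        // Assign a timestamp to its n-minute bucket relative to a reference timestamp.
        // E.g. with a reference on the hour and n = 30, 23:35 falls into the 23:30 bucket.
        static DateTime Bucket(DateTime ts, DateTime reference, int n)
        {
            double minutes = (ts - reference).TotalMinutes;
            return reference.AddMinutes(Math.Floor(minutes / n) * n);
        }
    }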
Once the data has been smoothed with reference to the given time interval, it is used as input to the user defined functions described below.
2.2. User defined functions
General working:
The user defined functions first take as input the maximum and minimum values of x and y, which define the grid for which the environmental state needs to be calculated. For each grid point, the effect of all the sensors that measure the input measurement must be found.
To measure the effect of each such sensor on a given grid point, we take two things into consideration:
1) The distance of the sensor from the given grid point.
2) The value of the measurement at that sensor (note that all values mentioned here are with respect to a particular timestamp).
The effect of each sensor on the value at a grid point is:
1) Inversely proportional to the distance of the sensor from the grid point.
2) Directly proportional to the value at the sensor.
Hence, the equation that we use is:

CE = (V1/d1 + V2/d2 + ... + Vn/dn) / (1/d1 + 1/d2 + ... + 1/dn)

where CE represents the cumulative effect on a single grid point, V1,...,Vn represent the values at sensor points 1..n, and d1,...,dn represent the distances between the sensor points and the grid point.
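A short sketch of this per-grid-point computation in C# (names are illustrative; the arrays hold one timestamp's smoothed sensor values and their distances to the grid point):

    class Interpolation
    {
        // Inverse-distance-weighted cumulative effect at one grid point.
        static double CumulativeEffect(double[] v, double[] d)
        {
            double num = 0, den = 0;
            for (int i = 0; i < v.Length; i++)
            {
                if (d[i] == 0) return v[i];  // grid point coincides with a sensor
                num += v[i] / d[i];          // directly proportional to the sensor value
                den += 1.0 / d[i];           // inversely proportional to the distance
            }
            return num / den;
        }
    }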
Therefore, for each grid point, the sensors measuring the given measurement are taken, and their data is smoothed as described above according to the specified interval. The smoothed data is then interpolated using the formula above, once for every interval.
SQL query generation: Queries are issued by the user, who is oblivious to the processing that takes place with respect to the query.
Parsing and transforming SQL queries: The queries issued by the user are parsed and a query tree is formed. The arguments for the user defined functions are extracted from this parse tree and passed to the functions, which replace the grid wherever it is used in the query.
Cursor based approach:
This is the slowest user defined function in terms of performance.
Input: maximum and minimum co-ordinates of the grid and a measurement id.
Output: the interpolated environmental state.
A cursor returns the grid points lying between the maximum and minimum points given as input. The cursor then passes each row of its result set to point-influence, a function which calculates the influence of the sensors at each point for the given measurement. Inside this function, a second cursor returns the smoothed data for the particular measurement, ordered by time. For each timestamp, all the sensor data is plugged into the formula above to calculate the influence at the given grid point, and this is repeated for every timestamp.
This user defined function performs slowly because a cursor reads each value from the database one after the other; there is no buffer holding a group of rows from which we can read cheaply. As each call has to do database I/O, this function is the slowest of all.
Data structure based approach:
It is possible to write CLR enabled functions in .NET languages to perform the interpolation. CLR enabled functions query the database and obtain a data reader, which acts as a buffer for the queried data: each value can be read from it one at a time with no per-row database I/O overhead.
But there is a drawback to this approach. The environmental data, which spans sensors all over Switzerland and three years of time, is huge, amounting to several terabytes. Storing a subset of such data in in-memory data structures is not advisable: as the grid size grows with finer granularity, the data structures might not be able to hold such a magnitude of data. A substitute for the case where this occurs is discussed in the next function.
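A minimal sketch of the data reader approach in C# (the table and column names are assumptions, not the actual SwissEx schema):

    using System;
    using System.Data.SqlClient;

    class ReaderApproach
    {
        static void Run(string connectionString, int measurementId)
        {
            using (var conn = new SqlConnection(connectionString))
            {
                conn.Open();
                var cmd = new SqlCommand(
                    "SELECT s_id, ts, value FROM smoothed_data WHERE m_id = @m ORDER BY ts", conn);
                cmd.Parameters.AddWithValue("@m", measurementId);
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())  // reads come from the reader's buffer, not one I/O per row
                    {
                        int sensorId = reader.GetInt32(0);
                        DateTime ts = reader.GetDateTime(1);
                        double value = reader.GetDouble(2);
                        // accumulate per-timestamp (value, distance) pairs for the formula above
                    }
                }
            }
        }
    }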
Write-file function:
In this function, we query the database for the relevant grid points and the results are returned through a data reader. Along with this, the data values for the particular measurement are also queried. These two data readers are enough to calculate the interpolation values at a particular timestamp; the second data reader is then queried for the next timestamp to calculate the grid values at that timestamp, and so on.
In MS SQL Server 2005, to have two result sets open at a time you need to set the MARS (Multiple Active Result Sets) option to true. This option is not available for the SQL Server 2005 native client, and SQL Server 2005 is the version used in SwissEx.
The workaround is to query the database for the first timestamp and store the results in a local file, which is equivalent to caching the results in a data reader. Therefore we write the results for the first timestamp to a file as comma separated values (CSV) and read the file back to calculate the interpolation. The results for the next timestamp are then written over the file, and so on.
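A sketch of this workaround in C# (the file name and column layout are illustrative): one timestamp's rows are dumped to a CSV file, the reader is closed, and the file is re-read while the other query runs.

    using System.Data.SqlClient;
    using System.IO;

    class CsvCache
    {
        // Dump the current timestamp's (sensor id, value) rows to a CSV file,
        // overwriting whatever the previous timestamp left there.
        static void Write(SqlDataReader reader, string path)
        {
            using (var w = new StreamWriter(path))
                while (reader.Read())
                    w.WriteLine(reader.GetInt32(0) + "," + reader.GetDouble(1));
        }

        // Re-read the cached rows instead of holding a second open result set.
        static void Read(string path)
        {
            foreach (string line in File.ReadLines(path))
            {
                string[] parts = line.Split(',');
                int sensorId = int.Parse(parts[0]);
                double value = double.Parse(parts[1]);
                // feed (sensorId, value) into the interpolation formula
            }
        }
    }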
Table valued function:
This is the SQL version of the user defined functions presented so far: a single SQL statement performs the interpolation. This statement is parsed by the server itself, which then produces the QEP (query execution plan). This plan is not optimal, hence the execution is slower than the write-file method.
Non-CLR function:
This function queries the database over a SQL connection and retrieves result set rows one after the other. The advantage of this function is that we can have multiple active result sets open; the drawback is more I/O, as results are retrieved from the database row by row. Thanks to the multiple active result sets, this function performs the best among all the stored procedures so far.
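With a client that supports it, MARS is enabled through a connection string option; a sketch (server and database names are placeholders):

    using System.Data.SqlClient;

    class MarsDemo
    {
        static void Run()
        {
            using (var conn = new SqlConnection(
                "Data Source=myServer;Initial Catalog=myDb;Integrated Security=true;" +
                "MultipleActiveResultSets=True"))
            {
                conn.Open();
                // Two commands can now hold open readers on the same connection:
                // one streaming the grid points, the other the smoothed sensor data.
            }
        }
    }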
Chunk Processing function:
The processing is done completely outside the database, in chunks of uniform size instead of chunks that depend on the data for a particular timestamp. The performance is slightly worse than the CSV file based function. This may be because, although the disk I/O has decreased, processing in chunks of tuples rather than chunks of timestamps forces the program to store previous data in a hash map, so that when the next chunk of data belonging to the same timestamp is processed, the previous data can be reused.
The algorithm uses the size of the data or the value of the timestamp as a stopping point, whichever comes first. So if we process in chunks of 20 tuples and the data contains 42 tuples, it gets processed in sets of 20, 20, and 2 tuples. There is therefore not only the burden of carrying data over from previous chunks, but also a waste of space (and hence time) while processing the remaining 2 tuples.
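A sketch of the chunked accumulation in C# (the Reading type, the .NET 6 Chunk helper, and the chunk size are assumptions): partial sums are kept in dictionaries keyed by timestamp, so a timestamp split across chunk boundaries is still summed correctly.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    record Reading(DateTime Ts, double Value, double Distance);

    class ChunkProcessing
    {
        static Dictionary<DateTime, double> Process(IEnumerable<Reading> rows, int chunkSize)
        {
            var num = new Dictionary<DateTime, double>();  // running numerator per timestamp
            var den = new Dictionary<DateTime, double>();  // running denominator per timestamp
            foreach (Reading[] chunk in rows.Chunk(chunkSize))  // e.g. 42 rows -> 20, 20, 2
                foreach (var r in chunk)
                {
                    num[r.Ts] = num.GetValueOrDefault(r.Ts) + r.Value / r.Distance;
                    den[r.Ts] = den.GetValueOrDefault(r.Ts) + 1.0 / r.Distance;
                }
            return num.Keys.ToDictionary(ts => ts, ts => num[ts] / den[ts]);
        }
    }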
Grid size   Gridcalculate     ShowGrid             WriteFile             Non-CLR              Chunk processing
            (cursor based)    (join based)         (CSV file based)
1,000       34 min            4 min 15 sec         1 min 56 sec          1 min 32 sec         2 min 29 sec
10,000      NA                53 min 41 sec        19 min 33 sec         15 min 40 sec        25 min 14 sec
50,000      NA                4 hr 21 min 50 sec   1 hr 39 min 25 sec    1 hr 6 min 38 sec    2 hr 15 min 29 sec
100,000     NA                NA                   3 hr 16 min 19 sec    2 hr 19 min 31 sec   4 hr 29 min

Table 2: Performance comparison of the user defined functions.
Note: The experiment was carried out on an octa-core server.
2.3. Query Transformation
SQL queries are issued on the grid data assuming that the interpolation is already done. We need to parse the SQL statements, perform the interpolation, and substitute the results wherever the grid table is used.
For example:
select * from grid, sensor where xval between 1 and 10 and yval between 1 and 10 and s_id=1
For this purpose we require a transformation tool which transforms the SQL query, parsing the where clause for arguments and replacing the grid table with the user defined function, taking the parsed arguments as input.
JavaCC is the most popular parser generator for the Java platform. As it concentrates on parser generation in only one language, it produces fewer errors. It is easy to use, contains functions to auto-document the parser and to build the AST (Abstract Syntax Tree), and provides hooks to dump the AST and to perform a specific action when specific nodes are encountered, which is exactly what we want.
JavaCC transforms the SQL query into an AST with a set of nodes. We parse the node formed by the where clause and extract the arguments for the user defined function from it; the rest of the where clause is output as is. We then locate the node carrying the table name grid and replace it with the user defined function, taking the parsed arguments as input.
Output: select * from showgrid(1,1,10,10,1),sensor