The document describes user defined functions for performing interpolation on environmental sensor data. It discusses several approaches to implementing the interpolation functions: cursor-based, data-structure-based, writing to a file, table valued functions, and chunk processing. It also describes how SQL queries referencing the grid are transformed by parsing the queries and replacing the grid table with a call to a user defined interpolation function. Performance tests on the different approaches show that the non-CLR and file-based methods perform best for large datasets.
1. User defined functions for performing interpolation
1. Introduction
Interpolation, or kriging, is the process of estimating sensor data values at all points of a grid. Environmental scientists record data values for each timestamp at the points where the sensors are located. Using these values, we estimate the data values at all points of a specified grid. A grid consists of a two-dimensional set of points at equal intervals on a map. Such a set of points helps capture the environmental state recorded in the database at a particular timestamp.
A grid is a set of equally spaced two-dimensional points; the distance between consecutive points is determined by the granularity of the grid. It serves as a two-dimensional evaluation space: the value of a particular measurement is calculated at each point of the grid, and together these values represent the environmental state at a particular timestamp.
2. System Architecture
2.1. Data smoothing
Smoothing is done as SQL queries inside the database.
The data first needs to be smoothed. The smoothing is performed as follows:
1) A reference timestamp is chosen. The difference between each data timestamp and the reference timestamp is computed in minutes and divided by n (supposing the data needs to be smoothed into n-minute intervals); each timestamp is assigned to its interval by taking the floor of that division.
For example, consider a data timestamp of 11:35 PM, 31-08-2007, with smoothing in 30-minute intervals. The time difference between this timestamp and the reference timestamp is calculated in minutes and divided by 30; the floor of the division is multiplied by 30, and the result is added back to the reference timestamp.
This calculation returns the bucket for the timestamp, which is 11:30 PM, 31-08-2007. All timestamps lying between 11:30 PM and 12:00 AM of 31-08-2007 fall into this bucket. The average of the data values of all such rows is taken as the cumulative value at this timestamp (11:30 PM, 31-08-2007).
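This bucket arithmetic reduces to a single floor computation. A minimal sketch in C# (illustrative only; in the actual system the same arithmetic is expressed as a SQL query inside the database):

    using System;

    class Smoothing
    {
        // Assign a timestamp to its n-minute bucket relative to a reference timestamp.
        // E.g. with a reference on the hour and n = 30, 23:35 falls into the 23:30 bucket.
        static DateTime Bucket(DateTime ts, DateTime reference, int n)
        {
            double minutes = (ts - reference).TotalMinutes;
            return reference.AddMinutes(Math.Floor(minutes / n) * n);
        }
    }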
Once the data has been smoothed with reference to the given time interval, it is used as input to the user defined functions described below.
2.2. User defined functions
General working:
The user defined functions first take as input the maximum and minimum values of x and y, which define the grid for which the environmental state needs to be calculated. For each grid point, the effect of all the sensors that measure the input measurement must be found.
To measure the effect of each such sensor on a given grid point, we take two things into consideration:
1) The distance of the sensor from the given grid point.
2) The value of the measurement at that sensor (note that all values mentioned here are with respect to a particular timestamp).
The effect of each sensor on the value at a grid point is:
1) Inversely proportional to the distance of the sensor from the grid point.
2) Directly proportional to the value at the sensor.
Hence, the equation that we use is:

CE = (V1/d1 + V2/d2 + ... + Vn/dn) / (1/d1 + 1/d2 + ... + 1/dn)

where CE represents the cumulative effect on a single grid point, V1,...,Vn represent the values at sensor points 1..n, and d1,...,dn represent the distances between the sensor points and the grid point.
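A short sketch of this per-grid-point computation in C# (names are illustrative; the arrays hold one timestamp's smoothed sensor values and their distances to the grid point):

    class Interpolation
    {
        // Inverse-distance-weighted cumulative effect at one grid point.
        static double CumulativeEffect(double[] v, double[] d)
        {
            double num = 0, den = 0;
            for (int i = 0; i < v.Length; i++)
            {
                if (d[i] == 0) return v[i];  // grid point coincides with a sensor
                num += v[i] / d[i];          // directly proportional to the sensor value
                den += 1.0 / d[i];           // inversely proportional to the distance
            }
            return num / den;
        }
    }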
Therefore, for each grid point, the sensors measuring the given measurement are taken, and their data is smoothed as described above according to the specified interval. The smoothed data is then interpolated using the formula above, once for every interval.
SQL query generation: Queries are issued by the user, who is oblivious to the processing that takes place with respect to the query.
Parsing and transforming SQL queries: The queries issued by the user are parsed and a query tree is formed. The arguments for the user defined functions are extracted from this parse tree and passed to the functions, which replace the grid wherever it is used in the query.
Cursor based approach:
This is the slowest user defined function in terms of performance.
Input: maximum and minimum co-ordinates of the grid and a measurement id.
Output: the interpolated environmental state.
A cursor returns the grid points lying between the maximum and minimum points given as input. The cursor then passes each row of its result set to point-influence, a function which calculates the influence of the sensors at each point for the given measurement. Inside this function, a second cursor returns the smoothed data for the particular measurement, ordered by time. For each timestamp, all the sensor data is plugged into the formula above to calculate the influence at the given grid point, and this is repeated for every timestamp.
This user defined function performs slowly because a cursor reads each value from the database one after the other; there is no buffer holding a group of rows from which we can read cheaply. As each call has to do database I/O, this function is the slowest of all.
Data structure based approach:
It is possible to write CLR enabled functions in .NET languages to perform the interpolation. CLR enabled functions query the database and obtain a data reader, which acts as a buffer for the queried data: each value can be read from it one at a time with no per-row database I/O overhead.
But there is a drawback to this approach. The environmental data, which spans sensors all over Switzerland and three years of time, is huge, amounting to several terabytes. Storing a subset of such data in in-memory data structures is not advisable: as the grid size grows with finer granularity, the data structures might not be able to hold such a magnitude of data. A substitute for the case where this occurs is discussed in the next function.
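A minimal sketch of the data reader approach in C# (the table and column names are assumptions, not the actual SwissEx schema):

    using System;
    using System.Data.SqlClient;

    class ReaderApproach
    {
        static void Run(string connectionString, int measurementId)
        {
            using (var conn = new SqlConnection(connectionString))
            {
                conn.Open();
                var cmd = new SqlCommand(
                    "SELECT s_id, ts, value FROM smoothed_data WHERE m_id = @m ORDER BY ts", conn);
                cmd.Parameters.AddWithValue("@m", measurementId);
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())  // reads come from the reader's buffer, not one I/O per row
                    {
                        int sensorId = reader.GetInt32(0);
                        DateTime ts = reader.GetDateTime(1);
                        double value = reader.GetDouble(2);
                        // accumulate per-timestamp (value, distance) pairs for the formula above
                    }
                }
            }
        }
    }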
Write-file function:
In this function, we query the database for the relevant grid points and the results are returned through a data reader. Along with this, the data values for the particular measurement are also queried. These two data readers are enough to calculate the interpolation values at a particular timestamp; the second data reader is then queried for the next timestamp to calculate the grid values at that timestamp, and so on.
In MS SQL Server 2005, to have two result sets open at a time you need to set the MARS (Multiple Active Result Sets) option to true. This option is not available for the SQL Server 2005 native client, and SQL Server 2005 is the version used in SwissEx.
The workaround is to query the database for the first timestamp and store the results in a local file, which is equivalent to caching the results in a data reader. Therefore we write the results for the first timestamp to a file as comma separated values (CSV) and read the file back to calculate the interpolation. The results for the next timestamp are then written over the file, and so on.
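A sketch of this workaround in C# (the file name and column layout are illustrative): one timestamp's rows are dumped to a CSV file, the reader is closed, and the file is re-read while the other query runs.

    using System.Data.SqlClient;
    using System.IO;

    class CsvCache
    {
        // Dump the current timestamp's (sensor id, value) rows to a CSV file,
        // overwriting whatever the previous timestamp left there.
        static void Write(SqlDataReader reader, string path)
        {
            using (var w = new StreamWriter(path))
                while (reader.Read())
                    w.WriteLine(reader.GetInt32(0) + "," + reader.GetDouble(1));
        }

        // Re-read the cached rows instead of holding a second open result set.
        static void Read(string path)
        {
            foreach (string line in File.ReadLines(path))
            {
                string[] parts = line.Split(',');
                int sensorId = int.Parse(parts[0]);
                double value = double.Parse(parts[1]);
                // feed (sensorId, value) into the interpolation formula
            }
        }
    }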
Table valued function:
This is the SQL version of the user defined functions presented so far: a single SQL statement performs the interpolation. This statement is parsed by the server itself, which then produces the QEP (query execution plan). This plan is not optimal, hence the execution is slower than the write-file method.
Non-CLR function:
This function queries the database over a SQL connection and retrieves result set rows one after the other. The advantage of this function is that we can have multiple active result sets open; the drawback is more I/O, as results are retrieved from the database row by row. Thanks to the multiple active result sets, this function performs the best among all the stored procedures so far.
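With a client that supports it, MARS is enabled through a connection string option; a sketch (server and database names are placeholders):

    using System.Data.SqlClient;

    class MarsDemo
    {
        static void Run()
        {
            using (var conn = new SqlConnection(
                "Data Source=myServer;Initial Catalog=myDb;Integrated Security=true;" +
                "MultipleActiveResultSets=True"))
            {
                conn.Open();
                // Two commands can now hold open readers on the same connection:
                // one streaming the grid points, the other the smoothed sensor data.
            }
        }
    }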
Chunk Processing function:
The processing is done completely outside the database, in chunks of uniform size instead of chunks that depend on the data for a particular timestamp. The performance is slightly worse than the CSV file based function. This may be because, although the disk I/O has decreased, processing in chunks of tuples rather than chunks of timestamps forces the program to store previous data in a hash map, so that when the next chunk of data belonging to the same timestamp is processed, the previous data can be reused.
The algorithm uses the size of the data or the value of the timestamp as a stopping point, whichever comes first. So if we process in chunks of 20 tuples and the data contains 42 tuples, it gets processed in sets of 20, 20, and 2 tuples. There is therefore not only the burden of carrying data over from previous chunks, but also a waste of space (and hence time) while processing the remaining 2 tuples.
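A sketch of the chunked accumulation in C# (the Reading type, the .NET 6 Chunk helper, and the chunk size are assumptions): partial sums are kept in dictionaries keyed by timestamp, so a timestamp split across chunk boundaries is still summed correctly.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    record Reading(DateTime Ts, double Value, double Distance);

    class ChunkProcessing
    {
        static Dictionary<DateTime, double> Process(IEnumerable<Reading> rows, int chunkSize)
        {
            var num = new Dictionary<DateTime, double>();  // running numerator per timestamp
            var den = new Dictionary<DateTime, double>();  // running denominator per timestamp
            foreach (Reading[] chunk in rows.Chunk(chunkSize))  // e.g. 42 rows -> 20, 20, 2
                foreach (var r in chunk)
                {
                    num[r.Ts] = num.GetValueOrDefault(r.Ts) + r.Value / r.Distance;
                    den[r.Ts] = den.GetValueOrDefault(r.Ts) + 1.0 / r.Distance;
                }
            return num.Keys.ToDictionary(ts => ts, ts => num[ts] / den[ts]);
        }
    }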
Grid size   Gridcalculate     ShowGrid             WriteFile             Non-CLR              Chunk processing
            (cursor based)    (join based)         (CSV file based)
1,000       34 min            4 min 15 sec         1 min 56 sec          1 min 32 sec         2 min 29 sec
10,000      NA                53 min 41 sec        19 min 33 sec         15 min 40 sec        25 min 14 sec
50,000      NA                4 hr 21 min 50 sec   1 hr 39 min 25 sec    1 hr 6 min 38 sec    2 hr 15 min 29 sec
100,000     NA                NA                   3 hr 16 min 19 sec    2 hr 19 min 31 sec   4 hr 29 min

Table 2: Performance comparison of the user defined functions.
Note: The experiment was carried out on an octa-core server.
2.3. Query Transformation
SQL queries are issued on the grid data assuming that the interpolation is already done. We need to parse the SQL statements, perform the interpolation, and substitute the results wherever the grid table is used.
For example:
select * from grid, sensor where xval between 1 and 10 and yval between 1 and 10 and s_id=1
For this purpose we require a transformation tool which transforms the SQL query, parsing the where clause for arguments and replacing the grid table with the user defined function, taking the parsed arguments as input.
JavaCC is the most popular parser generator for the Java platform. As it concentrates on parser generation in only one language, it produces fewer errors. It is easy to use, contains functions to auto-document the parser and to build the AST (Abstract Syntax Tree), and provides hooks to dump the AST and to perform a specific action when specific nodes are encountered, which is exactly what we want.
JavaCC transforms the SQL query into an AST with a set of nodes. We parse the node formed by the where clause and extract the arguments for the user defined function from it; the rest of the where clause is output as is. We then locate the node carrying the table name grid and replace it with the user defined function, taking the parsed arguments as input.
Output: select * from showgrid(1,1,10,10,1),sensor