Astronomical Data Processing Using SciQL, an SQL Based Query Language for Array Data

1,013 views

Published on

This presentation was presented by Ying Zhang from CWI at The ADASS XXI on November 06-10, 2011 in Paris, France.

The ever growing use of high precision experimental instruments in astronomical projects, e.g., SDSS, LSST and LOFAR, amounts to an avalanche of data to be stored, curated and analysed. Ingestion of gigabytes and even terabytes of data on a daily basis is taking place in many projects, while planned experimental devices are expected to scale ingestion up to petabytes soon. Efficient data management as part of a data exploration infrastructure has become a discriminative factor for scientific progress. Relational database management systems (RDBMSs) are the prime means to fulfill the role of application mediator for data exchange and data persistence.

Nevertheless, scientific applications are still poorly served by contemporary RDBMSs. At best, the system provides a bridge towards an external library using user-defined functions, explicit import/export facilities or linked-in Java/C# interpreters. To bridge the gap between the needs of the data-intensive scientific research fields like astronomy and the current DBMS technologies, we introduce SciQL (pronounced as `cycle'), the first SQL-based query language for scientific applications with both tables and arrays as first class citizens. SciQL provides a seamless symbiosis of array-, set-, and sequence- interpretation. A key innovation is the extension of value-based grouping in SQL:2003 with structural grouping, i.e., fixed-sized and unbounded groups based on explicit relationships between the dimensional attributes of array cells. This leads to a generalization of window-based query processing with wide applicability in science domains.

In this talk, Ying Zhang demonstrates the usefulness of SciQL for astronomical data processing with examples from transient radio phenomena detection. The Transients Key Science Project (KSP) of the LOw Frequency ARray (LOFAR) focuses on exploring and understanding the explosive and dynamic universe by observing transient and variable radio sources. With its 2688 dipoles (LBA), 200 MHz sampling, 2 polarisations and 12 bit digitisation, the LOFAR antennas are capable of producing 138 petabyte of raw data per day. To process this massive volume of data, extraordinarily efficient query plans and algorithms are a must. The key operations in the Transient KSP project include cross-catalogue correlation and radio pulsars detection. For traditional RDBMSs, they are extremely hard to express in SQL and optimise for query execution. With SciQL, however, such array data oriented operations can be expressed easily and concisely. Furthermore, by revealing the properties of array data, SciQL enables the potentials of the RDBMSs to better optimise query plans.

Published in: Technology, Business
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
1,013
On SlideShare
0
From Embeds
0
Number of Embeds
147
Actions
Shares
0
Downloads
0
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

Astronomical Data Processing Using SciQL, an SQL Based Query Language for Array Data

  1. 1. Astronomical Data Processing Using SciQL, an SQL Based Query Language for Array Data Ying Zhang, Bart Scheers, Martin Kersten, Milena Ivanova, Niels Nes CWI Amsterdam ADASS XXI, Nov. 06-10, 2011, Paris, France !"#$%&()*+,#-&$.#/(012#&+$#%3$%#,( 2.#(4&#$5()*+,#-&$".1(6&$& !"#$%&()&"#*+,-( ./0/123 4")*()5"%%,%*(*#-(( 6!7(8 9:7;;9
  2. 2. Why Not RDBMS? SQL is difficult No appropriate array denotations No functional complete operation set DBMSs are slow Too much overhead Size limitations (due to BLOB representations) Existing foreign files Scale ...2011-11-09 ADASS XXI 3
  3. 3. SciQL An array query language based on SQL:2003 To lower the entrance fee to RDBMSs Distinguish features (Kersten et al. AD ’11; Zhang et al. IDEAS2011): Arrays and tables as first class citizens of DBMSs Seamless integration of relational and array paradigms Named dimensions with constraints Flexible structure-based grouping LOFAR Transient Key Project use case2011-11-09 ADASS XXI 4
  4. 4. Array Definitions y null 3 0.0 0.0 0.0 0.0 2 0.0 0.0 0.0 0.0 null null 1 0.0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 x 0 1 2 3 null CREATE ARRAY A1 ( x INT DIMENSION [0:1:4], y INT DIMENSION [0:1:4], v FLOAT DEFAULT 0.0);2011-11-09 ADASS XXI 5
  5. 5. Array Definitions y null 3 0.0 0.0 0.0 0.0 2 0.0 0.0 0.0 0.0 null null 1 0.0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 x 0 1 2 3 null CREATE ARRAY A1 ( x INT DIMENSION [0:1:4], y INT DIMENSION [0:1:4], v FLOAT DEFAULT 0.0); dimensions, any scalar data type2011-11-09 ADASS XXI 5
  6. 6. Array Definitions y null 3 0.0 0.0 0.0 0.0 2 0.0 0.0 0.0 0.0 null null 1 0.0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 x 0 1 2 3 null CREATE ARRAY A1 ( x INT DIMENSION [0:1:4], y INT DIMENSION [0:1:4], v FLOAT DEFAULT 0.0); dimensional range: [(start|∗) : (step|∗) : (stop|∗)]2011-11-09 ADASS XXI 6
  7. 7. Array Definitions y null 3 0.0 0.0 0.0 0.0 2 0.0 0.0 0.0 0.0 null null 1 0.0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 x 0 1 2 3 null CREATE ARRAY A1 ( x INT DIMENSION [0:1:4], y INT DIMENSION [0:1:4], v FLOAT DEFAULT 0.0); cell values, any column data type2011-11-09 ADASS XXI 7
  8. 8. Array Tiling SELECT [x], [y], AVG(v) FROM A1 GROUP BY A1[x:x+2][y:y+2]; y null 3 0.0 0.0 0.0 0.0 2 0.0 0.0 0.0 0.0 null null 1 0.0 0.5 0.5 0.5 0 0.0 0.0 0.0 0.0 0 1 2 3 x null2011-11-09 ADASS XXI 8
  9. 9. Array Tiling SELECT [x], [y], AVG(v) FROM A1 GROUP BY A1[x:x+2][y:y+2]; y null 3 0.0 0.0 0.0 0.0 Anchor point: 2 0.0 0.0 0.0 0.0 A1[x][y] null null 1 0.0 0.5 0.5 0.5 0 0.0 0.0 0.0 0.0 0 1 2 3 x null2011-11-09 ADASS XXI 8
  10. 10. Array Tiling SELECT [x], [y], AVG(v) FROM A1 GROUP BY A1[x:x+2][y:y+2]; y null 3 0.0 0.0 0.0 0.0 Anchor point: 2 0.0 0.0 0.0 0.0 A1[x][y] null null 1 0.0 0.5 0.5 0.5 0 0.0 0.0 0.0 0.0 0 1 2 3 x null2011-11-09 ADASS XXI 8
  11. 11. Array Tiling SELECT [x], [y], AVG(v) FROM A1 GROUP BY A1[x:x+2][y:y+2]; y null 3 0.0 0.0 0.0 0.0 Anchor point: 2 0.0 0.0 0.0 0.0 A1[x][y] null null 1 0.0 0.5 0.5 0.5 0 0.0 0.0 0.0 0.0 0 1 2 3 x null2011-11-09 ADASS XXI 8
  12. 12. Array Tiling SELECT [x], [y], AVG(v) FROM A1 GROUP BY A1[x:x+2][y:y+2]; y null 3 0.0 0.0 0.0 0.0 Anchor point: 2 0.0 0.0 0.0 0.0 A1[x][y] null null 1 0.0 0.5 0.5 0.5 0 0.0 0.0 0.0 0.0 0 1 2 3 x null2011-11-09 ADASS XXI 8
  13. 13. Array Tiling SELECT [x], [y], AVG(v) FROM A1 GROUP BY A1[x:x+2][y:y+2]; y null 3 0.0 0.0 0.0 0.0 Anchor point: 2 0.0 0.0 0.0 0.0 A1[x][y] null null 1 0.0 0.5 0.5 0.5 0 0.0 0.0 0.0 0.0 0 1 2 3 x null2011-11-09 ADASS XXI 8
  14. 14. Array Tiling SELECT [x], [y], AVG(v) FROM A1 GROUP BY A1[x:x+2][y:y+2]; y null 3 0.0 0.0 0.0 0.0 2 0.0 0.0 0.0 0.0 null null 1 0.125 0.25 0.25 0.25 0 0.125 0.25 0.25 0.25 0 1 2 3 x null2011-11-09 ADASS XXI 9
  15. 15. LOFAR Catalogue ra DOUBLE, zone (Gray et al. 2006) frequency decl DOUBLE, ra_err DOUBLE, 90 decl_err DOUBLE, ... flux DOUBLE, ... ... ν4 2 ν3 1 ν2 V 0 U ν1 Q -1 I -2 t1 t2 t3 t4 ... time ... -90 0 1 2 3 ... 357 358 359 meridian CREATE ARRAY LOFARsrc ( zone INT DIMENSION[-90:1:91], mrdn INT DIMENSION[0:1:360], ts TIMESTAMP DIMENSION, freq INT DIMENSION[30:10:241], id INT DIMENSION[0:1:*], stks CHAR(1) DIMENSION CHECK(stks=`I OR stks=`Q OR stks=`U OR stks=`V), ra DOUBLE, decl DOUBLE, ra_err DOUBLE, decl_err DOUBLE, flux DOUBLE, ...);2011-11-09 ADASS XXI 10
  16. 16. LOFAR Use Case ra DOUBLE, zone (Gray et al. 2006) frequency decl DOUBLE, ra_err DOUBLE, 90 decl_err DOUBLE, ... flux DOUBLE, ... ... ν4 2 ν3 1 ν2 V 0 U ν1 Q -1 I -2 t1 t2 t3 t4 ... time ... -90 0 1 2 3 ... 357 358 359 meridian Similarity of the flux of a LOFAR source at frequencies 30 MHz and 200 MHz cross-correlation of two time series2011-11-09 ADASS XXI 11
  17. 17. Cross-Correlation idx 0 1 2 3 F val 4 3 6 2 idx 0 1 2 G val 1 5 7 idx -3 -2 -1 0 1 2 Cr val2011-11-09 ADASS XXI 12
  18. 18. Cross-Correlation Cr.idx = -3 idx 0 1 2 3 F F [3 : 4] val 4 3 6 2 idx 0 1 2 G [0 : 1] G val 1 5 7 idx -3 -2 -1 0 1 2 Cr val 22011-11-09 ADASS XXI 13
  19. 19. Cross-Correlation Cr.idx = -2 idx 0 1 2 3 F F [2 : 4] val 4 3 6 2 idx 0 1 2 G [0 : 2] G val 1 5 7 idx -3 -2 -1 0 1 2 Cr val 2 162011-11-09 ADASS XXI 14
  20. 20. Cross-Correlation Cr.idx = -1 idx 0 1 2 3 F F [1 : 4] val 4 3 6 2 idx 0 1 2 G [0 : 3] G val 1 5 7 idx -3 -2 -1 0 1 2 Cr val 2 16 472011-11-09 ADASS XXI 15
  21. 21. Cross-Correlation Cr.idx = 0 idx 0 1 2 3 F F [0 : 3] val 4 3 6 2 idx 0 1 2 G [0 : 3] G val 1 5 7 idx -3 -2 -1 0 1 2 Cr val 2 16 47 612011-11-09 ADASS XXI 16
  22. 22. Cross-Correlation Cr.idx = 1 idx 0 1 2 3 F F [0 : 2] val 4 3 6 2 idx 0 1 2 G [1 : 3] G val 1 5 7 idx -3 -2 -1 0 1 2 Cr val 2 16 47 61 412011-11-09 ADASS XXI 17
  23. 23. Cross-Correlation Cr.idx = 2 idx 0 1 2 3 F F [0 : 1] val 4 3 6 2 idx 0 1 2 G [2 : 3] G val 1 5 7 idx -3 -2 -1 0 1 2 Cr val 2 16 47 61 41 282011-11-09 ADASS XXI 18
  24. 24. LOFAR Use Case ra DOUBLE, zone (Gray et al. 2006) frequency decl DOUBLE, ra_err DOUBLE, 90 decl_err DOUBLE, ... flux DOUBLE, ... ... ν4 2 ν3 1 ν2 V 0 U ν1 Q -1 I -2 t1 t2 t3 t4 ... time ... -90 0 1 2 3 ... 357 358 359 meridian DECLARE fcnt INT, gcnt INT; SET fcnt = SELECT COUNT(*) FROM LOFARsrc[*][*][*][30][11][‘I’]; SET gcnt = SELECT COUNT(*) FROM LOFARsrc[*][*][*][200][11][‘I’]; CREATE ARRAY VIEW F (idx INT DIMENSION[0:1:fcnt], flux DOUBLE DEFAULT 0.0) AS SELECT flux FROM LOFARsrc[*][*][*][30][11][‘I’]; CREATE ARRAY VIEW G (idx INT DIMENSION[0:1:gcnt], val DOUBLE DEFAULT 0.0) AS SELECT flux FROM LOFARsrc[*][*][*][200][11][‘I’]; CREATE ARRAY CrCorr30_200 (idx INT DIMENSION[-fcnt+1:1:gcnt], val DOUBLE DEFAULT 0.0); INSERT INTO CrCorr SELECT SUM(F.flux * G.flux) FROM F, G, CrCorr30_200 AS C GROUP BY F[MAX(0, -C.idx) : MIN(fcnt, gcnt-C.idx)], G[MAX(0, C.idx) : MIN(gcnt, fcnt+C.idx)];2011-11-09 ADASS XXI 19
  25. 25. LOFAR Use Case ra DOUBLE, zone (Gray et al. 2006) frequency decl DOUBLE, ra_err DOUBLE, 90 decl_err DOUBLE, ... flux DOUBLE, ... ... ν4 2 ν3 1 ν2 V 0 U ν1 Q -1 I -2 t1 t2 t3 t4 ... time ... -90 0 1 2 3 ... 357 358 359 meridian retrieve the time series DECLARE fcnt INT, gcnt INT; SET fcnt = SELECT COUNT(*) FROM LOFARsrc[*][*][*][30][11][‘I’]; SET gcnt = SELECT COUNT(*) FROM LOFARsrc[*][*][*][200][11][‘I’]; CREATE ARRAY VIEW F (idx INT DIMENSION[0:1:fcnt], flux DOUBLE DEFAULT 0.0) AS SELECT flux FROM LOFARsrc[*][*][*][30][11][‘I’]; CREATE ARRAY VIEW G (idx INT DIMENSION[0:1:gcnt], val DOUBLE DEFAULT 0.0) AS SELECT flux FROM LOFARsrc[*][*][*][200][11][‘I’]; CREATE ARRAY CrCorr30_200 (idx INT DIMENSION[-fcnt+1:1:gcnt], val DOUBLE DEFAULT 0.0); INSERT INTO CrCorr SELECT SUM(F.flux * G.flux) FROM F, G, CrCorr30_200 AS C GROUP BY F[MAX(0, -C.idx) : MIN(fcnt, gcnt-C.idx)], G[MAX(0, C.idx) : MIN(gcnt, fcnt+C.idx)];2011-11-09 ADASS XXI 19
  26. 26. LOFAR Use Case ra DOUBLE, zone frequency decl DOUBLE, ra_err DOUBLE, 90 decl_err DOUBLE, ... flux DOUBLE, ... ... ν4 2 ν3 1 ν2 V 0 U ν1 Q -1 I -2 t1 t2 t3 t4 ... time ... -90 0 1 2 3 ... 357 358 359 meridian dynamic grouping for every iteration DECLARE fcnt INT, gcnt INT; SET fcnt = SELECT COUNT(*) FROM LOFARsrc[*][*][*][30][11][‘I’]; SET gcnt = SELECT COUNT(*) FROM LOFARsrc[*][*][*][200][11][‘I’]; CREATE ARRAY VIEW F (idx INT DIMENSION[0:1:fcnt], flux DOUBLE DEFAULT 0.0) AS SELECT flux FROM LOFARsrc[*][*][*][30][11][‘I’]; CREATE ARRAY VIEW G (idx INT DIMENSION[0:1:gcnt], val DOUBLE DEFAULT 0.0) AS SELECT flux FROM LOFARsrc[*][*][*][200][11][‘I’]; CREATE ARRAY CrCorr30_200 (idx INT DIMENSION[-fcnt+1:1:gcnt], val DOUBLE DEFAULT 0.0); INSERT INTO CrCorr SELECT SUM(F.flux * G.flux) FROM F, G, CrCorr30_200 AS C GROUP BY F[MAX(0, -C.idx) : MIN(fcnt, gcnt-C.idx)], G[MAX(0, C.idx) : MIN(gcnt, fcnt+C.idx)];2011-11-09 ADASS XXI 20
  27. 27. Conclusion SciQL: a novel query language for scientific data A symbiosis of relational and array paradigm Simplifies expression of complex scientific algorithms Leave optimisation to DBMS kernel Opens opportunities to enhance scientific data mining Under active implementation !"#$%&()*+,#-&$.#/(012#&+$#%3$%#,( www.scilens.org www.monetdb.org 2.#(4&#$5()*+,#-&$".1(6&$& !"#$%&()&"#*+,-( ./0/123 4")*()5"%%,%*(*#-(( 6!7(8 9:7;;92011-11-09 ADASS XXI 21

×