SciQL, Bridging the Gap between Science and Relational DBMS

  • 78 views
Uploaded on

by Ying Zhang, Martin Kersten, Niels Nes …

by Ying Zhang, Martin Kersten, Niels Nes
CWI Amsterdam
Dec. 03, 2012

More in: Technology
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Be the first to comment
    Be the first to like this
No Downloads

Views

Total Views
78
On Slideshare
0
From Embeds
0
Number of Embeds
0

Actions

Shares
Downloads
3
Comments
0
Likes
0

Embeds 0

No embeds

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
    No notes for slide

Transcript

  • 1. SciQL Bridging the Gap between Science and Relational DBMS Ying Zhang, Martin Kersten, Niels Nes CWI Amsterdam Dec. 03, 2012 Virtual(Observatory(Infrastructure( for(Earth(Observation(Data Project(acronym:( TELEIOS Grant(agreement(no:(( FP7(8 257662
  • 2. Who needs arrays anyway? Seismology – 1-D waveforms, 3-D spatial data Astronomy – temporal ordered rasters Climate simulation – temporal ordered grid Remote sensing – images of 2-D or higher Genomics – ordered DNA strings Forensics – images, strings, graphs Scientists love arrays: FITS, MSEED, NETCDF, HDF5… but also use: lists, tables, XML, ... 2012-12-03 Ying Zhang 2
  • 3. Arrays In DBMS Research issues already in the 80’s Algebraic frameworks OODB, multi-dimensional DBMS, Sequence DBMS, ... The Longhorn Array Database (S)RAM, AQL, AML, ... RasDaMan SQL language extension Store large arrays in chunks as BLOBs RasQL, AQuery, SRQL, … Array query (RasQL) optimisation on top of DBMS a notion of order Known to work up to 12 TBs! PostgreSQL 8.1 SQL:1999, SQL:2003 SciDB collection type, C-style arrays Array DBMS from scratch aggregation functions over arrays Overlapping chunks for parallel execution 2012-12-03 Ying Zhang 3
  • 4. What Not RDBMS? SQL is difficult Appropriate array denotations? Functional complete operation set? DBMSs are slow Too much overhead Size limitations (due to BLOB representations)? Existing foreign files? Scale ... 2012-12-03 Ying Zhang 4
  • 5. SciQL An array query language based on SQL:2003 Lower the entrance fee to RDBMSs Reveal optimisation opportunities Distinguish features (Kersten et al., AD’11; Zhang et al. IDEAS2011): Arrays and tables as first class citizens of DBMSs Seamless integration of relational and array paradigms Named dimensions with constraints Flexible structure-based grouping 2012-12-03 Ying Zhang 5
  • 6. Array Definitions y 3 0.0 0.0 0.0 0.0 2 0.0 0.0 0.0 0.0 1 0.0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0 1 2 3 x CREATE ARRAY A1 ( x INT DIMENSION [0:1:4], y INT DIMENSION [0:1:4], v FLOAT DEFAULT 0.0); 2012-12-03 Ying Zhang 6
  • 7. Array Definitions y 3 0.0 0.0 0.0 0.0 2 0.0 0.0 0.0 0.0 1 0.0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0 1 2 3 x CREATE ARRAY A1 ( x INT DIMENSION [0:1:4], y INT DIMENSION [0:1:4], v FLOAT DEFAULT 0.0); dimensions, any scalar data type 2012-12-03 Ying Zhang 6
  • 8. Array Definitions y 3 0.0 0.0 0.0 0.0 2 0.0 0.0 0.0 0.0 1 0.0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0 1 2 3 x CREATE ARRAY A1 ( x INT DIMENSION [0:1:4], y INT DIMENSION [0:1:4], v FLOAT DEFAULT 0.0); dimensional range: [(start|∗) : (step|∗) : (stop|∗)] 2012-12-03 Ying Zhang 7
  • 9. Array Definitions y 3 0.0 0.0 0.0 0.0 2 0.0 0.0 0.0 0.0 1 0.0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0 1 2 3 x CREATE ARRAY A1 ( x INT DIMENSION [0:1:4], y INT DIMENSION [0:1:4], v FLOAT DEFAULT 0.0); cell values, any column data type 2012-12-03 Ying Zhang 8
  • 10. Array Definitions y 3 0.0 0.0 0.0 0.0 2 0.0 0.0 0.0 0.0 1 0.0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0 1 2 3 x CREATE ARRAY A1 ( x INT DIMENSION [0:1:4], y INT DIMENSION [0:1:4], v FLOAT DEFAULT 0.0); fixed array 2012-12-03 Ying Zhang 9
  • 11. Array Definitions Unbounded array CREATE ARRAY A2 ( x INT DIMENSION[1:*], y INT DIMENSION, v FLOAT DEFAULT 0.0); y 3 2 null 1 0 0 2012-12-03 1 2 Ying Zhang 3 x 10
  • 12. Array Definitions Unbounded array CREATE ARRAY A2 ( x INT DIMENSION[1:*], y INT DIMENSION, v FLOAT DEFAULT 0.0); INSERT INTO A2 VALUES (1,0,5.5), (1,1,0.4), (2,2,4.5); y 3 2 null 1 0 0 2012-12-03 1 2 Ying Zhang 3 x 10
  • 13. Array Definitions Unbounded array CREATE ARRAY A2 ( x INT DIMENSION[1:*], y INT DIMENSION, v FLOAT DEFAULT 0.0); INSERT INTO A2 VALUES (1,0,5.5), (1,1,0.4), (2,2,4.5); y null 3 2 1 0.0 null 4.5 0.4 0.0 0 5.5 0.0 1 2 0 2012-12-03 null Ying Zhang null 3 x 10
  • 14. Array Definitions Unbounded array CREATE ARRAY A2 ( x INT DIMENSION[1:*], y INT DIMENSION, v FLOAT DEFAULT 0.0); INSERT INTO A2 VALUES (1,0,5.5), (1,1,0.4), (2,2,4.5); current range y null 3 2 1 0.0 null 4.5 0.4 0.0 0 5.5 0.0 1 2 0 2012-12-03 null Ying Zhang null 3 x 10
  • 15. Array & Table Coercions SELECT x, y, v FROM A1; CREATE ARRAY A1 ( x INT DIMENSION[0:1:4], y INT DIMENSION[0:1:4], v FLOAT DEFAULT 0.0); y 3 0.0 0.0 0.0 0.0 2 0.0 0.0 0.0 0.0 1 0.0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0 1 2 3 2012-12-03 x x 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3 y 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 Ying Zhang v 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 full materialisation! 11
  • 16. Array & Table Coercions SELECT [x], [y], v FROM T2; CREATE TABLE T2 ( x INT, y INT, v FLOAT DEFAULT 0.0); INSERT INTO T2 VALUES (1,0,5.5), (1,1,0.4), (2,2,4.5), (1,1,1.3); x y v 1 0 5.5 1 1 0.4 2 2 4.5 1 1 1.3 dimension qualifiers: ‘[’, ‘]’ y null 3 2 1 0 0 4.5 0.4 0.0 5.5 null 0.0 0.0 1 2 null null 3 x An unbounded array dimension ranges derived from the minimal bounding box cells values from the table or the column default duplicates are overwritten arbitrarily 2012-12-03 Ying Zhang 12
  • 17. Array Modifications DELETE FROM A1 WHERE x = 1; y 3 0.0 null 0.0 0.0 2 0.0 null 0.0 0.0 1 0.0 null 0.0 0.0 0 0.0 null 0.0 0.0 0 1 2 3 x creates holes in the array 2012-12-03 Ying Zhang 13
  • 18. Array Modifications UPDATE A1 SET v = 0.5 WHERE y = 1; INSERT INTO A1 VALUES (0,1,0.5), (1,1,0.5), (2,1,0.5), (3,1,0.5); y 3 0.0 0.0 0.0 0.0 2 0.0 0.0 0.0 0.0 1 0.5 0.5 0.5 0.5 0 0.0 0.0 0.0 0.0 0 1 2 3 x set/change cell values overwrite existing values 2012-12-03 Ying Zhang 14
  • 19. Array Views CREATE VIEW A2 AS SELECT [x], [y], v FROM A1[1:3][1:3]; y 3 0.0 0.0 0.0 0.0 2 0.0 0.0 0.0 0.0 1 0.5 0.5 0.5 0.5 0 0.0 0.0 0.0 0.0 0 1 2 3 x array view A2 2012-12-03 Ying Zhang 15
  • 20. Array Views CREATE ARRAY VIEW A3 ( x INT DIMENSION [-1:1:5], y INT DIMENSION [-1:1:5], w FLOAT DEFAULT 0.0) AS SELECT [x], [y], v FROM A1; y 3 -1.0 -1.0 -1.0 -1.0 2 -1.0 -1.0 -1.0 -1.0 1 -1.0 0.5 0 -1.0 -1.0 -1.0 -1.0 0 2012-12-03 1 0.5 2 Ying Zhang 0.5 3 x 16
  • 21. Array Views CREATE ARRAY VIEW A3 ( x INT DIMENSION [-1:1:5], y INT DIMENSION [-1:1:5], w FLOAT DEFAULT 0.0) AS SELECT [x], [y], v FROM A1; y 4 3 0.0 -1.0 -1.0 -1.0 -1.0 0.0 2 0.0 -1.0 -1.0 -1.0 -1.0 0.0 1 0.0 -1.0 0.5 0 0.0 -1.0 -1.0 -1.0 -1.0 0.0 -1 0.0 0.0 0.0 0.0 0.0 0.0 -1 2012-12-03 0.0 0.0 0.0 0.0 0 1 2 3 4 0.5 Ying Zhang 0.0 0.5 0.0 0.0 x 17
  • 22. Array Tiling SELECT [x], [y], AVG(v) FROM A1 GROUP BY A1[x:x+2][y:y+2]; y 3 0.0 0.0 0.0 2 0.0 0.0 0.0 0.0 1 0.0 0.5 0.5 0.5 0 0.0 0.0 0.0 0.0 0 2012-12-03 0.0 1 2 3 Ying Zhang x 18
  • 23. Array Tiling SELECT [x], [y], AVG(v) FROM A1 GROUP BY A1[x:x+2][y:y+2]; y 3 0.0 0.0 2 0.0 0.0 0.0 0.0 0.0 0.5 0.5 0.5 0 0.0 0.0 0.0 0.0 0 2012-12-03 0.0 1 Anchor point: A1[x][y] 0.0 1 2 3 Ying Zhang x 18
  • 24. Array Tiling SELECT [x], [y], AVG(v) FROM A1 GROUP BY A1[x:x+2][y:y+2]; y 3 0.0 0.0 2 0.0 0.0 0.0 0.0 0.0 0.5 0.5 0.5 0 0.0 0.0 0.0 0.0 0 2012-12-03 0.0 1 Anchor point: A1[x][y] 0.0 1 2 3 Ying Zhang x 18
  • 25. Array Tiling SELECT [x], [y], AVG(v) FROM A1 GROUP BY A1[x:x+2][y:y+2]; y 3 0.0 0.0 2 0.0 0.0 0.0 0.0 0.0 0.5 0.5 0.5 0 0.0 0.0 0.0 0.0 0 2012-12-03 0.0 1 Anchor point: A1[x][y] 0.0 1 2 3 Ying Zhang x 18
  • 26. Array Tiling SELECT [x], [y], AVG(v) FROM A1 GROUP BY A1[x:x+2][y:y+2]; y 3 0.0 0.0 2 0.0 0.0 0.0 0.0 0.0 0.5 0.5 0.5 0 0.0 0.0 0.0 0.0 0 2012-12-03 0.0 1 Anchor point: A1[x][y] 0.0 1 2 3 Ying Zhang x 18
  • 27. Array Tiling SELECT [x], [y], AVG(v) FROM A1 GROUP BY A1[x:x+2][y:y+2]; y 3 0.0 0.0 2 0.0 0.0 0.0 0.0 0.0 0.5 0.5 0.5 0 0.0 0.0 0.0 0.0 0 2012-12-03 0.0 1 Anchor point: A1[x][y] 0.0 1 2 3 Ying Zhang x 18
  • 28. Array Tiling SELECT [x], [y], AVG(v) FROM A1 GROUP BY A1[x:x+2][y:y+2]; y 3 0.0 0.0 0.0 0.0 2 0.0 0.0 0.0 0.0 1 0.125 0.25 0.25 0.25 0 0.125 0.25 0.25 0.25 0 2012-12-03 1 2 Ying Zhang 3 x 19
  • 29. Array Tiling SELECT [x], [y], AVG(v) FROM A1 GROUP BY A1[x:x+2][y:y+2] HAVING x mod 2 = 0 AND y mod 2 = 0; y 3 0.0 0.0 2 0.0 0.0 0.0 0.0 0.0 0.5 0.5 0.5 0 0.0 0.0 0.0 0.0 0 2012-12-03 0.0 1 Anchor point: A1[x][y] 0.0 1 2 3 Ying Zhang x 20
  • 30. Array Tiling SELECT [x], [y], AVG(v) FROM A1 GROUP BY A1[x-1][y], A1[x][y-1], A1[x][y], A1[x+1][y], A1[x][y+1]; y 3 0.0 0.0 0.0 2 0.0 0.0 0.0 0.0 1 0.0 0.5 0.5 0.5 0 0.0 0.0 0.0 0.0 0 2012-12-03 0.0 1 2 3 Ying Zhang x 21
  • 31. the LOw Frequency ARray radio telescope
  • 32. LOFAR Catalogue ra DOUBLE, decl DOUBLE, ra_err DOUBLE, decl_err DOUBLE, flux DOUBLE, ... frequency zone (Gray et al. 2006) 90 ... ... ν4 2 ν3 1 0 ν2 -1 ν1 t1 -2 t2 t3 t4 ... I Q U V time ... -90 0 2012-12-03 1 2 3 ... 357 358 359 meridian Ying Zhang 23
  • 33. LOFAR Catalogue ra DOUBLE, decl DOUBLE, ra_err DOUBLE, decl_err DOUBLE, flux DOUBLE, ... frequency zone (Gray et al. 2006) 90 ... ... ν4 2 ν3 1 0 ν2 -1 ν1 t1 -2 t2 t3 t4 ... I Q U V time ... -90 0 1 2 3 ... 357 358 359 meridian CREATE zone ts id ARRAY LOFARsrc ( INT DIMENSION[-90:1:91], mrdn INT DIMENSION[0:1:360], TIMESTAMP DIMENSION, freq INT DIMENSION[30:10:241], INT DIMENSION[0:1:*], stks CHAR(1) DIMENSION CHECK(stks=`I' OR stks=`Q' OR stks=`U' OR stks=`V'), ra DOUBLE, decl DOUBLE, ra_err DOUBLE, decl_err DOUBLE, flux DOUBLE, ...); 2012-12-03 Ying Zhang 23
  • 34. LOFAR Use Case ra DOUBLE, decl DOUBLE, ra_err DOUBLE, decl_err DOUBLE, flux DOUBLE, ... frequency zone (Gray et al. 2006) 90 ... ... ν4 2 ν3 1 0 ν2 -1 ν1 t1 -2 t2 t3 t4 ... I Q U V time ... -90 0 1 2 3 ... 357 358 359 meridian Similarity of the flux of a LOFAR source at frequencies 30 MHz and 200 MHz cross-correlation of two time series 2012-12-03 Ying Zhang 24
  • 35. Cross-Correlation F idx 0 1 2 3 val 4 3 6 2 idx Cr 2012-12-03 idx -3 -2 1 2 val G 0 1 5 7 -1 0 1 2 val Ying Zhang 25
  • 36. Cross-Correlation Cr.idx = -3 idx 1 2 3 val F 0 4 3 6 2 idx 0 1 2 val 1 5 7 G Cr 2012-12-03 idx -3 val -2 -1 F [3 : 4] 0 1 G [0 : 1] 2 Ying Zhang 2 26
  • 37. Cross-Correlation Cr.idx = -2 idx 1 2 3 val F 0 4 3 6 2 idx 0 1 2 val 1 5 7 G Cr 2012-12-03 idx -3 -2 val 2 -1 0 F [2 : 4] 1 G [0 : 2] 16 Ying Zhang 2 27
  • 38. Cross-Correlation Cr.idx = -1 idx 1 2 3 val F 0 4 3 6 2 idx 0 1 2 val 1 5 7 G Cr 2012-12-03 idx -3 -2 -1 val 2 16 0 1 F [1 : 4] G [0 : 3] 47 Ying Zhang 2 28
  • 39. Cross-Correlation Cr.idx = 0 idx Cr 2012-12-03 2 3 4 3 6 2 idx 0 1 2 val G 1 val F 0 1 5 7 idx -3 -2 -1 0 val 2 16 47 1 F [0 : 3] G [0 : 3] 61 Ying Zhang 2 29
  • 40. Cross-Correlation Cr.idx = 1 idx 0 1 2 3 val 4 3 6 2 idx 0 1 2 val 1 5 7 F G Cr 2012-12-03 idx -3 -2 -1 0 1 val 2 16 47 61 F [0 : 2] G [1 : 3] 41 Ying Zhang 2 30
  • 41. Cross-Correlation Cr.idx = 2 idx 2012-12-03 2 3 4 3 6 2 idx 0 1 2 val G Cr 1 val F 0 1 5 F [0 : 1] 7 G [2 : 3] idx -3 -2 -1 0 1 2 val 2 16 47 61 41 28 Ying Zhang 31
  • 42. Cross-Correlation Cr.idx = 2 idx 2012-12-03 2 4 3 6 idx 0 1 2 val G Cr 1 val F 0 1 5 3 F [0 : 1] 2 7 G [2 : 3] idx -3 -2 -1 0 1 2 val 2 16 47 61 41 28 Ying Zhang 32
  • 43. LOFAR Use Case ra DOUBLE, decl DOUBLE, ra_err DOUBLE, decl_err DOUBLE, flux DOUBLE, ... frequency zone (Gray et al. 2006) 90 ... ... ν4 2 ν3 1 0 ν2 -1 ν1 t1 -2 t2 t3 t4 ... I Q U V time ... -90 0 1 2 3 ... 357 358 359 meridian DECLARE fcnt INT, gcnt INT; SET fcnt = SELECT COUNT(*) FROM LOFARsrc[*][*][*][30][11][‘I’]; SET gcnt = SELECT COUNT(*) FROM LOFARsrc[*][*][*][200][11][‘I’]; CREATE ARRAY VIEW F (idx INT DIMENSION[0:1:fcnt], flux DOUBLE DEFAULT 0.0) AS SELECT flux FROM LOFARsrc[*][*][*][30][11][‘I’]; CREATE ARRAY VIEW G (idx INT DIMENSION[0:1:gcnt], val DOUBLE DEFAULT 0.0) AS SELECT flux FROM LOFARsrc[*][*][*][200][11][‘I’]; CREATE ARRAY CrCorr30_200 (idx INT DIMENSION[-fcnt+1:1:gcnt], val DOUBLE DEFAULT 0.0); INSERT INTO CrCorr SELECT SUM(F.flux * G.flux) FROM F, G, CrCorr30_200 AS C GROUP BY F[MAX(0, -C.idx) : MIN(fcnt, gcnt-C.idx)], G[MAX(0, C.idx) : MIN(gcnt, fcnt+C.idx)]; 2012-12-03 Ying Zhang 33
  • 44. LOFAR Use Case ra DOUBLE, decl DOUBLE, ra_err DOUBLE, decl_err DOUBLE, flux DOUBLE, ... frequency zone (Gray et al. 2006) 90 ... ... ν4 2 ν3 1 0 ν2 -1 ν1 t1 -2 t2 t3 t4 ... I Q U V time ... -90 0 1 2 3 ... 357 358 359 meridian retrieve the time series DECLARE fcnt INT, gcnt INT; SET fcnt = SELECT COUNT(*) FROM LOFARsrc[*][*][*][30][11][‘I’]; SET gcnt = SELECT COUNT(*) FROM LOFARsrc[*][*][*][200][11][‘I’]; CREATE ARRAY VIEW F (idx INT DIMENSION[0:1:fcnt], flux DOUBLE DEFAULT 0.0) AS SELECT flux FROM LOFARsrc[*][*][*][30][11][‘I’]; CREATE ARRAY VIEW G (idx INT DIMENSION[0:1:gcnt], val DOUBLE DEFAULT 0.0) AS SELECT flux FROM LOFARsrc[*][*][*][200][11][‘I’]; CREATE ARRAY CrCorr30_200 (idx INT DIMENSION[-fcnt+1:1:gcnt], val DOUBLE DEFAULT 0.0); INSERT INTO CrCorr SELECT SUM(F.flux * G.flux) FROM F, G, CrCorr30_200 AS C GROUP BY F[MAX(0, -C.idx) : MIN(fcnt, gcnt-C.idx)], G[MAX(0, C.idx) : MIN(gcnt, fcnt+C.idx)]; 2012-12-03 Ying Zhang 33
  • 45. LOFAR Use Case ra DOUBLE, decl DOUBLE, ra_err DOUBLE, decl_err DOUBLE, flux DOUBLE, ... frequency zone 90 ... ... ν4 2 ν3 1 0 ν2 -1 ν1 t1 -2 t2 t3 t4 ... I Q U V time ... -90 0 1 2 3 ... 357 358 359 meridian dynamic grouping for every iteration DECLARE fcnt INT, gcnt INT; SET fcnt = SELECT COUNT(*) FROM LOFARsrc[*][*][*][30][11][‘I’]; SET gcnt = SELECT COUNT(*) FROM LOFARsrc[*][*][*][200][11][‘I’]; CREATE ARRAY VIEW F (idx INT DIMENSION[0:1:fcnt], flux DOUBLE DEFAULT 0.0) AS SELECT flux FROM LOFARsrc[*][*][*][30][11][‘I’]; CREATE ARRAY VIEW G (idx INT DIMENSION[0:1:gcnt], val DOUBLE DEFAULT 0.0) AS SELECT flux FROM LOFARsrc[*][*][*][200][11][‘I’]; CREATE ARRAY CrCorr30_200 (idx INT DIMENSION[-fcnt+1:1:gcnt], val DOUBLE DEFAULT 0.0); INSERT INTO CrCorr SELECT SUM(F.flux * G.flux) FROM F, G, CrCorr30_200 AS C GROUP BY F[MAX(0, -C.idx) : MIN(fcnt, gcnt-C.idx)], G[MAX(0, C.idx) : MIN(gcnt, fcnt+C.idx)]; 2012-12-03 Ying Zhang 34
  • 46. Discussion Your use cases? Your problems? What do you need? How can SciQL help? 2012-12-03 Ying Zhang 35