Sky Arrays - ArrayDB in action for Sky View Factor Computation
1. ArrayDB in action for Sky View Factor computation
Andrea Pagani – KNMI DataLab
Luca Trani – KNMI R&D Seismology and Acoustics
Array Databases for Research Communities Workshop – EUDAT Conference
Porto, 22nd January 2018
2. Agenda
Use case introduction
Why ArrayDB
Old vs. New
Results analysis
Comparison
Lessons learned
Suggestions
Open issues/questions
Problems encountered
Conclusions
3. Use case description
Computation of the sky view factor at high resolution (1m grid) for the
entire Netherlands
Definition:
“The sky view factor (SVF) denotes the ratio between radiation received by a planar surface and that from
the entire hemispheric radiating environment and is calculated as the fraction of sky visible from the ground
up. SVF is a dimensionless value that ranges from 0 to 1. A SVF of 1 means that the sky is completely
visible, for example, in a flat terrain. When a location has buildings and trees, it will cause the SVF to
decrease proportionally.”
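One common isotropic-sky approximation in the literature computes SVF from the horizon elevation angles γᵢ sampled in n azimuth directions as SVF ≈ 1 − (1/n) Σ sin γᵢ: a flat horizon (γᵢ = 0 everywhere) gives SVF = 1, as the definition above requires. A minimal Python sketch of that approximation (an illustration only; the project's actual computation used the R horizon package):

```python
import math

def sky_view_factor(horizon_angles_deg):
    """Approximate SVF from horizon elevation angles (degrees),
    one per azimuth direction, assuming an isotropic sky:
    SVF ~= 1 - mean(sin(gamma_i))."""
    sines = [math.sin(math.radians(g)) for g in horizon_angles_deg]
    return 1.0 - sum(sines) / len(sines)

# Flat terrain: horizon at 0 degrees in all 36 sampled directions -> SVF = 1
print(sky_view_factor([0.0] * 36))

# Obstacles raising the horizon to 30 degrees all around -> SVF ~ 0.5
print(sky_view_factor([30.0] * 36))
```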
4. Sky view factor usage
Heat stress
Road temperature
Fog formation
5. Why ArrayDB
Initial solution:
● 1.5 TB of point cloud data (object heights) for the Netherlands in raw LAZ format
● 40,000+ LAZ files (i.e., tiles) with geo metadata encoded in the filename
● Computation to be performed in R (a convenient library for the purpose: horizon)
Issues:
● Gridding is very memory intensive
● Keeping track of tile geolocations by filename
● Ad hoc logic to merge/subset tiles
● High memory requirements to process multiple tiles
Computation:
● Tested on a highly parallel machine (24 CPUs / 128 GB RAM)
● Distributed on Amazon AWS EC2 (80 CPUs)
6. Computation, the old way
Data flow: LAZ files → master/slaves → grid files
Master tasks:
• Divide work
• Coordinate slaves
Slave tasks:
• LAS load
• Rasterization
• Merge tile neighbours
• Call SVF computation
• Write result grid
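The master's "divide work" step amounts to partitioning the 40,000+ tile list across the slaves. A minimal Python sketch of such a partitioning (hypothetical names; the project's own master/slave code was written in R):

```python
def partition(tiles, n_workers):
    """Split a list of tile names into n_workers near-equal chunks,
    as a master would before dispatching work to its slaves."""
    chunks = [[] for _ in range(n_workers)]
    for i, tile in enumerate(tiles):
        chunks[i % n_workers].append(tile)  # round-robin assignment
    return chunks

tiles = [f"tile_{i:05d}.laz" for i in range(10)]
work = partition(tiles, 3)
print([len(c) for c in work])  # [4, 3, 3]
```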
7. New workflow
Consult with an expert on populating the data cube
Pre-processing phase: off-line rasterization of LAZ files to geoTiff
Initial ingestion phase
Computation nodes querying the system
Web coverage service to retrieve the data
Query with subsetting for a 2 km × 2 km region
Compute Sky View Factor in R
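Each subsetting query is a WCS 2.0.1 GetCoverage request. A Python sketch of building such a request URL (the endpoint, coverage id, and subset syntax are taken from the error logs later in this deck; the project issued these requests from R):

```python
from urllib.parse import urlencode

def getcoverage_url(base, coverage_id, x0, x1, y0, y1):
    """Build a WCS 2.0.1 GetCoverage URL with X/Y subsetting,
    mirroring the request format used against the Rasdaman endpoint."""
    params = [
        ("service", "WCS"),
        ("version", "2.0.1"),
        ("request", "GetCoverage"),
        ("coverageId", coverage_id),
        ("subset", f"X({x0},{x1})"),
        ("subset", f"Y({y0},{y1})"),
        ("format", "image/tiff"),
    ]
    # keep (), and / literal, as in the logged requests
    return base + "?" + urlencode(params, safe="(),/")

url = getcoverage_url("http://10.100.253.10:8080/rasdaman/ows",
                      "HeightCoverage", 58900, 61100, 564150, 566350)
print(url)
```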
8. New setup
Resources provided by the Dutch high-performance computing centre SURFsara via an EUDAT project grant
1 machine (4 cores, 16 GB) with the Rasdaman arrayDB installed
19 machines (2 cores each) for computation
1 machine (2 cores) acting as super peer for computation and cluster coordination
New dataset: object heights as a GeoTIFF raster at 0.5 m resolution, available from a government institution
9. Computation, the new way
Data flow: Rasdaman server → master/slaves → geoTiff files on net-mounted storage
Master tasks:
• Divide work
• Coordinate slaves
Slave tasks:
• Call WCS service
• Call SVF computation
• Write result grid
Rasdaman tasks:
• Expose web service
• Interact with underlying DB
• Reply to queries
Future plan: write the SVF results back into the arrayDB
11. Comparison
The new solution requires an initial investment in understanding and installing the arrayDB technology,
but it makes the interaction more standardized, easier, and less error prone.
Cost-benefit analysis: depends on usage, the need to share, and the target user group
Old → New
• File-based interaction → Web-based query
• Assembling and subsetting via ad hoc logic → Initial ingestion and query-based subsetting
• Understanding georeferencing from filenames → Ingestion recipe
• Rasterization from raw data on the fly → Pre-processed raster
• Data access via distributed file system → Data transferred via HTTP response to a query
• Custom coding in R → Installation of arrayDB
12. Lessons learned
+
ArrayDB helps make your life easier
Standardized access to data
Less error prone for subsetting/assembling
Flexible access to relevant data partition
-
Input data have to be perfect, which is unfortunately not always the case in real life
Care is still needed with DB queries when many processes ask concurrently
(a distributed installation might solve this)
Caveat: interaction with an arrayDB engineer is essential to share the knowledge needed for a working solution
13. Suggestions
- Data preparation tools (e.g., to handle imperfect datasets)
- Documentation and support for users/engineers
- Collaborative tools to facilitate user/arrayDB-engineer interaction
- Promote standard interfaces/APIs, e.g., to avoid technology lock-in and to foster decoupling and software portability
14. Open issues/questions
● Reliability and scalability of the community version
● RRasdaman installation is cumbersome (from the typical R user's perspective)
● Multi-layer queries
● Support for point data/gridding
● Query logic for heterogeneous data
15. Problems encountered
● RASDAMAN configuration
rasmgr.conf: change IP → localhost
User configuration: petascope, rasadmin, rasuser
● wcst_import tool
○ The default map_mosaic recipe has problems importing files with different tile dimensions and slightly imperfect overlapping
○ Alternative custom recipe → complex
● java.lang.RuntimeException: Deadline Exceeded
Catched an exception:
org.odmg.TransactionNotInProgressException: Could not execute OQL-Query: no open transaction
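For context, a wcst_import ingestion is driven by a JSON "ingredients" file; a minimal one for the default mosaic recipe might look like the sketch below (all paths and option values are illustrative assumptions, not the project's actual configuration, and the custom recipe mentioned above replaces the `recipe` section):

```json
{
  "config": {
    "service_url": "http://localhost:8080/rasdaman/ows",
    "tmp_directory": "/tmp/",
    "automated": true
  },
  "input": {
    "coverage_id": "HeightCoverage",
    "paths": ["/data/tiles/*.tif"]
  },
  "recipe": {
    "name": "map_mosaic",
    "options": {
      "tiling": "ALIGNED [0:511, 0:511]"
    }
  }
}
```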
16. Problems encountered
2018-01-05 14:23:12 ERROR::packer-ubuntu-16-1263--error writing file /home/ubuntu/temp/58900_61100--564150_566350-temp.tiff error
Cannot create a RasterLayer object from this file. (file does not exist)
coverage:
http://10.100.253.10:8080/rasdaman/ows?service=WCS&version=2.0.1&request=GetCoverage&coverageId=HeightCoverage&subset=X(58900,61100)&subset=Y(564150,566350)&format=image/tiff
The raster is somehow corrupted after the request
17. Problems encountered
2018-01-15 22:30:13 ERROR::packer-ubuntu-16-1194--response status server 500 coverage http://10.100.253.10:8080/rasdaman/ows?service=WCS&version=2.0.1&request=GetCoverage&coverageId=HeightCoverage&subset=X(34900,37100)&subset=Y(392150,394350)&format=image/tiff not available
2018-01-15 22:30:13 ERROR::packer-ubuntu-16-1207--response status server 500 coverage http://10.100.253.10:8080/rasdaman/ows?service=WCS&version=2.0.1&request=GetCoverage&coverageId=HeightCoverage&subset=X(46900,49100)&subset=Y(526150,528350)&format=image/tiff not available
Server temporarily unavailable; repeating the same coverage request later succeeds
Workaround: repeat the request after a timeout, implemented in the R code
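The timeout-and-repeat workaround can be sketched as a small retry wrapper around the coverage request. The project implemented this in its R code; below is an illustrative Python equivalent (function names are hypothetical):

```python
import time

def with_retry(request_fn, max_attempts=5, wait_seconds=30):
    """Retry a flaky coverage request, sleeping between attempts,
    mirroring the timeout-and-repeat workaround described above."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return request_fn()
        except Exception as err:          # e.g. HTTP 500 from the WCS endpoint
            last_error = err
            if attempt < max_attempts:
                time.sleep(wait_seconds)  # back off before retrying
    raise RuntimeError(
        f"coverage still unavailable after {max_attempts} attempts"
    ) from last_error

# Usage sketch: wrap the actual WCS call, e.g.
# grid = with_retry(lambda: fetch_coverage(url), max_attempts=5, wait_seconds=60)
```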
18. Conclusions
Overall a positive experience
Other users can benefit from the ingested dataset (and the ingested
results, future work)
DataCube (a standards-based platform) rather than ArrayDB (a technology)
A hosted, ready-to-use data-cube-as-a-service would be of great help for scientific communities (with many datasets pre-ingested)
Relieve the burden of installation, setup, configuration, etc.
Shared added-value services to save investment and effort across communities
Accounting model to be discussed
- How are point data supported? Can some gridding methods be supported natively?
- Given several layers with different spatial and/or temporal resolutions, what does querying them for a specific point and time return? Is there a default logic? Can a specific logic be embedded?