Cloud Technologies for Computational Sciences (Sergey Berezin)


  1. Cloud Technologies for Computational Sciences. Sergey Berezin, Dmitry Grechka. Moscow State University and Microsoft Research
  2. Common tasks for computational scientists
       • Primary: build models; fit models with data; simulate using fitted models; validate and adjust models; share results; reproduce the results of others
       • Non-primary: find useful data; choose a dataset among those available; prepare the data for use
     Example: species occurrence probability related to temperature and precipitation
  3. Fetch function [figure: time axis labeled Jan 2009 through Jan 2013]
  4. Defensible science
       • Uncertainty: σ reflects incomplete knowledge of the quantity (noise standard deviation, confidence interval, or credible interval)
       • Reproducibility: the fetch function always returns the same value v for the same argument values
       • Provenance: what type of source data d was used to compute the value
  5. Data cube fetch [figure: axis labeled 2009 through 2013]
  6. Demo
       1. FetchClimate web interface: http://fetchclimate2.cloudapp.net
       2. FetchClimate API: http://jsfiddle.net/sergey77/swePD
  7. Common Data Model (CDM)
       • A dataset is a set of constrained variables.
       • A variable is an annotated multidimensional array.
     [figure: variable prate with axes lon, lat, t; dimensions lon and time labeled]
     http://www.unidata.ucar.edu/software/netcdf-java/CDM/
     We use the Dmitrov package as the CDM layer: http://research.microsoft.com/en-us/um/cambridge/groups/science/tools/dmitrov/
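
The CDM idea on this slide (variables that share named dimensions must agree on their lengths) can be sketched in a few lines of Python. This is only an illustration: `Variable`, `Dataset`, and `add` are hypothetical names, not the Dmitrov or netCDF-Java API, and the sample values are made up.

```python
from dataclasses import dataclass, field
from math import prod
from typing import Dict, List, Tuple


@dataclass
class Variable:
    """An annotated multidimensional array (hypothetical type, not the Dmitrov API)."""
    name: str
    dims: Tuple[str, ...]    # dimension names, one per axis
    shape: Tuple[int, ...]   # length along each axis
    data: List[float]        # values in row-major order
    attrs: Dict[str, str] = field(default_factory=dict)  # annotations such as units


class Dataset:
    """A set of variables constrained to agree on shared dimension lengths."""

    def __init__(self) -> None:
        self.dim_lengths: Dict[str, int] = {}
        self.variables: Dict[str, Variable] = {}

    def add(self, var: Variable) -> None:
        if len(var.dims) != len(var.shape):
            raise ValueError("one dimension name is required per axis")
        if len(var.data) != prod(var.shape):
            raise ValueError("data length does not match shape")
        # Check the constraint on a copy so a rejected variable changes nothing.
        lengths = dict(self.dim_lengths)
        for dim, n in zip(var.dims, var.shape):
            if lengths.setdefault(dim, n) != n:
                raise ValueError(f"dimension {dim!r} already has length {lengths[dim]}")
        self.dim_lengths = lengths
        self.variables[var.name] = var


# Made-up sample: a 2 x 3 precipitation-rate grid with its coordinate variables.
ds = Dataset()
ds.add(Variable("lon", ("lon",), (3,), [0.0, 1.0, 2.0], {"units": "degrees_east"}))
ds.add(Variable("lat", ("lat",), (2,), [50.0, 51.0], {"units": "degrees_north"}))
ds.add(Variable("prate", ("lat", "lon"), (2, 3), [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]))
```

Adding a variable whose `lon` axis had length 4 would raise `ValueError`, which is the "constrained" part of the definition.
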
  8. Data sets variety
       • Time series and scattered points
       • Long-term averaged grid
       • Time series grids
  9. Data sets variety: time series and scattered points
     Global Historical Climatology Network (GHCN v2): 21310 stations, monthly averages
       [6] time of type DateTime (time:3732)
       [5] id of type UInt64 (stations:21310)
       [4] lon of type Single (stations:21310)
       [3] lat of type Single (stations:21310)
       [2] prate of type Int32 (stations:21310) (time:3732)
       [1] temp of type Int32 (stations:21310) (time:3732)
     Peterson, Thomas C. and Russell S. Vose (1997). "An overview of the Global Historical Climatology Network temperature data base". Bulletin of the American Meteorological Society 78 (12): 2837–2849.
  10. Data sets variety: long-term averaged grid
      CRU CL 2.0, WorldClim 1.4, …
      WorldClim 1.4 (~34.7 GB): high spatial resolution (~1 km at the equator), 50-year average, 12 separate months
        [5] time of type Int32 (time:12)
        [4] lat of type Single (lat:18000)
        [3] lon of type Single (lon:43200)
        [2] prec of type Int16 (time:12) (lat:18000) (lon:43200)
        [1] tmean of type Int16 (time:12) (lat:18000) (lon:43200)
      Hijmans, R.J., S.E. Cameron, J.L. Parra, P.G. Jones and A. Jarvis, 2005. Very high resolution interpolated climate surfaces for global land areas. International Journal of Climatology 25: 1965-1978.
  11. Data sets variety: time series grids
      NCEP/NCAR Reanalysis 1: high temporal resolution (6-hour averages), spatial resolution ~200 km at the equator
        [4] lat of type Single (lat:94)
        [3] lon of type Single (lon:192)
        [2] time of type Double (time:92044)
        [1] prate of type Int16 (lat:94) (lon:192) (time:92044)
      http://www.cpc.ncep.noaa.gov/products/wesley/reanalysis.html
  12. Fetch function logic
  13. Uncertainty evaluation
        • Statistical approach. For each dataset:
            1. Find an external set of “reference values” at reference sites (considered to be exact)
            2. Generate a sample of corresponding pairs: (value computed using the dataset, “reference value”)
            3. Discover the dependencies between the value difference and the spatiotemporal location of a reference site
        • Uncertainty propagation: sequential use of methods that take uncertainty into account
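
A small Python sketch may make the two ideas on this slide concrete. It is not FetchClimate's method: `residual_stats` only summarizes the differences between computed and reference values globally (the slide's third step, relating differences to location, is not attempted), and `sigma_of_mean` uses the textbook propagation rule for a mean under the assumption that the errors are independent. Both function names and all numbers are illustrative.

```python
import math


def residual_stats(pairs):
    """Bias and sample standard deviation of (computed - reference) differences."""
    diffs = [computed - reference for computed, reference in pairs]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two pairs are needed")
    bias = sum(diffs) / n
    variance = sum((d - bias) ** 2 for d in diffs) / (n - 1)
    return bias, math.sqrt(variance)


def sigma_of_mean(sigmas):
    """Standard deviation of the mean of values with independent errors."""
    if not sigmas:
        raise ValueError("at least one sigma is needed")
    return math.sqrt(sum(s * s for s in sigmas)) / len(sigmas)


# Made-up pairs: (value computed from the dataset, reference value at the site).
pairs = [(11.0, 10.0), (9.0, 10.0), (12.0, 10.0), (8.0, 10.0)]
bias, sigma = residual_stats(pairs)

# Averaging two cells whose uncertainties are 3.0 and 4.0.
combined = sigma_of_mean([3.0, 4.0])
```
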
  14. Data handling
  15. Chunked array storage [figure: the values 1 3 4 3 7 5 4 3 2 laid out as linear array storage versus chunked array storage (HDF5)]
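
The difference between the two layouts can be illustrated by counting which chunks a rectangular read touches. The sketch below is Python written for this note, not HDF5 or FetchClimate code; `chunk_of` and `chunks_for_region` are hypothetical helpers, and the 1000 × 1000 array with 100 × 100 chunks is an assumed example.

```python
def chunk_of(row, col, chunk_rows, chunk_cols):
    """Coordinates of the chunk that holds element (row, col)."""
    return (row // chunk_rows, col // chunk_cols)


def chunks_for_region(row0, row1, col0, col1, chunk_rows, chunk_cols):
    """Chunks intersecting the half-open region [row0, row1) x [col0, col1)."""
    if row0 >= row1 or col0 >= col1:
        return []
    first_r, first_c = chunk_of(row0, col0, chunk_rows, chunk_cols)
    last_r, last_c = chunk_of(row1 - 1, col1 - 1, chunk_rows, chunk_cols)
    return [(r, c)
            for r in range(first_r, last_r + 1)
            for c in range(first_c, last_c + 1)]


# Assumed example: a 1000 x 1000 array stored in 100 x 100 chunks.
row_band = chunks_for_region(250, 260, 0, 1000, 100, 100)     # 10 rows, all columns
column_band = chunks_for_region(0, 1000, 250, 260, 100, 100)  # all rows, 10 columns
```

Both bands touch the same number of chunks, so neither access direction is penalized. With row-major linear storage the column band would instead need a separate short read from each of the 1000 rows.
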
  16. Choosing the right part size
  17. Md Array Storage for Azure [diagram: DataSet Interface]
  18. FetchClimate requests
        • Very different processing times: from seconds to hours
        • Low-latency request scheduler
        • Node restart protection
        • Partitioning and round-robin scheduler
        • Generate large datasets
        • Repeatable requests
        • Server-side cache
  19. FetchClimate top-level diagram [figure: frontend roles, Azure load balancer, worker roles, cache/queue, Azure blob storage, IDataHandler, configuration database, Azure chunked array storage]
      Labeled fields: request hash; status (Pending | Running | Completed | Failed); submit time; touch time; part count / total parts
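
One way to read the "request hash" and "status" fields is as a cache key and a state for each distinct request, so that a repeated request can reuse an earlier result. The Python sketch below shows that reading under stated assumptions: the request is a JSON-serializable dictionary, SHA-256 is the hash, and an in-memory dictionary stands in for the cache/queue. `request_hash` and `RequestTable` are hypothetical names, not FetchClimate code.

```python
import hashlib
import json


def request_hash(request: dict) -> str:
    """Hash a request so that equal requests map to the same key."""
    # Sorting keys makes the serialization independent of dictionary order.
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class RequestTable:
    """In-memory stand-in for the cache/queue (assumption for illustration)."""

    def __init__(self) -> None:
        self._records = {}

    def submit(self, request: dict) -> str:
        key = request_hash(request)
        # A repeated request keeps its existing record instead of being re-queued.
        self._records.setdefault(key, {"status": "Pending", "result": None})
        return key

    def complete(self, key: str, result) -> None:
        self._records[key] = {"status": "Completed", "result": result}

    def status(self, key: str) -> str:
        return self._records[key]["status"]


table = RequestTable()
first = table.submit({"variable": "prate", "lat": [50, 60], "lon": [30, 40]})
table.complete(first, result=[1.5, 2.5])
# Same request with the keys in a different order: same hash, already completed.
second = table.submit({"lon": [30, 40], "lat": [50, 60], "variable": "prate"})
```
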
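  20. Data source handler

          using System;
          using System.Threading.Tasks;

          // FetchRequest, FetchResponse, StorageRequest, StorageResponse, and
          // DataStorageDefinition are FetchClimate types not shown on the slide.
          interface IRequestContext
          {
              FetchRequest Request { get; }
              Task<Array> GetMaskAsync(Array uncertainty);
              DataStorageDefinition StorageDefinition { get; }
              Task<StorageResponse[]> GetDataAsync(StorageRequest[] requests);
              Task<FetchResponse[]> FetchDataAsync(FetchRequest[] requests);
          }

          abstract class DataSourceHandler
          {
              // An abstract member cannot be private, so it is declared public here.
              public abstract Task<Array> ProcessRequestAsync(IRequestContext ctx);
          }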
  21. Partitioning
        • Splitting one big request into many small ones
        • Take advantage of parallel processing
        • Protect the system from huge tasks
        • Who will join the partitions?
        • Concurrent-safe dataset
        • How to choose the partition size?
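
As a rough illustration of splitting one large request into small parts, the Python sketch below tiles a latitude/longitude index grid and deals the tiles out round-robin. It assumes square tiles of a fixed size and a fixed number of workers; how FetchClimate actually chooses part sizes and schedules parts is not stated on the slides, and `split_range`, `partition_grid`, and `round_robin` are hypothetical names.

```python
def split_range(start, stop, size):
    """Split [start, stop) into consecutive half-open ranges of at most `size`."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [(lo, min(lo + size, stop)) for lo in range(start, stop, size)]


def partition_grid(n_lat, n_lon, tile):
    """Tile an n_lat x n_lon index grid into (lat_range, lon_range) parts."""
    return [(lat, lon)
            for lat in split_range(0, n_lat, tile)
            for lon in split_range(0, n_lon, tile)]


def round_robin(parts, n_workers):
    """Deal parts out to workers in turn so each gets a near-equal share."""
    if n_workers <= 0:
        raise ValueError("n_workers must be positive")
    queues = [[] for _ in range(n_workers)]
    for i, part in enumerate(parts):
        queues[i % n_workers].append(part)
    return queues


# Assumed example: a 5 x 4 grid, tiles of at most 2 x 2 cells, 4 workers.
parts = partition_grid(5, 4, 2)
queues = round_robin(parts, 4)
```

Because the tiles do not overlap and together cover the whole grid, each worker can write its results into a disjoint region of a shared output, which is one way to read the slide's "concurrent-safe dataset".
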
  22. Future work
        • Improving uncertainty handling
        • Automatic uncertainty generation
        • Improving request scheduler
        • Different uncertainty computation complexity
        • Choosing part size
        • Predicting processing time
        • Use off-the-shelf solution
        • Averaging large temporal-spatial regions
        • Data pyramid?
  23. Questions? sergey@mstlab.org, dmitryg@mstlab.org
