PCIC Data Portal 2.0

Presentation to the staff of the Pacific Climate Impacts Consortium on 2014/02/18 about its Computational Support Group's work on version 2.0 of the PCIC Data Portal.

Published in: Software, Technology
  1. 1. Demos Architecture Bonus PCIC Data Portal 2.0 Staff Meeting James Hiebert February 18, 2014 James Hiebert PCIC Data Portal 2.0
  2. 2. Outline: 1. Demos; 2. Architecture (Metadata Database, Python Backend, Pydap, ncWMS, Basemaps, Front-end); 3. Bonus (Automated Testing)
  3. 3. Outline (speaker notes): 1. Last week we deployed our fourth, and hopefully final, release candidate for version 2.0 of the PCIC Data Portal. It’s been a four-month beta period over which we have received and responded to feedback, both from inside PCIC and from some external beta testers. Many of you have seen the portal at the various theme meetings that we had throughout the fall, but I’d like to take this opportunity both to introduce the rest of you to the data portal and to elaborate on what is running behind the scenes and all of the work that has gone into producing it. 2. Typically in these presentations, I hold you captive with all of the technical details first and save the demo for the end. But in this case, I’ll start with the demo, and then if you don’t care about how we did it, you can just check out after that.
  4. 4. Raster Portal(s): Coming soon!
  5. 5. Raster Portal(s) (speaker notes): 1. The software that we have written is a set of components that handle the organization and presentation of raster data, that is, gridded fields of spatiotemporal data. For each of several high-value datasets, we have written a “raster portal” that serves that data.
  6. 6. BCSD Downscale Canada
  7. 7. BCSD Downscale Canada (speaker notes): 1. You’ll see that the feature set is intentionally fairly sparse. The application’s purpose is to let users get the data they want, and only the data they want, and then to send them on their way. The main section of screen real estate is the map. The map displays the areas for which data exists and lets the user select an area to download. 2. In the top right, there is a tree selection which controls the dataset that is displayed and will be downloaded. Finally, there are a couple of options for selecting a time range and data format. 3. We only support formats which support multidimensional data, and there aren’t very many of those right now. We’ll be adding Arc ASCII Grid by the end of the fiscal year; it isn’t technically multidimensional, so we’ll probably send a zip file of individual grids, one per timestep.
  8. 8. BC PRISM
  9. 9. BC PRISM (speaker notes): 1. The BC PRISM portal is very similar to the BCSD Downscaling portal, with a few minor differences. First, the map projection is specific to BC. We’ve used the BC Albers projection, which is a little more visually appealing (though it does present some challenges). Second, because the PRISM data consists only of monthly climatologies, the data volume in the temporal dimension is very small. For that reason, we eliminated the time subset controls and chose to give the user the entire time range.
  10. 10. VIC (Generation 1)
  11. 11. Software Components
  12. 12. Software Components (speaker notes): 1. One thing that you’ll notice from this diagram is that the data itself is at the foundation of this software stack. Without the data in place beforehand, essentially nothing else can exist. Even the metadata in the database comes from the NetCDF files. This is why we have been somewhat militant about wanting your data to be finalized before we begin to work on the portal to publish it. 2. The NetCDF box here is the only thing that is just data sitting on disk. These four boxes (PostgreSQL, ncWMS, pydap, pdp) are all different pieces of software running on the server which respond to incoming web requests. PostgreSQL organizes all of the metadata about the available data, ncWMS provides the climate visualization layers, pydap responds to requests for the actual data, and pdp responds to all of the requests that build up the user interface. [Do a page load showing the network tools]
  13. 13. Metadata Database
  14. 14. Metadata Database (speaker notes): 1. This might be a bit too much detail, but try to bear with me. This database stores the full relationship structure between all of the data files that we store and want to publish. It tracks all of the files that we have on disk, all of the variables that they contain, and the full range of each variable, so that we can quickly set color scales and such for the visualization layers. It stores all of the metadata about the files, such as the timesteps that they contain, what their grid parameters are, what models they are from and how they relate to other driving models (for example, in the case of an RCM forced by a GCM). All of these can be grouped into “ensembles”; an ensemble is a group of rasters that we are publishing together on a single portal page. 2. The data contained in the schema allows the web application to function quickly, because everything is quickly searchable without opening up a bunch of files and reading terabytes of data just to determine a few key attributes.
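To make the idea of a searchable metadata schema concrete, here is a minimal sketch. It is not the PCIC schema, which the talk does not show: the table and column names (`data_files`, `variables`, `ensembles`, `ensemble_members`) and the sample row are assumptions made up for illustration, and it uses Python's built-in `sqlite3` in place of PostgreSQL so that it runs on its own.

```python
import sqlite3

# Hypothetical, simplified schema; the real database is PostgreSQL and far richer.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data_files (id INTEGER PRIMARY KEY, filename TEXT, model TEXT);
CREATE TABLE variables (
    id INTEGER PRIMARY KEY,
    file_id INTEGER REFERENCES data_files(id),
    name TEXT, range_min REAL, range_max REAL
);
CREATE TABLE ensembles (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE ensemble_members (
    ensemble_id INTEGER REFERENCES ensembles(id),
    variable_id INTEGER REFERENCES variables(id)
);
""")

# Made-up sample metadata for one file with one variable.
conn.execute("INSERT INTO data_files VALUES (1, 'pr_monthly.nc', 'example_model')")
conn.execute("INSERT INTO variables VALUES (1, 1, 'pr', 0.0, 812.5)")
conn.execute("INSERT INTO ensembles VALUES (1, 'bc_prism')")
conn.execute("INSERT INTO ensemble_members VALUES (1, 1)")


def layers_for_ensemble(conn, ensemble_name):
    """Return (filename, variable, min, max) rows without opening any data file."""
    return conn.execute("""
        SELECT f.filename, v.name, v.range_min, v.range_max
        FROM ensembles e
        JOIN ensemble_members m ON m.ensemble_id = e.id
        JOIN variables v ON v.id = m.variable_id
        JOIN data_files f ON f.id = v.file_id
        WHERE e.name = ?
    """, (ensemble_name,)).fetchall()


rows = layers_for_ensemble(conn, "bc_prism")
```

The point of the design is that a portal page can list its layers and their color-scale ranges with one indexed query, rather than scanning the underlying files.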
  15. 15. Python Backend
  16. 16. Python Backend (speaker notes): 1. We have written a full web application backend in Python which does all of the file format translation and all of the database communication, and passes all of the metadata on to the web UI to be interpreted by the user. The application consists of about 2800 lines of Python code plus 1500 lines of testing code that we have written outright. There are about another 3000 lines of code that make up PyDAP, which we have heavily modified.
  17. 17. Python Backend (code):
        ensemble_name = 'bc_prism'
        portal_config = {
            'title': 'High-Resolution Climatology',
            'ensemble_name': ensemble_name,
            'js_files': wrap_mini([
                'js/prism_demo_map.js',
                'js/prism_demo_controls.js',
                'js/prism_demo_app.js'],
                basename='bc_prism', debug=True)
        }
        portal_config = updateConfig(global_config, portal_config)
        map_app = wrap_auth(MapApp(**portal_config), required=False)

        dsn = dsn + '?application_name=pdp_prism'
        with session_scope(dsn) as sesh:
            conf = db_raster_configurator(
                sesh, "Download Data", 0.1, 0, ensemble_name,
                root_url=global_config['app_root'].rstrip('/') + '/' + ensemble_name + '/data/'
            )

        data_server = wrap_auth(RasterServer(dsn, conf))
        catalog_server = RasterCatalog(dsn, conf)  # No Auth
        menu = PrismEnsembleLister(dsn)

        portal = PathDispatcher([
            ('^/map/?.*$', map_app),
            ('^/catalog/.*$', catalog_server),
            ('^/data/.*$', data_server),
            ('^/menu.json.*$', menu)
        ])
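The configuration on this slide ends by handing a list of (regular expression, application) pairs to a dispatcher. The sketch below shows one way such regex-based routing between WSGI applications could work. It is not the portal's `PathDispatcher`; the names `RegexDispatcher`, `make_app` and `call` are hypothetical and exist only for this illustration.

```python
import re


class RegexDispatcher:
    """Illustrative stand-in: route a WSGI request to the first app whose pattern matches."""

    def __init__(self, routes):
        self.routes = [(re.compile(pattern), app) for pattern, app in routes]

    def __call__(self, environ, start_response):
        path = environ.get("PATH_INFO", "")
        for pattern, app in self.routes:
            if pattern.match(path):
                return app(environ, start_response)
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"Not found"]


def make_app(label):
    """Build a trivial WSGI app that answers with its own label."""
    def app(environ, start_response):
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [label.encode("utf-8")]
    return app


portal = RegexDispatcher([
    (r"^/map/?.*$", make_app("map")),
    (r"^/data/.*$", make_app("data")),
])


def call(app, path):
    """Invoke a WSGI app directly and return (status, body)."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app({"PATH_INFO": path}, start_response))
    return captured["status"], body
```

Because every component is itself a WSGI callable, wrappers such as authentication can be layered around individual routes without the dispatcher knowing about them.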
  18. 18. OPeNDAP and PyDAP: Designed to be a “discipline-neutral means of requesting and providing data across the [web]”; Data Access Protocol (DAP); open source; machine-to-machine transfer of scientific data; mostly supported by US scientific agencies (NOAA, NASA, NSF)
  19. 19. OPeNDAP and PyDAP (speaker notes): 1. PyDAP is the component of the data portal that actually provides the data download services. It’s an implementation of the OPeNDAP protocol, which is designed to be a discipline-neutral means of transferring data across the web. This protocol is open source and is designed to be OS and application independent, so that you can get data into whatever software you want to use for your data analysis. It’s supported mostly by US scientific agencies such as NOAA, NASA and the National Science Foundation.
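As a rough illustration of what a machine-to-machine DAP request looks like, the sketch below builds a DAP2-style URL that asks for a subset of one variable by index range. The helper name `dap_hyperslab_url`, the host, the file name and the index values are all assumptions for illustration; real clients usually build these URLs for you and may percent-encode the brackets.

```python
def dap_hyperslab_url(dataset_url, variable, slices, response="ascii"):
    """Build a DAP2-style request URL selecting index ranges along each dimension.

    `slices` is a list of (start, stop) index pairs, inclusive, one per dimension.
    `response` is the response-type suffix appended to the dataset URL.
    """
    constraint = variable + "".join(f"[{start}:{stop}]" for start, stop in slices)
    return f"{dataset_url}.{response}?{constraint}"


# Hypothetical server and file name, used only to show the shape of the URL.
url = dap_hyperslab_url(
    "http://example.org/data/pr_monthly.nc",
    "pr",
    [(0, 11), (100, 150), (200, 260)],
)
```

Because the subset is expressed in the URL itself, any tool that can issue an HTTP request can ask for just the time range and region it needs.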
  20. 20. OPeNDAP and PyDAP
  21. 21. OPeNDAP and PyDAP (speaker notes): 1. There are a number of different OPeNDAP servers out there, but PyDAP is the one that we use to serve all of the data itself. Its architecture is quite a bit more flexible than that of some of the other OPeNDAP servers. This is a rough layout of the architecture. It has a number of “handlers”, which are written to interpret different data formats and translate them to the DAP structure. Then, on the top end, there are numerous “responders” that translate the DAP structure into the output formats that the user wants. 2. [describe more specifically which parts are ours and which we use]
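The handler/responder split described in these notes can be sketched in a few lines. This is a toy model of the pattern and not PyDAP's actual API: the neutral structure here is a plain dictionary, and the names `csv_handler`, `json_response`, `ascii_response` and `serve` are hypothetical.

```python
import csv
import io
import json


# Handlers: parse a source format into a neutral in-memory structure.
def csv_handler(text):
    rows = list(csv.DictReader(io.StringIO(text)))
    return {"variables": list(rows[0].keys()) if rows else [], "rows": rows}


# Responses: serialize the neutral structure into the format the client asked for.
def json_response(dataset):
    return json.dumps(dataset["rows"])


def ascii_response(dataset):
    lines = [", ".join(dataset["variables"])]
    lines += [", ".join(row[v] for v in dataset["variables"]) for row in dataset["rows"]]
    return "\n".join(lines)


HANDLERS = {".csv": csv_handler}
RESPONSES = {"json": json_response, "ascii": ascii_response}


def serve(filename, text, response_type):
    """Pick a handler by file extension and a response by requested type."""
    extension = filename[filename.rfind("."):]
    dataset = HANDLERS[extension](text)
    return RESPONSES[response_type](dataset)


out = serve("stations.csv", "station,tmax\nA,21.5\nB,19.0\n", "ascii")
```

With this shape, supporting a new input format or a new output format means adding one entry to one table; the two sides never need to know about each other.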
  22. 22. How much of Pydap is our code? (Source: hg churn) pydap.handlers.pcic: 100%; pydap.handlers.hdf5: 68.0%; pydap.responses.netcdf: 61.5%; pydap.handlers.sql: 12.3%; pydap.handlers.csv: 3.7%; pydap: 2.3%; pydap.responses.xls: 1.3%; pydap.responses.html: ?
  23. 23. How much of Pydap is our code? (speaker notes): 1. To give you a bit of an idea of the degree to which Pydap was “off-the-shelf”, I ran the command “hg churn” on all of the pydap repositories, which measures the changes in a repository by lines of code. The fractions shown are the churn of PCIC staff divided by the total churn of all committers. You can see that we wrote one handler by ourselves, the HDF5 and NetCDF work is mostly ours, and for the rest of the modules we only had to make minimal changes.
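The fraction described in these notes (staff churn divided by total churn) is simple to compute once the per-author line counts are in hand. The sketch below assumes those counts have already been collected into a dictionary; the function name `churn_share`, the author names and the numbers are made up for illustration and are not the real repository statistics.

```python
def churn_share(churn_by_author, staff):
    """Fraction of total churned lines attributable to the given set of authors."""
    total = sum(churn_by_author.values())
    if total == 0:
        return 0.0
    ours = sum(lines for author, lines in churn_by_author.items() if author in staff)
    return ours / total


# Made-up numbers purely for illustration.
example = {"staff_a": 300, "staff_b": 100, "upstream_dev": 600}
share = churn_share(example, staff={"staff_a", "staff_b"})
print(f"{share:.1%}")  # prints 40.0%
```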
  24. 24. Big data, big RAM, BadRequest, Oh My!
  25. 25. Big data, big RAM, BadRequest, Oh My! (speaker notes): 1. One of the technical problems that we ran up against was that all of the available OPeNDAP data servers load their responses entirely into RAM before sending them out. So if you want to serve up large data sets, the size of your response is limited by your available RAM divided by the number of concurrent responses that you are prepared to serve. If you make a request to, say, a THREDDS OPeNDAP server that’s larger than the JVM’s allocated memory, the user will just get back a BadRequest error. 2. For some applications this may be fine, or even desirable, but for the purposes of serving large data sets, the network pipe is usually the bottleneck. Rather than annoy and frustrate users by forcing them to carve up their data requests into arbitrarily small pieces, we wanted to allow as large a request as the users were prepared to accept.
  26. 26. Generators: 70’s tech that works today! A function which yields execution rather than returning; yields values one at a time, on demand; low memory footprint; faster, no calling overhead; elegant!
  27. 27. Generators: 70’s tech that works today! (speaker notes): 1. Enter generators and coroutines. A generator is a control structure in which a function, rather than returning, can yield execution and hand back values one at a time, on demand. It has the performance advantage of maintaining a low memory footprint: if you want to return something large, you don’t have to do so all at once. Generators also tend to be slightly faster, because you avoid a lot of the calling overhead of stack manipulation. 2. Generators have been around for a good thirty-five years, but have been experiencing a bit of a renaissance lately. In Python they are extremely easy to use, and with the advent of big data applications, they have a lot of utility.
  28. 28. Generator Example:
        from itertools import islice

        def fibonacci():
            a, b = 0, 1
            while True:
                yield a
                a, b = b, a + b

        # print the first 10 values of the fibonacci sequence
        for x in islice(fibonacci(), 10):
            print(x)
  29. 29. Generator Example (speaker notes): 1. For those who aren’t familiar, here’s a quick example to help understand generators. Generating a Fibonacci sequence is kind of the quintessential toy example. The generator function, fibonacci(), is defined at the top. You’ll notice that it’s an infinite loop, because the sequence is, by definition, infinite. But rather than building up the values in memory, it just has a simple and elegant “yield” statement right inside the loop. The calling loop down below pulls items from the function, one at a time, and then does whatever it needs to do with them. It’s fast, efficient, and fairly elegant, readable code, too. 2. So you can see that for something like a web application serving big datasets, this is perfect, because we can provide a very low-latency response and then stream the data to the user as our high-latency operations, like disk reads, take place. 3. None of the OPeNDAP servers out there supported streaming, so many of the modifications that we made to PyDAP were for it to
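The streaming idea in these notes maps naturally onto WSGI, where an application may return any iterable of byte strings and the server sends each piece as it is produced. The sketch below is an assumption-laden illustration and not the modified PyDAP code: `stream_chunks` and `make_streaming_app` are hypothetical names, and an in-memory `BytesIO` stands in for a large file on disk.

```python
import io


def stream_chunks(fileobj, chunk_size=64 * 1024):
    """Yield the file's contents one chunk at a time instead of reading it all into memory."""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        yield chunk


def make_streaming_app(open_source):
    """Build a WSGI app whose body is a generator; the server pulls chunks as it sends them."""
    def app(environ, start_response):
        start_response("200 OK", [("Content-Type", "application/octet-stream")])
        return stream_chunks(open_source())
    return app


# Stand-in for a large file on disk: 200,000 bytes held in a BytesIO.
payload = b"x" * 200_000
app = make_streaming_app(lambda: io.BytesIO(payload))

# Call the app the way a WSGI server would, recording the status line.
statuses = []
body_iter = app({}, lambda status, headers: statuses.append(status))
chunk_sizes = [len(chunk) for chunk in body_iter]
```

At any moment only one chunk is held by the application, so the memory needed per response depends on `chunk_size` and not on the size of the requested data.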
  30. 30. ncWMS. Off-the-shelf: visualization of NetCDF rasters; full-featured WMS server. Limitations: file-based layer configurations (tedious and error-prone!); loads layers serially on startup (slow!); scans layers for ranges (really slow!)
  31. 31. ncWMS (speaker notes): 1. We’re using a modified version of ncWMS to provide visualization of the climate rasters. It gives us a lot of stuff for free. It’s a full-featured Web Map Service server that converts NetCDF files into tiled images usable on the web. [demo] 2. Unfortunately, it has a few limitations that make it non-ideal for use with big data. To configure a layer, you have to go through the files one by one, add them to the list, and configure 5-10 different attributes. Additionally, whenever you start or restart the server, it goes through every single file, in order, and scans it to determine its ranges so that it can assign a colorbar. This can take many minutes, possibly hours, and it only gets slower the more layers you add. 3. David Bronaugh has done some great work modifying ncWMS to run off of our metadata database, so that it gets its list of layers, all of the variable ranges and everything else from the database. This has made it possible to scale our deployment up
  32. 32. Mapnik and Basemaps: Create our own basemaps from OpenStreetMap; maximum flexibility in domain and projection
  33. 33. Mapnik and Basemaps (speaker notes): 1. A flat image of the climate rasters isn’t that useful, especially if you want to look at details in a particular locality. So thanks to some great work by Basil, we have our own web basemaps based on data from the OpenStreetMap project. We can generate our own basemaps in any projection that we want and for any domain. And we have control over the tile service, so we can tweak it for maximum performance.
  34. 34. JavaScript Front-end: 2600 lines of JavaScript; responsible for tying everything together for the web user; does little to no processing itself, just makes requests to various servers
  35. 35. JavaScript Front-end (speaker notes): 1. Finally, the last piece of the software stack is the JavaScript front-end that ties everything else together for the user. This is probably the most finicky and possibly the most complex piece of the code base, even though it doesn’t actually provide any functionality in and of itself. It has to be aware of all of the various services that are provided; it has to asynchronously make the requests, process the responses and display things to the user; and often the results of one request affect other things on the page. 2. [Show dataset selection, and how it is a request. Show how dataset selection triggers a layer change and the loading of layer attributes]. If any of these things fails, badness ensues.
  36. 36. Automated Testing
  37. 37. Automated Testing (speaker notes): 1. In our two main repositories, we have about 1500 lines of code specifically for automated testing of the functionality of both the PCDS data portal and the raster portals. This test suite covers a large swath of the code base, but is also compact, so we can run the full test suite in less than 5 seconds. This is fast enough that it can be integrated directly into your development workflow, and you can ensure that any changes you make to the code have not unintentionally broken any previously programmed functionality.
  38. 38. Automated Testing: Why? There’s a lot of code and many code paths. Manual testing is insane, takes days, and isn’t complete. Provides an “executable specification” for what the software should do. Provides a way to ensure that code changes don’t affect existing functionality (a.k.a. regression testing).
  39. 39. Automated Testing (speaker notes): 1. With a system that provides this much functionality, there are a lot of different code paths through it, any of which could be taken for different user requests. It’s important to test as many of these as possible every time you make changes to the system. To go through all of these manually (and we did, with the release of the PCDS portal a year ago) is painstaking, time-consuming and error-prone. Automating this process pays off very quickly, both in time and in code quality. 2. Additionally, the tests provide a sort of “executable specification”, declaring what the various pieces of the code are supposed to do. If a test fails, your code doesn’t meet the spec. 3. Finally, the test suite provides a baseline against which further development cannot regress. It ensures that future changes will not negatively impact the functionality that we have previously developed. 4. [demo of pytest]
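To show what an “executable specification” can look like in practice, here is a small sketch in the style that pytest collects. It does not come from the portal's test suite: `subset_time_range` is a hypothetical helper invented for this example, and the test names simply state the behaviour each one pins down.

```python
def subset_time_range(timesteps, start, end):
    """Hypothetical helper: keep the timesteps that fall within [start, end]."""
    if start > end:
        raise ValueError("start must not be after end")
    return [t for t in timesteps if start <= t <= end]


# pytest collects functions named test_*; each assert states one piece of the spec.
def test_subset_is_inclusive_of_both_endpoints():
    assert subset_time_range([1950, 1951, 1952, 1953], 1951, 1952) == [1951, 1952]


def test_empty_result_when_range_has_no_data():
    assert subset_time_range([1950, 1951], 2000, 2010) == []


def test_reversed_range_is_rejected():
    try:
        subset_time_range([1950], 1952, 1951)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")
```

Saved in a file whose name starts with `test_`, these functions would be found and run by the `pytest` command; a later change that breaks endpoint handling fails the first test immediately, which is the regression protection the notes describe.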
  40. 40. Questions and hopefully answers