Web delivery of giant climate data sets to facilitate open science
James Hiebert
Pacific Climate Impacts Consortium
University of Victoria
The world's best climate scientists are busy churning out computed projections of our planet's future atmosphere and ocean. However, major barriers exist to publishing geospatiotemporal climate data in an open and transparent manner. This is primarily due to the sheer volume of information that these data represent. A handful of variables for a handful of model realizations by dozens of models for scores of scenarios over centuries of time on tens to hundreds of thousands of grid points create big data. Secondary challenges include the novelty of high-resolution climate data and the complexity and sometimes enigmatic nature of climate data interpretation. Finally, scientific teams often lack either the background, the resources, or the mandate required to facilitate strategic data sharing, aside from dropping their output files onto an FTP site.

In the spirit of open data and open science, the Pacific Climate Impacts Consortium (PCIC), a regional climate services provider in British Columbia, Canada, has been making a concerted effort to use geospatial FOSS in order to expand the availability, comprehensibility and transparency of big climate data sets from the Coupled Model Intercomparison Project (CMIP5) experiment. This presentation will describe some of the technical challenges to serving large geospatiotemporal climate data sets over the web, including some of the nuances of climate data. Additionally, I will explain the requirements of the geospatial climate community, and outline the FOSS solutions that we have employed to serve big climate data sets over the web.

  • My name is James Hiebert from the Pacific Climate Impacts Consortium in British Columbia, and I'm going to speak about my experience putting high-resolution, downscaled, climate data on the web. Give me a few slides to unpack that statement back to front.
  • First of all I work for the Pacific Climate Impacts Consortium (PCIC), a regional climate service provider in British Columbia hosted at the University of Victoria. We are a publicly funded organization whose mandate is to conduct and facilitate regional climate research and deliver regional climate services.
  • As such we work directly with local and provincial governments, providing information on the physical science of climate change in support of the development of adaptation strategies.
  • Our stakeholders use our climate projections to answer a wide array of pertinent questions. Examples include whether future river flows can support hydro power...
  • whether future storm intensity will necessitate larger culverts, storm drains, or bridges...
  • to what degree sea level rise might inundate our homes and farmland (this photo was taken just a couple minutes' walk from my office)...
  • or whether our forests might become more susceptible to fire...
  • or outbreaks of disease.
    Our stakeholders are potentially making policy and engineering decisions about incredibly expensive infrastructure based on the results of our impacts models. We have a responsibility to openly provide the results of each phase of the climate modeling pipeline. This provides a higher degree of transparency and facilitates greater stakeholder engagement with a tighter feedback loop. Think Agile development, but for science.
  • To answer these kinds of questions requires us to have information at a very small scale, which brings us to “high-resolution”. As applied to climate data, the term “high-resolution” is a little difficult to define. For impacts models, our users typically require landscape level information or better. So “high” is a relative term that usually means as high as you can get. But allow me a brief interlude to first explain where climate data comes from and how we get to landscape level information.
  • Modeling centers around the world develop and run what are called Global Climate Models (GCMs). GCMs are derived from fundamental physical laws and include a variety of features that affect the climate system.
  • They run at a relatively coarse scale, one pixel representing a few hundred kilometers square. The output from these GCMs can then be transformed to a finer resolution, sometimes using statistical relationships, sometimes by running the model again on a smaller domain.
  • This transformation step is a process referred to as downscaling. The result of this downscaling step is what is used to drive impacts models and that is our target data for distribution.
  • This is a very simplified flow diagram of the climate data pipeline, but to be clear, there is a lot of sophisticated physics, climate science, statistics, machine learning and other domain knowledge that goes into each of these steps, each of which has its own set of scientific questions to be answered and associated uncertainties.
  • So hopefully I have explained these parts well enough for you to get by and we can move on to putting the data on the web.
  • It turns out that it's hard to do! There are a variety of technical challenges that are worth mentioning.
  • If you were paying attention through that last slide, you may have thought, "hey, that sounds like a lot of moving parts". If you consider that climate scientists are modeling some number of future scenarios for human emissions, multiply that by all of the different global climate models, by all of the different regional climate models, by all of the different types of downscaling, by some number of measured quantities, by time, by space... all of a sudden we're talking about a lot of data.
    You see, unlike observations, we're not limited by the sensors or satellites that we can afford to deploy. We can just create data out of thin air... or at least out of a model that is churning away at simulating the earth and the ocean.
    We're not even limited by what has happened in the past. We are projecting the future. And not just one future, but many, many realizations of the future. As modeled by many people and many modeling centers.
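To make that multiplication concrete, here is a back-of-the-envelope sketch. Every count below is a hypothetical placeholder, not a figure from the talk, chosen only to show how quickly the factors compound.

```python
# Hypothetical counts for each factor in the climate-data pipeline;
# none of these figures come from the talk, they are placeholders.
n_scenarios = 4            # emission scenarios
n_gcms = 20                # global climate models
n_downscaling = 3          # downscaling methods
n_variables = 5            # e.g. temperature, precipitation, ...
grid_points = 100_000      # spatial grid cells
timesteps = 150 * 365      # daily data over 150 years

values = (n_scenarios * n_gcms * n_downscaling
          * n_variables * grid_points * timesteps)

# at 4 bytes per value, a rough storage footprint in terabytes
terabytes = values * 4 / 1e12
print(f"{values:,} values ~ {terabytes:.0f} TB")
```

Even with modest placeholder counts, the product lands in the trillions of values, which is why "just put the files on FTP" stops being a serving strategy.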
  • Additional challenges include that model grids are often irregular, curvilinear, or in projections that aren't supported by the FOSS that we want to use. And in simulating global climate, the poles and high latitudes are often very important to the climate system. But those same places are often overlooked by popular, well-supported projections like web mercator, since few people live there.
  • Also, the geospatial component of the data is not the only component. Maps are usually what we use in this community as our final outputs, but with all the different permutations of our data, this would end up being tens of thousands of maps, or billions if you consider the full time dimension.
  • With those challenges in mind, let's get back to the web, and I'm going to go through a demo of what we developed to accommodate data delivery to our users.
    I'll go through this demonstration as a hypothetical user of our climate services. This user, Alice, is an engineer with BC's Ministry of Transportation. She is working in one of BC's remote coastal communities, attached to the outside world by one ferry and a single road, and is assessing its vulnerability to extreme precipitation and its effects on roads, culverts, bridges, and other critical infrastructure.
    Alice wants plausible future climate scenarios for the watersheds around the highway. We give the user a map to select their region of interest with an overlay of the climate raster to see where information is available.
    On the right hand side, we give our user a tree of various scenarios to select from, corresponding to different greenhouse gas concentration pathways, different sets of GCMs, and different downscaling methods.
    There's a time selector in case they're only interested in a subset of the future. Perhaps they only desire an analysis of the past plus the next forty years, corresponding to the projected lifespan of their bridge.
    Then the user just has to select an output format and hit download, and the data starts streaming.
    Now, the datasets that a user could download are potentially very large. Each scenario of the full spatiotemporal domain is around 150 GB, which isn't ridiculous to serve up as a static file over HTTP. However, as soon as you want to write a web app around it, allow dynamic responses, subset requests, etc., all of a sudden you have a lot of I/O-bound computations to make before your HTTP response goes back, which, ideally, should happen in less than one second. So, while we're waiting for the data to download, I'll go through the FOSS components that we used off the shelf and some of the more "creative" things that we had to do to make all this work.
  • To give a sense of the breakdown between how much plug-and-play we were able to do versus how much custom development work we did, here is a comparison. Everything in the left column, we have been able to utilize out of the box to meet our needs. On the right you'll see a couple of python packages to which we made heavy modifications and then there's our data management schemas and the user interface.
  • Placing those components on the UI: the front-end is obviously OpenLayers, with a basemap based on OpenStreetMap data rendered by Mapnik, served through TileCache. The climate overlay layer is being served through ncWMS, an aptly named program that takes NetCDF input files and serves them as WMS layers. The data service layer is an OPeNDAP server named PyDAP, which has been heavily modified by us and will eventually be released as open source.
  • One of the technical problems that we ran up against was that all of the available OPeNDAP data servers load their responses entirely into RAM before sending them out. So if you want to serve up large data sets, the size of your response is limited by your available RAM divided by the number of concurrent responses that you are prepared to serve. If you try to make a request to, say, a THREDDS OPeNDAP server that's larger than the JVM's allocated memory, the user will just get back a BadRequest error.
    For some applications this may be fine, or even desirable, but for the purposes of serving large data sets, the network pipe is usually the bottleneck. Rather than annoy and frustrate the user by forcing them to carve up their data requests to be arbitrarily small, we wanted to allow as large a request as the users were prepared to accept.
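A minimal sketch of the streaming alternative, assuming nothing about the real server's internals: instead of materializing the whole response in RAM, read and yield the payload in fixed-size chunks, so peak memory stays bounded by the chunk size rather than by the response size. The chunk size and the in-memory payload here are illustrative.

```python
import io

def stream_bytes(fileobj, chunk_size=64 * 1024):
    """Yield a file-like object's contents one chunk at a time,
    keeping peak memory near chunk_size regardless of total size."""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        yield chunk

# Simulate a large response body with an in-memory buffer.
payload = io.BytesIO(b"x" * 200_000)
chunks = list(stream_bytes(payload, chunk_size=64 * 1024))
print(len(chunks))  # three full 64 KiB chunks plus one partial chunk
```

With this shape, the response size is limited by the client's patience and the network pipe, not by server RAM divided by concurrency.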
  • One of my big pet peeves is when you go to a website to get data, you home in on what you want, the anticipation builds, you're ready for your data, and then you get some dialog box suggesting that
    “We'll e-mail you when your data is ready!” It's the biggest let down to focus in on this dataset that you want and need and then to have to turn your attention to something else while the server puts you, the user, on indefinite hold until it's ready.
    I don't intend any disrespect to these websites which I have shown, but at PCIC, one of our key requirements was to avoid this tragic UX mistake and give the user an instant response and begin streaming the data immediately.
  • Enter generators and coroutines. Generators are a programming construct in which a function, rather than returning, can yield execution and return values one at a time, on demand. They have the performance advantage of maintaining a low memory footprint: if you want to return something large, you don't have to do so all at once. They also tend to be slightly faster, because you avoid much of the stack-manipulation overhead of repeated function calls.
    Generators have been around for a good thirty-five years, but have been experiencing a bit of a renaissance lately. In Python they are extremely easy to use, and with the advent of big data applications, they have a lot of utility.
  • For those who aren't familiar, here's a quick example to understand generators. Generating a Fibonacci sequence is the quintessential toy example. The generator function, fibonacci(), is defined at the top. You'll notice that it's an infinite loop, because the sequence is, by definition, infinite. But rather than building up the values in memory, it just has a simple and elegant "yield" statement right inside the loop. The calling loop down below actually pulls items from the function, one at a time, and then does whatever it needs to do with them. It's fast, efficient, and actually fairly elegant, readable code, too.
    So you can see, for something like a web application serving big datasets, this is perfect, because we can provide a very low latency response, and then stream the data to the user as our high-latency operations like disk reads take place.
    None of the OPeNDAP servers out there supported streaming, so our development team has made substantial additions to the PyDAP OPeNDAP server to utilize generators for serving large datasets. We hope to open source these changes later in the year. All of the code will be available soon enough, so I won't go into it, but know that generators were the key enabling technology.
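The pattern described above, hooking a generator directly into the response body so that headers go out immediately and data follows as it is produced, can be sketched with a toy WSGI app. This is an illustrative stand-in, not the actual PyDAP modification; `fake_rows` is a hypothetical substitute for the real, slow disk reads.

```python
def fake_rows(n):
    """Stand-in for slow, I/O-bound reads (e.g. pulling slices of a
    large NetCDF variable off disk). Each row is produced on demand."""
    for i in range(n):
        yield f"{i},{i * i}\n".encode("utf-8")

def app(environ, start_response):
    # Headers are sent before any data has been read: the client sees
    # an instant response, then the body streams out row by row as the
    # WSGI server iterates over the generator.
    start_response("200 OK", [("Content-Type", "text/csv")])
    return fake_rows(5)

# Minimal WSGI-style invocation, without a real server.
status_seen = []
def start_response(status, headers):
    status_seen.append(status)

body = b"".join(app({}, start_response))
print(status_seen[0])
print(body.decode("utf-8"))
```

Because WSGI response bodies are just iterables, swapping a fully materialized list for a generator is all it takes to turn "hurry up and wait" into an immediate, low-latency stream.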
  • With the generators discussion behind us, let's turn back to our demo. We're probably done downloading the data, and it looks like we are.
    Our engineer has now quickly downloaded gigabytes of custom-selected climate scenario output with just a few clicks. The data is fully attributed with metadata, with units, and with references and citations to the methods used to perform the downscaling. In the spirit of open science, having all of the metadata directly attached to the data is actually a pretty big deal, because it ensures data provenance is trackable even if further operations are performed on the data later on, which is highly likely. From here, it's relatively easy for her to plug the numbers into whatever impacts model she wants to run.
  • With that, I'll leave you with my simple conclusions. Governments use our downscaled climate model output to plan for the effects of climate change on their infrastructure. There's so much model output that data delivery is a non-trivial problem. We've tried to make it as easy as possible for our users to narrow the data down to what they actually need, and we stream it to them right away. And hopefully later this year, we'll be able to open source our work and make it available to the community.
  • Finally, I'll close by thanking all of you FOSS developers out there, because without this community, none of this would have been at all possible. I'll be happy to answer any questions at this time.

    1. Web delivery of giant climate data sets to facilitate open science James Hiebert Pacific Climate Impacts Consortium University of Victoria Victoria, BC Canada FOSS4G NA May 22, 2013
    2. Introduction How to serve... high-resolution, downscaled, climate data on the web?
    3. Pacific Climate Impacts Consortium (PCIC) ● Regional climate services provider ● Non-profit hosted at the University of Victoria ● Fill a niche between pure research and applied science
    4. Consortium Partners
    5. Consortium Partners
    6. Consortium Partners
    7. Consortium Partners
    8. Consortium Partners
    9. Consortium Partners
    10. High-resolution How to serve... high-resolution, downscaled, climate data on the web?
    11. Global Climate Models (GCMs) IPCC AR4 Report, 2007
    12. GCM Resolution IPCC AR4 Report, 2007
    13. Global to Regional Scale
    14. Downscaling How to serve... high-resolution, downscaled, climate data on the web?
    15. Downscaling GCMs (100-500 km) RCMs (15-50 km) Downscaling (1-4 km) Impacts Models
    16. Moving on How to serve... high-resolution, downscaled, climate data on the web?
    17. Technical Challenges ● Data volume ● Irregular grids and projections ● Large dimensionality
    18. Challenges: Data Volume n emission scenarios x m GCMs x o RCMs x p downscaling methods x variables x space x time Big data
    19. Challenges: Grids and Projections
    20. Challenges: Large dimensionality
    21. Demo
    22. FOSS Components Off the shelf ● PostgreSQL ● Python/h5py ● TileCache ● ncWMS ● OpenLayers By PCIC (or heavily modified) ● Pydap-3.2 ● Python/pupynere ● The metadata database indexing and management ● The whole UI
    23. FOSS Components OpenStreetMap OpenLayers from ncWMS POST to PyDAP
    24. Big data, big RAM, BadRequest. Oh my!
    25. Hurry up and wait
    26. Generators: '70s tech that works today! ● a function which yields execution rather than returning ● yields values one at a time, on-demand ● low memory footprint ● faster; calling overhead ● elegant!
    27. Generator Example
        from itertools import islice
        def fibonacci():
            a, b = 0, 1
            while True:
                yield a
                a, b = b, a+b
        # print the first 10 values of the fibonacci sequence
        for x in islice(fibonacci(), 10):
            print(x)
    28. Back to the demo
    29. Conclusions ● Governments use climate model output to proactively plan adaptation strategies ● Climate model output is pretty big ● Don't make the user wait for their data ● Generators are awesome ● Our work should be available in PyDAP later this year
    30. Acknowledgments and Questions Thanks to: ● My small dev team ● PCIC's consortium members ● Roberto De Almeida (for PyDAP) ● You! (OSS developers and supporters)
