Unidata’s Common Data Model

John Caron
Unidata/UCAR
Nov 2006
Goals / Overview
• Look at the landscape of scientific
datasets from a few thousand feet up.
• What semantics are needed to make
these useful?
– georeferencing
– specialized subsetting
What’s a Data Model?
• An Abstract Data Model describes data objects
and what methods you can use on them.
• An API is the interface to the Data Model for a
specific programming language
• A file format is a way to persist the objects in
the Data Model.
• An Abstract Data Model removes the details of
any particular API and the persistence format.
Common Data Model Layers
• Scientific Datatypes: Point, Trajectory, Radial, Grid, Station, Swath, Profile
• Coordinate Systems
• Data Access
NetCDF-Java version 2.2 architecture
[Diagram labels: Application; Scientific Datatypes; Datatype Adapter; NetcdfDataset; CoordSystem Builder; NetcdfFile; I/O service provider; ADDE; THREDDS; OPeNDAP; Catalog.xml; NcML; NetCDF-3; NetCDF-4; HDF5; GRIB; NIDS; GINI; Nexrad; DMSP; …]
NetCDF-4 and Common Data Model (Data Access Layer)
I/O Service Provider Implementations
• General: NetCDF, HDF5, OPeNDAP
• Gridded: GRIB-1, GRIB-2
• Radar: NEXRAD level 2 and 3, DORADE
• Point: BUFR, ASCII
• Satellite: DMSP, GINI
• In development
– NOAA: GOES (Knapp/Nelson), many others
Coordinate Systems needed
• NetCDF, OPeNDAP, HDF data models do
not have integrated coordinate systems
– so georeferencing not part of API
– Need conventions to specify them (e.g. CF-1,
COARDS)

• Contrast GRIB, HDF-EOS, other
specialized formats
NetCDF Coordinate Variables
dimensions:
lat = 64;
lon = 128;
variables:
float lat(lat);
float lon(lon);
double temperature(lat,lon);
Coordinate Variables
– One-dimension variable with same
name as its dimension
– Strictly monotonic values
– No missing values
The coordinates of a point (i,j,k) are
{CV1(i), CV2(j), CV3(k)}
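
The rules on this slide can be sketched in a few lines. The following is a minimal Python illustration, not NetCDF library code: the `lat` and `lon` lists are made-up stand-ins for the coordinate variables in the CDL above, and the function names are hypothetical.

```python
# Made-up values standing in for "float lat(lat)" and "float lon(lon)".
lat = [-90.0 + i * (180.0 / 63) for i in range(64)]
lon = [i * (360.0 / 128) for i in range(128)]


def is_coordinate_variable(values):
    """Check the two value rules: no missing values, strictly monotonic."""
    if any(v is None or v != v for v in values):  # None or NaN counts as missing
        return False
    increasing = all(a < b for a, b in zip(values, values[1:]))
    decreasing = all(a > b for a, b in zip(values, values[1:]))
    return increasing or decreasing


def coordinates_of(index, coord_vars):
    """Return (CV1(i), CV2(j), ...) for an index tuple (i, j, ...)."""
    return tuple(cv[i] for cv, i in zip(coord_vars, index))
```

Each coordinate variable is indexed by exactly one element of the index tuple, which is the restriction the next slides relax.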
Limitations of 1D Coordinate Variables
• Non lat/lon horizontal grids:
float temperature(y,x);
float lat(y, x);
float lon(y, x);
• Trajectory data:
float NKoreaRadioactivity(pt);
float lat(pt);
float lon(pt);
float altitude(pt);
float time(pt);
General Coordinates in CF-1.0
float P(y,x);
P:coordinates = "lat lon";
float lat(y, x);
float lon(y, x);
float Sr90(pt);
Sr90:coordinates = "lat lon altitude time";
Coordinate Systems (abstract)
• A Coordinate System for a data variable is
a set of Coordinate Variables2 such that the
coordinates of the (i,j,k) data point are
{CV1(i,j,k),CV2(i,j,k),CV3(i,j,k),CV4(i,j,k)…}
(the previous definition was {CV1(i), CV2(j), CV3(k)})

• The dimensions of each Coordinate
Variable must be a subset of the
dimensions of the data variable.
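
As a rough illustration of the subset rule, the sketch below resolves a CF-style `coordinates` attribute and rejects any coordinate variable whose dimensions are not all dimensions of the data variable. The `variables` dictionary and the function name are assumptions made for this example; they are not part of any NetCDF API.

```python
# Hypothetical in-memory description of a file:
# variable name -> (dimension names, attributes).
variables = {
    "P": (("y", "x"), {"coordinates": "lat lon"}),
    "lat": (("y", "x"), {}),
    "lon": (("y", "x"), {}),
    "height": (("t", "z", "y", "x"), {}),
}


def coordinate_system(data_var, variables):
    """Return the coordinate variable names listed for data_var.

    Raises ValueError if a coordinate variable uses a dimension that the
    data variable does not have (the subset rule).
    """
    data_dims, attrs = variables[data_var]
    names = attrs.get("coordinates", "").split()
    for name in names:
        coord_dims, _ = variables[name]
        if not set(coord_dims) <= set(data_dims):
            raise ValueError(
                f"{name}{coord_dims} uses dimensions not in {data_var}{data_dims}"
            )
    return names
```

Here `height(t,z,y,x)` could not serve as a coordinate for `P(y,x)`, because `t` and `z` are not dimensions of `P`.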
Need Coordinate Axis Types
float gridData(t,z,y,x);
float time(t);
float y(y);
float x(x);
float lat(y,x);
float lon(y,x);
float height(t,z,y,x);

float radialData(radial, gate);
float distance(gate);
float azimuth(radial);
float elevation(radial);
float time(radial);
The same??
float stationObs(pt);
float lat(pt);
float lon(pt);
float z(pt);
float time(pt);

float trajectory(pt);
float lat(pt);
float lon(pt);
float z(pt);
float time(pt);
Revised Coordinate Systems
1. Specify Coordinate Variables
2. Specify Coordinate Types
(time, lat, lon, projection x, y, height,
pressure, z, radial, azimuth, elevation)

3. Specify connectivity (implicit or
explicit) between data points
– Implicit: Neighbors in index space are
(connected) neighbors in coordinate
space. Allows efficient searching.
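
To make the "efficient searching" point concrete, here is a small Python sketch under the assumption of a strictly increasing 1D axis. With implicit connectivity a binary search finds the nearest index, while an unconnected set of points has to be scanned in full. Both function names are hypothetical.

```python
from bisect import bisect_left


def nearest_index(axis, value):
    """Index of the element of a strictly increasing 1D axis closest to value."""
    pos = bisect_left(axis, value)
    if pos == 0:
        return 0
    if pos == len(axis):
        return len(axis) - 1
    # value lies between axis[pos - 1] and axis[pos]; pick the closer one
    return pos if axis[pos] - value < value - axis[pos - 1] else pos - 1


def nearest_index_unconnected(points, value):
    """Without connectivity, every point has to be examined."""
    return min(range(len(points)), key=lambda i: abs(points[i] - value))
```

The first function does O(log n) work per lookup and the second O(n), which matters for the large collections discussed later.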
Gridded Data
float gridData(t,z,y,x);
float time(t); // Time
float y(y); // GeoY
float x(x); // GeoX
float z(t,z,y,x); // Height or Pressure
• Cartesian coordinates
• All dimensions are connected
Connected means neighbors in index space are neighbors in coordinate space
Coordinate Systems UML
Scientific Data Types
• Based on datasets Unidata is familiar with
– APIs are evolving

• How are data points connected?
• Intended to scale to large, multifile
collections
• Intended to support “specialized queries”
– Space, Time

• Corresponding “standard” NetCDF file
conventions
Gridded Data
• Cartesian coordinates
• All dimensions are connected
• x, y, z, time
• recently added runtime and ensemble
• refactored into GridDatatype interface
float gridData(t,z,y,x);
float time(t);
float y(y);
float x(x);
float lat(y,x);
float lon(y,x);
float height(t,z,y,x);
GridDatatype methods
CoordinateAxis getTaxis();
CoordinateAxis getXaxis();
CoordinateAxis getYaxis();
CoordinateAxis getZaxis();
Projection getProjection();
int[] findXYindexFromCoord(double x_coord, double y_coord);
LatLonRect getLatLonBoundingBox();
Array getDataSlice(Range[] …);
GridDatatype makeSubset(Range[] …);
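
The idea behind these methods can be sketched with a toy 2D grid. The Python class below is an illustration only: it is not the NetCDF-Java `GridDatatype`, its method names merely echo the ones on the slide, and it assumes 1D x and y axes with data stored as `data[j][i]`.

```python
class GridSketch:
    """Toy 2D grid with 1D x and y axes; not the NetCDF-Java GridDatatype."""

    def __init__(self, x_axis, y_axis, data):
        self.x_axis = x_axis
        self.y_axis = y_axis
        self.data = data  # data[j][i] belongs to (y_axis[j], x_axis[i])

    def find_xy_index_from_coord(self, x_coord, y_coord):
        """Return [i, j] of the grid cell nearest to the given coordinates."""
        i = min(range(len(self.x_axis)), key=lambda k: abs(self.x_axis[k] - x_coord))
        j = min(range(len(self.y_axis)), key=lambda k: abs(self.y_axis[k] - y_coord))
        return [i, j]

    def make_subset(self, y_range, x_range):
        """Return a new GridSketch restricted to the given Python ranges."""
        return GridSketch(
            [self.x_axis[i] for i in x_range],
            [self.y_axis[j] for j in y_range],
            [[self.data[j][i] for i in x_range] for j in y_range],
        )
```

The point of returning a grid rather than a bare array from `make_subset` is that the subset keeps its own coordinates, so it stays georeferenced.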
Radial Data
• Polar coordinates
• All dimensions are connected
• No separate time dimension
radialData(radial, gate) :
distance(gate)
azimuth(radial)
elevation(radial)
time(radial)
Swath
• lat/lon coordinates
• No separate time dimension
• All dimensions are connected
swathData(line,cell)
lat(line,cell)
lon(line,cell)
time(line)
z(line,cell) ??
Point Observation Data
• Set of measurements at the
same point in space and time
• Point dimension not connected
float obs1(pt);
float obs2(pt);
float lat(pt);
float lon(pt);
float z(pt);
float time(pt);
Structure {
lat, lon, z, time;
v1, v2, ...
} obs( pt);
PointObsDataset Methods
// Iterator<StructureData>
Iterator getData(
LatLonRect boundingBox,
Date start, Date end);
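
A space/time query over unconnected points amounts to a filter. The Python sketch below assumes a hypothetical `PointObs` record and passes the bounding box as four numbers instead of a `LatLonRect`; it ignores longitude wrap-around at the date line.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Iterator


@dataclass
class PointObs:
    """Hypothetical record mirroring Structure { lat, lon, z, time; v1, v2, ... }."""
    lat: float
    lon: float
    z: float
    time: datetime
    values: dict


def get_data(obs: Iterable[PointObs], lat_min, lat_max, lon_min, lon_max,
             start: datetime, end: datetime) -> Iterator[PointObs]:
    """Yield observations inside the bounding box and time window.

    The pt dimension is not connected, so every observation is tested.
    Longitude wrap-around is not handled in this sketch.
    """
    for ob in obs:
        if (lat_min <= ob.lat <= lat_max
                and lon_min <= ob.lon <= lon_max
                and start <= ob.time <= end):
            yield ob
```

Returning an iterator rather than a list lets a caller stream through a large collection without holding it all in memory.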
Time series Station Data
Structure {
name;
lat, lon, z;
Structure{
time;
v1, v2, ...
} obs(*); // connected
} stn(stn); // not connected
StationObs Methods
// List<Station>
List getStations(
LatLonRect boundingBox);
// Iterator<StructureData>
Iterator getData(
Station s,
Date start, Date end);
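
The nested station structure lends itself to a two-step query: filter the unconnected station dimension, then use the connected (time-ordered) obs dimension for a fast range lookup. This Python sketch assumes a hypothetical `Station` record whose `times` list is sorted; times can be any comparable values, and the bounding box is again four plain numbers.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass, field


@dataclass
class Station:
    """Hypothetical record mirroring Structure { name; lat, lon, z; obs(*) }."""
    name: str
    lat: float
    lon: float
    z: float
    times: list = field(default_factory=list)  # sorted: the obs dimension is connected
    obs: list = field(default_factory=list)    # obs[k] was measured at times[k]


def get_stations(stations, lat_min, lat_max, lon_min, lon_max):
    """Stations are not connected, so each one is tested against the box."""
    return [s for s in stations
            if lat_min <= s.lat <= lat_max and lon_min <= s.lon <= lon_max]


def get_data(station, start, end):
    """Binary search on the sorted times, then iterate over the matching obs."""
    lo = bisect_left(station.times, start)
    hi = bisect_right(station.times, end)
    return iter(station.obs[lo:hi])
```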
Trajectory Data
• pt dimension is connected
• Collection dimension not
connected
Structure {
lat, lon, z, time;
v1, v2, ...
} obs(pt); // connected
Structure {
name;
Structure {
lat, lon, z, time;
v1, v2, ...
} obs(*); // connected
} traj(traj); // not connected
Profiler/Sounding Station Data
Structure {
name;
lat, lon, time;
Structure {
z;
v1, v2, ...
} obs(*); // connected
} loc(nloc); // not connected
Structure {
name;
lat, lon;
Structure {
time;
Structure {
z;
v1, v2, ...
} obs(*); // connected
} time(*); // connected
} stn(stn); // not connected
Unstructured Grid
• Pt dimension not connected
• Looks the same as point data
• Need to specify the connectivity
explicitly
float unstructGrid(t,z,pt);
float lat(pt);
float lon(pt);
float time(t);
float height(z);
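
One way to picture explicit connectivity is a list of cells that name the `pt` indices they join. The Python sketch below is purely illustrative: the slides do not say how connectivity would be stored, and the triangle list and function name are invented for the example.

```python
# Invented example: each triangle lists the pt indices of its three corners.
triangles = [(0, 1, 2), (1, 2, 3)]


def neighbors(pt, triangles):
    """Return the set of points that share at least one triangle with pt."""
    found = set()
    for tri in triangles:
        if pt in tri:
            found.update(tri)
    found.discard(pt)
    return found
```

Unlike the gridded case, neighbors here cannot be derived from index arithmetic; they exist only because the cell list says so.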
Data Types Summary
• Data access through a standard API
• Convenient georeferencing
• Specialized subsetting methods
– Efficiency for large datasets
Payoff
N + M instead of N * M things on your TODO List!
[Diagram: File Format #1, File Format #2, … File Format #N on one side of the CDM; Visualization & Analysis, NetCDF file, OPeNDAP Server, WCS Service, Web Service on the other]
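
The N + M argument can be shown in miniature. In the Python sketch below every name is hypothetical: each reader converts one made-up format into a common dictionary, and each consumer understands only that dictionary, so adding a format or a consumer costs one new entry, not one per pairing.

```python
# Hypothetical readers: each maps a format-specific record to a common dict.
readers = {
    "format_1": lambda raw: {"name": raw[0], "values": list(raw[1])},
    "format_2": lambda raw: {"name": raw["id"], "values": raw["data"]},
}

# Hypothetical consumers: each understands only the common dict.
consumers = {
    "mean": lambda ds: sum(ds["values"]) / len(ds["values"]),
    "summary": lambda ds: f"{ds['name']}: {len(ds['values'])} values",
}


def run(fmt, raw, consumer):
    """Route any format to any consumer through the common model."""
    return consumers[consumer](readers[fmt](raw))
```

With two readers and two consumers there are four pieces of code to maintain and four possible pairings; at ten of each the count is 20 pieces instead of 100.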
THREDDS Data Server
[Diagram labels: Application; hostname.edu; HTTP Tomcat Server; THREDDS Server (OPeNDAP, HTTPServer, WCS); Catalog.xml; NetCDF-Java library; Datasets; IDD Data]
Next: DataType Aggregation
• Work at the CDM DataType level, know (some)
data semantics
• Forecast Model Collection
– Combine multiple model forecasts into single
dataset with two time dimensions
– With NOAA/IOOS (Steve Hankin)
• Point/Station/Trajectory/Profile Data
– Allow space/time queries, return nested sequences
– Start from / standardize “Dapper conventions”
Forecast Model Collections
Conclusion
• Standardized Data Access in good shape
– HDF5, NetCDF, OPeNDAP
– Write an IOSP for proprietary formats (Java)

• But that’s not good enough!
• To do:
– Standard representations of coordinate
systems
– Classifications of data types, standard
services for them


Editor's Notes

  • #8 Diversity of formats:
  • #9 Appropriate design decision for General formats
  • #34 Need a more dynamic system for real-time and very large datasets. Catalog is a file, but these are services, that is, code. Show IDD Server catalog – show satellite DQC, then show radar DQC