[RFC-004]Comment-3
Organization: IRI
Review of the DAP 2.0 Standard (14 Sep 2004)
Let me start by saying that DAP 2.0 is an essential part of the way we handle data. We use it to transfer data from our collection to applications (particularly Matlab and GrADS). We use it to transmit data and data architecture between servers, so that our collection of servers appears as a single entity. And we use it to access and process data residing at other institutions that serve data with OpenDAP servers (almost always software other than what we use). While our servers offer all of our data in a multitude of formats, DAP downloads constitute more than half our downloads, primarily because applications can directly access the data that is needed for a particular analysis. This is particularly important for large datasets, where the effort required to organize the data would be substantial if one were simply to download the entire dataset, and frequently is beyond the ability of the average user.
DAP 2.0 is particularly important for us because it allows us to transmit the structure of any analysis that can be performed by our servers, as well as the original data. Many other standards have metadata and/or structural requirements that make it impossible to transmit many sorts of analyses (e.g. insisting on fully conforming metadata, on spatial data, or on a small fixed number of dimensions). DAP 2.0, by contrast, allows us to transmit any analysis we can perform on our data, as well as structure our data with as many dimensions as is appropriate (time, space, height, ensemble member, start time, lead time, ...). This means that analyses that happen to be beyond the scope of a particular piece of software can be requested on the server, and then operated on by the software that the user is comfortable with. This also allows us to replicate our server, so that all its material can be made available (or used as input) on another server.
The key point in all this is that DAP addresses a particular part of the data problem: transport of dataset structure, metadata, and data. Metadata standards are important for data reuse, but whatever metadata standard is chosen, the transport mechanism can be reused.
The experience of developing DAP as embodied in DAP 2.0 has been an important one for the community, and I think there have been some valuable lessons. An important part of DAP from the beginning was that there were always several APIs to connect software to the DAP libraries. In particular, there is a netcdf API that allows a netcdf program to access a DAP dataset simply by relinking the code. There is a Matlab connector that reads DAP data into Matlab arrays and, more recently, Matlab structures. And there is a Java library that reads the data into objects. A connector for Ingrid (an object-oriented data flow server/client) was also developed fairly early on. Multiple APIs are very powerful, because these different programs with different data models can now interchange data.
The problem is that the APIs did not take full responsibility for mapping the entire DAP data model into the data model of their particular API. Sometimes this is not so bad: Matlab arrays hold the data for pretty much the span of DAP data objects. But the netcdf API is more ambitious in that it uses the structure of the data as well as the numerical values. The first problem was that DAP allows nested structures while netcdf has a flat variable name space: the interface did not at first take responsibility for mapping the nest into extended names. I believe it now does.
More subtly, netcdf has global dimensions, whereas DAP in theory can have different MAP-d variables in different GRIDS/structures. Again, the netcdf API did not take responsibility for generating global dimension names, so locality was essentially ruined if one wanted one's datasets readable by the netcdf interface. This became more widely apparent with the Java interface, where locality is naturally honored. (While the Ingrid interface has locality, its DAP connection maps the dimension names into a flat space before transmitting the dataset, and reverses the process when reading the dataset structure back in. But Ingrid is not widely understood, so common practice did not change.)
The other half of this is that pure locality is actually a fairly bad idea when transmitting most of our datasets -- they usually have many common dimensions, information that is important to retain in transmitting the dataset structure. So an essential feature of the next generation of DAP will be to enforce the name locality that is natural in object-oriented data models, but have a mechanism for indicating common dimensions among different parts of the datasets. Sequences also are an issue -- the netcdf API does not translate them, either.
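To make the name-mapping problem concrete, here is a minimal sketch of what "mapping the nest into extended names" could look like. The dict-based dataset, the function name flatten, and the "." separator are all assumptions made for illustration; this is not how any existing DAP client library is known to do it.

```python
def flatten(structure, prefix=""):
    """Map a nested structure onto a flat name space.

    `structure` is a hypothetical stand-in for a DAP dataset: a dict whose
    values are either nested dicts (structures/grids) or leaf data.
    Names are joined with "." (an assumed separator).
    """
    flat = {}
    for name, value in structure.items():
        full_name = f"{prefix}.{name}" if prefix else name
        if isinstance(value, dict):
            # Recurse into the nested structure, carrying the qualified prefix.
            flat.update(flatten(value, full_name))
        else:
            flat[full_name] = value
    return flat


# Two grids, each with its own local "time" map holding different values.
dataset = {
    "sst": {"time": [0, 1], "values": [10.0, 11.0]},
    "wind": {"time": [0, 6], "u": [1.0, 2.0]},
}
flat = flatten(dataset)
```

In this sketch the two local time maps come out as sst.time and wind.time, so they no longer collide in a flat name space. It does nothing to record that two parts of a dataset share a dimension, which is the other half of the problem described above.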
I believe an important feature of the next DAP will be to have each API fully map the DAP data model into the data model embodied in the API, taking responsibility for all the transformations necessary.
Transmitting metadata is definitely something the next version of DAP should do better, though it is not totally clear what needs to be transmitted. But just as datasets have structure that needs to be transmitted (an issue that DAP addresses), metadata also has structure that needs to be transmitted. Clearly DAP already associates attributes with data objects, an important part of metadata structure. But there is additional information that traditionally has been poorly structured: what standards do the attributes belong to, how do different attributes interact, what is the range of possible values for a particular attribute? This is a key part of meaningful attributes, and if DAP can provide a transport mechanism for this information as well, the transmission of datasets will again be enhanced.
In short, I think DAP 2.0 is an essential dataset transmission standard, and I believe it will evolve to become an even better dataset-transmission solution.
A particular point.
There is one edit I would very much like to make in the standard as it was written 14 Sep 2004:
I quote: (page 22)
Section 7.1.4 Date
The Date header provides a time stamp for the response. This header is needed for servers that support caching.
This is literally true, but extremely misleading. The Date header line is for computing a time correction between the client's time and the server's time, should there be any difference. That is nice for caching, but it is not the whole story by far, particularly if there are no other time-related headers in the HTTP header set. If there are no other times in the header, the Date line is useless.
What a cache really needs is
Last-Modified
which should be the time that the DAP response last changed, a function of the data last changing and the server code last changing. Without a Last-Modified tag, the response will not be cached.
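As a rough sketch of the rule just stated, a server could take the later of the two modification times and format it as an HTTP date. The function name and the use of POSIX timestamps are assumptions for illustration; how a real server tracks when its data and code changed is not specified here.

```python
from email.utils import formatdate


def last_modified_header(data_mtime, code_mtime):
    """Build a Last-Modified line from two POSIX timestamps (seconds).

    The response changes when either the data or the server code changes,
    so the later of the two times is the one to report.
    """
    latest = max(data_mtime, code_mtime)
    # usegmt=True yields the "GMT" form that HTTP dates use.
    return "Last-Modified: " + formatdate(latest, usegmt=True)


# Data written at the epoch, server code updated one day later.
line = last_modified_header(0, 86400)
```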
Another quite useful (but not required) http header line is
Expires
which gives the cache a time before which it is unnecessary to check for changes to the DAP response. The cache, of course, is free to use this information or not, and in the absence of an Expires line, it makes an algorithmic decision about whether to query the server again.
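The interplay of Date and Expires can be sketched as follows: the cache uses Date only to estimate how far its own clock is from the server's, and then applies that correction when testing Expires. This is a simplified illustration, not the full HTTP freshness algorithm; the function names and the plain-dict headers are assumptions.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def clock_offset(headers, received_at):
    """Server time minus client time, estimated from the Date header."""
    return parsedate_to_datetime(headers["Date"]) - received_at


def is_fresh(headers, received_at, now):
    """True if the cached response may be reused at client time `now`.

    `received_at` and `now` are timezone-aware client-clock datetimes.
    """
    expires = headers.get("Expires")
    if expires is None:
        # No explicit lifetime; a real cache would fall back to a heuristic.
        return False
    # Translate the client's clock into server time before comparing.
    server_now = now + clock_offset(headers, received_at)
    return server_now < parsedate_to_datetime(expires)


headers = {
    "Date": "Tue, 14 Sep 2004 12:00:00 GMT",
    "Expires": "Tue, 14 Sep 2004 13:00:00 GMT",
}
# The client's clock runs five minutes ahead of the server's.
received_at = datetime(2004, 9, 14, 12, 5, tzinfo=timezone.utc)
still_fresh = is_fresh(headers, received_at, received_at + timedelta(minutes=58))
```

In the example the client's clock already reads 13:03 at the time of the check, so a comparison without the Date correction would discard a response that the server still considers valid.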
There are other cache-related lines that are useful. While technically part of HTTP/1.1, they are used by the current generation of caching software. In particular,
Cache-Control: public
will allow a cache to cache a response that it might normally consider uncachable. Also,
Vary:
allows the server to specify which header lines need to be considered in order to decide if two responses are identical, in particular
Cache-Control: public
Vary: Authorization
will allow a cache to cache passworded responses, the responses being cached separately for each user. Vary should be used if the server returns different responses depending on what it reads in the header lines of the request, e.g. different formats or compensating for different versions.
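A minimal sketch of how a cache might honor Vary: the stored response is keyed on the URL plus the values of whichever request header lines Vary names. The function name, the tuple-shaped key, and the example URL are hypothetical, and real caches handle more cases (such as "Vary: *") than this does.

```python
def cache_key(url, request_headers, vary):
    """Build a cache key from the URL and the request headers named in Vary.

    `vary` is the value of the response's Vary line, e.g. "Authorization".
    Header names are compared case-insensitively.
    """
    names = sorted(n.strip().lower() for n in vary.split(",") if n.strip())
    lowered = {k.lower(): v for k, v in request_headers.items()}
    # A request header that is absent is recorded as None.
    return (url,) + tuple((name, lowered.get(name)) for name in names)


url = "http://example.org/data.dods"  # hypothetical URL
alice = cache_key(url, {"Authorization": "Basic YWxpY2U6eA=="}, "Authorization")
bob = cache_key(url, {"Authorization": "Basic Ym9iOnk="}, "Authorization")
```

With Vary: Authorization the two users get distinct keys, so each sees only their own cached copy, while request header lines that Vary does not name have no effect on the key.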
How much of this needs to be in the standard, I don't know. But mentioning the Date: line by itself will be certain to lead developers to ruin cacheability. My guess is that leaving it implicit is a bad idea, because developers will simply not get it right, so I would hope that at least this much information is included. It should be made clear, of course, that DAP 2.0 is simply pointing out relevant parts of the HTTP standard.