Interoperability issues for the VO

What do we mean by "interoperability"?

What interoperability do we want to achieve and how might we know when we have achieved it? I think the VO should be able to do the following tricks:

present a query on standard parameters to any archive without human translation of the parameters;
find out from any archive what other (non-standard) parameters can be used in a query in a human-readable form;
understand the results of queries sufficiently well that tables of results can be merged (e.g. combine all columns expressing redshift);
understand the results well enough that the data can be transformed (e.g. from magnitude to flux density);
understand the results well enough that the tabular results can be plotted as an overlay to an associated image.

"Understanding the results" implies that there exist

semantic definitions for the quantities to be understood (e.g. what does "V magnitude" really mean);
machine-readable names for columns in results tables that are entirely standard;
some way of expressing the units of measurement that is machine readable.

These criteria in turn imply a language for expressing queries that is richer than SQL or a GET CGI-call.

Dealing with non-standard query-parameters -- i.e. those that don't satisfy the above criteria -- implies a language for describing these parameters that can be passed through a system to a UI and there generate a form for a human user to fill in.
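
As a minimal sketch of that idea, the following Python fragment takes a machine-readable description of some non-standard parameters and turns it into a plain-text form. The ParamDescription fields and the two example parameters are assumptions invented for illustration; no existing archive is known to publish descriptions in this shape.

```python
from dataclasses import dataclass


@dataclass
class ParamDescription:
    """Hypothetical description of one non-standard query parameter."""
    name: str            # name the archive expects in a query
    label: str           # human-readable label shown on the form
    datatype: str        # e.g. "float", "int" or "string"
    unit: str = ""       # unit of measurement, empty if none applies
    help_text: str = ""  # optional hint for the human user


def render_form(params):
    """Return a plain-text form with one prompt line per parameter."""
    lines = []
    for p in params:
        unit = f" [{p.unit}]" if p.unit else ""
        line = f"{p.label}{unit} ({p.datatype}): ____"
        if p.help_text:
            line += f"   # {p.help_text}"
        lines.append(line)
    return "\n".join(lines)


# Example parameters; both are made up for this sketch.
params = [
    ParamDescription("exptime_min", "Minimum exposure time", "float", "s"),
    ParamDescription("pi_name", "Principal investigator", "string",
                     help_text="surname only"),
]
print(render_form(params))
```

The point of the design is that the intermediate layers of the system only pass the description along; only the UI needs to interpret it.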

Some existing standards and solutions

UCD

The unified column descriptions devised at CDS are a set of machine-readable, standard names for quantities that can be columns of results, and, presumably, parameters in queries. The UCD namespace is rich, and is extensible through being hierarchical. This is exactly what we need.

However, not all the names in UCD have obvious, unique semantics. An example is magnitudes and colours. PHOT_INT-MAG_V is integrated or isophotal V-magnitude, clearly, but which kind of V? PHOT_JHN_V is definitely Johnson V-magnitude, but how does this interact with the PHOT_INT-MAG series? PHOT_JHN_V_I is "Johnson" V-I colour; but which I band is implied? For automatic handling, all these aspects need to be made explicit.

We need a sub-set of UCD that is to be understood by VO software, and for that sub-set we need to define the semantics. We may find that the standard set of UCD names includes some new ones.

UCD does not define the units of measurement. A UCD label for a quantity needs to be qualified with a statement of the units.
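
To illustrate why the unit qualifier matters, here is a hedged sketch in Python of merging columns from two result tables by UCD. The table layout (dictionaries with "ucd", "unit" and "values" keys) and the small unit-conversion table are assumptions made for this example; only the label PHOT_JHN_V comes from the UCD list discussed above.

```python
# Conversion factors to a reference unit. This tiny hand-written table is
# an assumption for the sketch, not part of UCD or of any standard.
TO_REFERENCE = {
    "mag": ("mag", 1.0),
    "mmag": ("mag", 1e-3),
}


def merge_by_ucd(tables, ucd):
    """Collect the values of every column labelled `ucd`, in the reference unit."""
    merged = []
    for table in tables:
        for column in table["columns"]:
            if column["ucd"] != ucd:
                continue
            if column["unit"] not in TO_REFERENCE:
                raise ValueError(f"unknown unit {column['unit']!r}")
            _, factor = TO_REFERENCE[column["unit"]]
            merged.extend(value * factor for value in column["values"])
    return merged


# Two hypothetical result tables using different local names and units.
table_a = {"columns": [
    {"name": "Vmag", "ucd": "PHOT_JHN_V", "unit": "mag", "values": [12.5, 14.0]},
]}
table_b = {"columns": [
    {"name": "V", "ucd": "PHOT_JHN_V", "unit": "mmag", "values": [13250.0]},
]}

print(merge_by_ucd([table_a, table_b], "PHOT_JHN_V"))
```

Without the unit on each column, the software could match the two columns by UCD but could not combine their values safely.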

AstroRes

AstroRes is an XML vocabulary for describing and encapsulating tables of results. It defines a descriptive header for a table in a way analogous to a FITS header for a FITS table. The actual data of the table can be included either as a compact ASCII representation in CSV format or as individual elements of XML. Alternatively, the AstroRes header could be stored and transmitted separately from the data.

AstroRes descriptions can identify columns by UCD, can attach human-readable labels and descriptions, and can specify units.

AstroRes is used in Vizier (it is one of a number of possible output-formats, and about one third of all requests ask for AstroRes), by OASIS, and at HEASARC.

AstroRes seems to be trying to do exactly what we need according to the criteria at the start of this paper. The fact that it is widely used in getting data from Vizier implies that it works. Whether it works well enough in the context of a full VO is not known yet.

The encoding of metadata in AstroRes is good; the encoding of the actual data is weaker. How well can it handle vast tables? (E.g. could it handle sensibly the transfer of an entire survey from one site to another?) Does it need a binary representation for data? What tools exist to parse and use the format?

AstroRes is likely to be superseded by VOtable (see below).

VOtable

VOtable (no reference available yet) is an expansion of AstroRes. The main changes concern the data component:

Binary tables are supported. Two different binary formats are allowed, one of which uses a complete FITS-file as the data component.
The data component in the XML document need not embed the actual data. There is a syntax to convey a URI for an external data-file.
Various encodings of the data are considered, notably compression algorithms.

The initial paper describing VOtable suggests that it could be used to define queries as well as results. This could be done by annotating the XML header and sending it with no accompanying data-component. However, the existing VOtable vocabulary does not seem to me to allow proper expression of queries; some expansion would be needed.

VOtable might become a joint standard for AstroGrid, AVO and NVO. However, the current paperwork is not precise enough to allow adoption as a formal standard.

ASU

The Astronomical Server URL (ASU) from CDS is a proposed standard for the CGI interface to a data-service. It defines an exact syntax for

querying by position;
querying by supplying an explicit string of SQL;
selecting a particular data-set when the service can query many;
requesting an output format.

ASU seems to satisfy the criteria for allowing any archive to accept a query in standard language. However, there are problems:

the full syntax is only defined for queries by position;
in the query-by-position syntax there are many alternative forms, making it hard to code an interface to cover the full standard;
ASU is defined in terms of a "GET" CGI-call: it is limited to what can be written into a URL;
it is specific to CGI interfaces.

The variety of syntax is a historical accident. ASU is the union of some syntaxes used for a number of archives.

It seems to me that ASU, in its current form, isn't complete or flexible enough to help the VO very much. If data centres choose to implement ASU interfaces, then it makes interoperability a little easier, but the gain isn't enough to make it worth imposing ASU on all data-centres.

At present, ASU is a loose convention, not a formal standard; to make it a standard, we would have to refine it. If a data-service claims conformance to "ASU", a client currently has no way of knowing which parts of the syntax the service supports, and hence which queries might be acceptable. To resolve this, ASU needs to be broken down into uniquely-named profiles, for each of which there is only one syntax. Any given data-service can then state which of the profiles it supports, and client software can use this to format queries reliably. Ideally, the global VO should pick exactly one profile as the favoured standard and lobby for all archives to move to this standard.
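
The profile idea can be sketched in a few lines of Python. The profile names and the parameter names in the generated query strings are invented for this sketch; they are not taken from the ASU document, which defines no profiles at present.

```python
from urllib.parse import urlencode


def format_decimal_degrees(ra, dec, radius_deg):
    """Hypothetical profile: everything in decimal degrees."""
    return urlencode({"ra": ra, "dec": dec, "radius_deg": radius_deg})


def format_radius_arcmin(ra, dec, radius_deg):
    """Hypothetical profile: one position string, radius in arcminutes."""
    return urlencode({"pos": f"{ra},{dec}", "radius_arcmin": radius_deg * 60.0})


# One formatter per profile, so each profile has exactly one syntax.
FORMATTERS = {
    "position-decimal-degrees": format_decimal_degrees,
    "position-radius-arcmin": format_radius_arcmin,
}


def choose_profile(client_preference, service_profiles):
    """Return the first profile the client prefers that the service supports."""
    supported = set(service_profiles)
    for profile in client_preference:
        if profile in supported:
            return profile
    return None


def format_query(profile, ra, dec, radius_deg):
    """Format a query by position according to the chosen profile."""
    return FORMATTERS[profile](ra, dec, radius_deg)


# A service that declares support for a single profile.
declared = ["position-decimal-degrees"]
profile = choose_profile(list(FORMATTERS), declared)
if profile is not None:
    print(format_query(profile, 187.5, 12.25, 0.1))
```

If every archive declared the same single profile, the table of formatters would shrink to one entry, which is the outcome argued for above.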

GLU

GLU, the Generateur de Liens Uniformes from CDS, serves two purposes. Firstly, it provides indirection for URLs such that the physical URL of a data-service can change while the public URL stays the same. Secondly, it can alter the syntax of a URL that is a call to a CGI interface, thus rearranging a query to suit the conventions of the data service.

The first feature is useful, but outside the scope of this paper. The second feature might satisfy the need for all data-services to accept queries on standard parameters. That is, the GLU server might be arranged to accept all standard query-parameters and to map them to the syntax of the individual data-services. I have too little experience with GLU to know if it is powerful and flexible enough for this job.
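
The kind of mapping meant here can be sketched as follows. This is not GLU's own dictionary format or API, which I have not examined; the registry layout, the service names, the example.org URLs and the parameter names are all assumptions made for illustration.

```python
from urllib.parse import urlencode

# Hypothetical registry: for each logical service name, the current physical
# URL and the local names that the service uses for the standard parameters.
REGISTRY = {
    "archive-a": {
        "base_url": "http://archive-a.example.org/query",
        "param_names": {"ra": "RA", "dec": "DEC", "radius": "SR"},
    },
    "archive-b": {
        "base_url": "http://archive-b.example.org/cgi-bin/search",
        "param_names": {"ra": "alpha", "dec": "delta", "radius": "r"},
    },
}


def resolve(service, standard_query):
    """Translate a query on standard parameters into a service-specific URL."""
    entry = REGISTRY[service]
    translated = {
        entry["param_names"][name]: value
        for name, value in standard_query.items()
    }
    return entry["base_url"] + "?" + urlencode(translated)


query = {"ra": 10.0, "dec": -5.0, "radius": 0.5}
for service in REGISTRY:
    print(resolve(service, query))
```

A real mapping would also have to convert units and coordinate formats, not just rename parameters, and that is where the flexibility of the tool would be tested.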

Mocha

Mocha, Middleware based On a Code sHipping Architecture from the University of Maryland, is a software system that implements a different way of defining standard quantities in queries and results. Instead of using a de jure definition in a language like XML, it defines quantities by their implementation as Java objects. An object can represent a specific query-parameter if it implements a general interface for query parameters. The actual code for these objects is serialized and passed between programmes in the system. Hence, the queries and results become self-defining at a deep level.

I have not experimented with Mocha; I do not know how well it works. I mention it only as an example of a radically different approach.

Proposed new standards

Quantities

This is a list, probably rather incomplete, of things that might be search constraints on a data grid. It came out of a brain-storming session between N. Walton and G. Rixon.

Observing parameters:

angular resolution of instrument
spectral range
time/date of observation
time resolution
depth of observation: S/N
type of instrument
(u,v) coverage (radio); by extension any coverage in sparse raster
polarization capability (e.g. detects circular polarization)
wavelength/frequency coverage
spectrophotometric system of measurement
spectral resolution

Astrophysical parameters of individual objects:

celestial position
radial velocity
proper motion
parallax
radial distance
redshift
object type (star/galaxy/quasar etc.)
size of object
ellipticity of image of object
variability
temperatures
gravities
element abundances
equivalent widths of lines
spectral types of stars
magnetic data
whether in cluster
morphological class of galaxy
velocity dispersion
brightness: flux-density or magnitude
"redness", "blueness", "steepness of spectrum" etc.

Reduction parameters:

observed/reconstructed quantity (e.g. for broad-band magnitudes)
background-estimation technique in IR reductions

Image quality:

image width (seeing etc.)
aberrations of images
ellipticity of images of unresolved sources

Archive parameters:

proprietary/public
published/not published
distance from search centre
owner of data
creator of data

Query language

I believe that we should define an XML vocabulary to describe queries. It must serve two purposes: to express queries to the grid and to data-services, and to describe possible queries to the user interface so that a human can intervene in the difficult cases.

The queries in the hypothetical language must be transformable into SQL at the data centres. For ease of use, they should also be transformable to ASU in the grid.
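
As a rough sketch of the first transformation, the Python below turns a query written in an invented XML vocabulary into parameterised SQL. The element and attribute names (query, select, constraint, ucd, op, value) are assumptions for this example only, as are the local table and column names; the real vocabulary is still to be defined.

```python
import xml.etree.ElementTree as ET

# Comparison operators allowed in the hypothetical vocabulary.
OPERATORS = {"lt": "<", "le": "<=", "gt": ">", "ge": ">=", "eq": "="}

# A query on standard quantities, identified by UCD rather than by column name.
QUERY_XML = """
<query>
  <select ucd="PHOT_JHN_V"/>
  <select ucd="PHOT_JHN_V_I"/>
  <constraint ucd="PHOT_JHN_V" op="lt" value="15.0"/>
  <constraint ucd="PHOT_JHN_V_I" op="gt" value="0.5"/>
</query>
"""


def to_sql(query_xml, table, column_for_ucd):
    """Return (sql, values) for one data centre's table and column names.

    Constraint values are passed as SQL parameters ("?") instead of being
    pasted into the statement text.
    """
    root = ET.fromstring(query_xml)
    columns = [column_for_ucd[e.get("ucd")] for e in root.findall("select")]
    clauses, values = [], []
    for c in root.findall("constraint"):
        column = column_for_ucd[c.get("ucd")]
        clauses.append(f"{column} {OPERATORS[c.get('op')]} ?")
        values.append(float(c.get("value")))
    sql = f"SELECT {', '.join(columns)} FROM {table}"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, values


# Each data centre supplies its own mapping from UCD to local column names.
local_columns = {"PHOT_JHN_V": "vmag", "PHOT_JHN_V_I": "v_i"}
sql, values = to_sql(QUERY_XML, "stars", local_columns)
print(sql)
print(values)
```

The query itself never mentions a table or column name; those come from the data centre, which is what lets the same query be presented to any archive.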

Priorities

The highest priorities as I see them are enabling queries according to brightness and colour of objects -- i.e. according to the spectral characteristics -- and making it easy to do good overplots of catalogues on images. These were the things most wanted in the portal experiment carried out at CASU in June to October 2001.

Wavelength/brightness domain

I think the quantities we most need to standardize are to do with spectrophotometry. I use "spectrophotometry" to cover SED measurements with both imaging and spectrographic instruments, and I believe that we need common representations for data from both classes of instrument.

We need a standard description of wavelength coverage. This will probably become part of the static description of data resources.

We need a way of transforming all spectrophotometry onto a common scale. Transforming everything to flux density is the obvious first step.
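
A minimal sketch of that first step, in Python, uses the usual relation between a magnitude m and a flux density f, namely f = f0 * 10^(-0.4 m), where f0 is the flux density of a zero-magnitude source in the band. The zero-point used in the example is an approximate, illustrative figure for a Vega-based V band and is an assumption here; a real system would take zero-points from the description of the photometric system.

```python
import math


def magnitude_to_flux_density(magnitude, zero_point_jy):
    """Convert a magnitude to flux density in Jy, given the band zero-point."""
    return zero_point_jy * 10.0 ** (-0.4 * magnitude)


def flux_density_to_magnitude(flux_density_jy, zero_point_jy):
    """Inverse transformation: flux density in Jy back to a magnitude."""
    if flux_density_jy <= 0.0:
        raise ValueError("flux density must be positive")
    return -2.5 * math.log10(flux_density_jy / zero_point_jy)


# Assumed, approximate zero-point for illustration only.
V_ZERO_POINT_JY = 3640.0

flux = magnitude_to_flux_density(15.0, V_ZERO_POINT_JY)
print(flux)
print(flux_density_to_magnitude(flux, V_ZERO_POINT_JY))
```

The function needs the zero-point as an argument, which is exactly why the type of photometry has to be stated alongside the data.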

We need a standard description of the type of spectrophotometric data. Data that are true Sloan photometry, say, need to be labeled as such so that they can be correctly transformed. Data that are derived measures (e.g. Sloan magnitudes synthesized from measured spectra) also need to be distinguished.

Position/size/shape domain

Celestial position is fairly well covered by existing convention. We need to add sufficient standard quantities to support overplotting of arbitrary data on images, as in Aladin, Gaia etc.

Where possible, the overplots should be drawn as ellipses. This means that we need standard quantities to give the position angle, ellipticity, and ellipse width. The width (or half-width or whatever) might not be a physical parameter of the object, but might be derived from magnitude.
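
To show how those three quantities fix the shape to be drawn, here is a small Python sketch that computes the outline of an ellipse in image coordinates. It rests on stated assumptions: ellipticity is taken as 1 - b/a (definitions vary between catalogues), the width is taken as the semi-major axis in pixels, and the position angle is measured counter-clockwise from the image +x axis rather than by any sky convention.

```python
import math


def ellipse_outline(x0, y0, semi_major, ellipticity, position_angle_deg,
                    n_points=64):
    """Return n_points (x, y) pairs tracing the ellipse around (x0, y0)."""
    semi_minor = semi_major * (1.0 - ellipticity)  # assumes e = 1 - b/a
    pa = math.radians(position_angle_deg)
    points = []
    for i in range(n_points):
        t = 2.0 * math.pi * i / n_points
        # Point on the unrotated ellipse, major axis along +x.
        u = semi_major * math.cos(t)
        v = semi_minor * math.sin(t)
        # Rotate by the position angle and shift to the centre.
        x = x0 + u * math.cos(pa) - v * math.sin(pa)
        y = y0 + u * math.sin(pa) + v * math.cos(pa)
        points.append((x, y))
    return points


outline = ellipse_outline(100.0, 50.0, semi_major=10.0, ellipticity=0.5,
                          position_angle_deg=90.0, n_points=4)
for x, y in outline:
    print(round(x, 3), round(y, 3))
```

A plotting client would need the catalogue to say which of these conventions apply, which is the argument for making the quantities standard.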