
Next: Array and Image Plane Computations Up: The Design of AIPS++ AIPS++ Implementation Memo 111 Previous: Astronomical Calibration and Imaging


Table Data System

The underlying raw data for both radio interferometry and single dish is often both quite complex and quite voluminous. Moreover, the access patterns that users and programmers desire are often quite demanding.

AIPS++ has chosen a tabular interface as the fundamental interface to data. Tables in general have been a very successful data type in many astronomical data processing systems (AIPS, IRAF/STSDAS, Midas). Tables are also widely used in FITS for non-image data. The AIPS++ Table interface, as described in the next section, is similar in spirit to these others, although considerably different in detail.

We have separated out from the Table interface exactly how the bytes are staged from disk (or, indeed, from elsewhere). Indeed, a given table can have different parts which are handled separately. The details about this separation are described in the section on Data Management (section 4.2). This separation of the data interface from the details of its implementations allows us to use different (possibly data dependent) I/O strategies, which the user of data need not be aware of (the creator of the data needs to set up the strategies that seem to be appropriate).

Table interface

An AIPS++ table consists of a header, and a main data table.

The main data table consists of a number of rows and columns. A value is stored at the intersection of each row and column. All values in a column must be of the same type.

The header consists of a set of keywords. A keyword is a named value (``keyword=value'' pair). Keywords can either be associated with an entire table (e.g., general information about the observation) or with a particular column (e.g., units for the values in the column).

A value is normally one of the following types (see Virtual columns in section 4.2.2 for a generalization):

A scalar: All the usual types are available: integer (short and long), floating point (single and double precision), complex (single and double), string, binary string, and boolean.
An array: An array of the above scalar types may be a table value. The array may be of any dimensionality and shape.
A table: A table itself may be stored in a column or a keyword. In this way, the table data structure is hierarchical and can directly support groupings (e.g. many-to-many relationships, or attaching a calibration table to a dataset).

Note in particular that any value which may be stored in a column, may also be stored in a keyword. Thus one can, for example, store a rotation matrix in a single keyword rather than having to encode it in multiple keywords.
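The structure described above (table keywords, column keywords, typed columns, and values that may be scalars, arrays, or tables) can be pictured with a small model. This is only an illustrative sketch: `ToyTable` and the column and keyword names are hypothetical and are not the AIPS++ classes.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToyTable:
    """Hypothetical stand-in for a table: keywords plus typed columns."""
    keywords: dict = field(default_factory=dict)         # table-level keyword=value pairs
    column_keywords: dict = field(default_factory=dict)  # column name -> its keywords
    columns: dict = field(default_factory=dict)          # column name -> one value per row

    def nrows(self) -> int:
        return len(next(iter(self.columns.values()), []))

    def add_row(self, **values: Any) -> None:
        if set(values) != set(self.columns):
            raise KeyError("a row needs exactly one value per column")
        # All values in a column must be of the same type; check before storing.
        for name, value in values.items():
            column = self.columns[name]
            if column and type(value) is not type(column[0]):
                raise TypeError(f"column {name!r} holds {type(column[0]).__name__} values")
        for name, value in values.items():
            self.columns[name].append(value)


# A table stored in a keyword (e.g. a calibration table attached to a dataset).
calibration = ToyTable(columns={"ANTENNA": [], "GAIN": []})
calibration.add_row(ANTENNA=1, GAIN=complex(0.9, 0.1))

dataset = ToyTable(
    keywords={
        "ROTATION": [[0.0, -1.0], [1.0, 0.0]],  # an array stored in a single keyword
        "CALIBRATION": calibration,
    },
    column_keywords={"TIME": {"UNIT": "s"}},
    columns={"TIME": [], "VISIBILITY": []},
)
dataset.add_row(TIME=0.0, VISIBILITY=[complex(1, 0), complex(0, 1)])  # array-valued cell
```

The rotation matrix sits in one keyword rather than being spread over several, which is the point made in the paragraph above.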

**Figure 4:** An illustration of table features, including tables and multidimensional arrays as values.

An array or table may be stored either directly or indirectly. A direct array or table is embedded directly in its containing table; when in a column, a direct array or table must have an identical structure on each row.¹² An indirect array is stored externally to the enclosing table, and its shape (and hence dimensionality) may vary from row to row.¹³ Similarly, an indirect table may vary in structure from row to row; moreover, an indirect table may be referred to (indirectly) from multiple tables. Figure 4 illustrates a possible decomposition of VLBA data into AIPS++ Tables.

**Figure 5:** Overview of the main table classes

Figure 5 gives an overview of the main table classes. There are classes used to:

Access the data: The data might be in columns (ScalarColumn, ArrayColumn, SubTableColumn) or in keywords (TableKeywordSet). Alternatively, a column might be viewed as a vector; this is described in section 4.1.3, below.
Describe the table ``layout'': The entire Table is described by the TableDescription, and a particular column by the ColumnDescription.
Iterate through the table: This is described in section 4.1.2, below.
Manage the data: For example, perform I/O to bring the requested data into the user's address space. Data management is described in section 4.2 below.

The structure of a table is described by a table description. A table description can be used both as a template for creating new tables (i.e., with no rows) and for describing the structure of existing tables. Note that when a table description is used as a template, it only describes the minimum of what a table must have; additional columns and keywords may be added.¹⁴
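The two roles of a description, template and minimum requirement, can be sketched as follows. The function names and the use of a plain dictionary as the description are assumptions made for illustration; they are not the TableDescription interface.

```python
def create_from(description):
    """Create a new, empty table (no rows) with the described columns."""
    return {name: [] for name in description}


def conforms(description, column_types):
    """True if a table has at least the described columns, with the described types."""
    return all(column_types.get(name) is value_type
               for name, value_type in description.items())


description = {"TIME": float, "ANTENNA": int}  # hypothetical column names
new_table = create_from(description)

# A table with an extra column still satisfies the description.
extended = {"TIME": float, "ANTENNA": int, "COMMENT": str}
```

Here `conforms(description, extended)` holds because a description states only the minimum, while a table lacking `ANTENNA` would not conform.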

``Virtual'' tables

**Figure 6:** More detailed Table class diagram

It needs to be emphasized that the Table is an interface to data. The actual data may exist on disk in some files. However, it might also exist in some other underlying table objects, or it might be computed on demand when the user requests it.

For example, the data might be stored on disk as 16 bit integers, and ``decompressed'' into floating point for the user. Or, a column might perform an on-the-fly calibration for the user.

Tables where the data are available from files in the normal way are referred to as ``filled'' tables. Tables and columns in which some of the data are computed (or come from some other source) are known as ``virtual'' tables or ``virtual'' columns. This usage of ``virtual'' is probably unfortunate, though descriptive, given the common C++ meaning of that term. The mechanisms by which virtual columns are created are described in section 4.2 below.

Selection and Iteration

A particularly important type of virtual table is one in which all the data is actually in another table. This is known as a reference table. Essentially the reference table has an association with another table, as well as an ordered list of row numbers which map the other table's rows into row numbers of the reference (i.e., virtual) table.

Reference tables are most commonly formed as the result of:

A selection: ``All rows where column `Flux' is >= 0.'' Both a set of C++ classes and a grammar exist for performing selections. Both logical operations and arithmetic are supported.
A sort: The table can be sorted using multiple columns as primary, secondary, tertiary, etc. keys (in ascending or descending order).
Manual specification: Via an array of row numbers and/or column names.

A reference table is thus a new view of an existing table. If the reference table is modified, the underlying data in the original table is changed. While this is normally what is wanted, a reference table may be deepened by making a physical copy if desired.
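The reference-table idea, an ordered list of row numbers into another table, can be sketched in a few lines. `ReferenceView`, `select`, `sort_by`, and `deepen` are hypothetical names chosen for this illustration, and the parent table is modelled as a plain dictionary of columns; the sort handles ascending order only.

```python
def nrows(table):
    return len(next(iter(table.values())))


class ReferenceView:
    """A view holding only row numbers; all data stays in the parent table."""

    def __init__(self, parent, rows):
        self.parent = parent
        self.rows = list(rows)  # view row number -> parent row number

    def get(self, column, row):
        return self.parent[column][self.rows[row]]

    def put(self, column, row, value):
        # Modifying the view changes the underlying data in the original table.
        self.parent[column][self.rows[row]] = value

    def deepen(self):
        """Make a physical copy that no longer shares data with the parent."""
        return {name: [values[r] for r in self.rows]
                for name, values in self.parent.items()}


def select(table, predicate):
    rows = [r for r in range(nrows(table))
            if predicate({name: values[r] for name, values in table.items()})]
    return ReferenceView(table, rows)


def sort_by(table, keys):
    # Earlier keys take precedence: primary, then secondary, and so on.
    rows = sorted(range(nrows(table)),
                  key=lambda r: tuple(table[k][r] for k in keys))
    return ReferenceView(table, rows)


table = {"SOURCE": ["A", "B", "C", "D"], "Flux": [1.5, -0.2, 0.0, 3.0]}
non_negative = select(table, lambda row: row["Flux"] >= 0)
by_flux = sort_by(table, ["Flux"])

non_negative.put("Flux", 1, 9.0)  # writes through to row 2 of the original
copy = non_negative.deepen()
copy["Flux"][0] = -1.0            # the original is unaffected by this
```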

Another important type of virtual table is the iterator table. One often wants to iterate through a table with a ``cursor'' which is a smaller table than the original (fewer rows and columns). Once the iterator is formed, the columns viewed remain constant. However the rows which are seen change as the cursor is moved through the underlying table. Commonly, a table iterator is used to read through data grouping rows in some specified order, for example, all rows with a given time or baseline. Note that the rows which are contiguous in the iterator need not be contiguous in the underlying table.¹⁵
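A table iterator of the kind described above can be approximated by grouping row numbers on the values of the chosen key columns. The sketch below is an assumption-laden illustration (the function name `iterate` and the column names are invented here); it yields, for each key, the row numbers that the cursor would expose.

```python
def iterate(table, key_columns):
    """Yield (key, row_numbers) for each distinct key, in sorted key order."""
    nrows = len(next(iter(table.values())))
    groups = {}
    for row in range(nrows):
        key = tuple(table[name][row] for name in key_columns)
        groups.setdefault(key, []).append(row)
    for key in sorted(groups):
        yield key, groups[key]


table = {
    "TIME": [0, 0, 1, 1, 0],
    "BASELINE": [1, 2, 1, 2, 3],
}
steps = list(iterate(table, ["TIME"]))
```

In this example the first step covers rows 0, 1 and 4: the rows are contiguous as seen through the iterator but not in the underlying table.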

Table vectors

Often one might want to perform calculations using entire columns. One approach would be to merely read the column into a one-dimensional array and then calculate normally using the available functions which calculate on arrays.

However this is somewhat unsatisfying for the following reasons:

While an entire column can be read into an array in one go, it can still be mildly tedious to get out a bunch of columns, compute with them, and put them back.
Copying the data in and out of the columns might be expensive enough that it would instead be necessary to loop over the rows explicitly, losing the clarity of whole-array arithmetic.
Tables might have very many rows (certainly many millions). Thus the temporary arrays used for the calculation might cause out-of-memory problems.

The solution we have chosen is to introduce the TableVector class. It is logically an entire column which can be manipulated as an array (e.g., arithmetic, logical operations, etc.). However, it is not (necessarily) entirely memory resident. For example, the addition of two table vectors would result in a buffer sliding through the table, but this I/O would be entirely hidden from the user.
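The sliding-buffer behaviour can be illustrated with a toy class. This is a sketch under stated assumptions, not the TableVector implementation: `ToyTableVector` is a hypothetical name, and a Python list stands in for a column that would really live on disk.

```python
class ToyTableVector:
    """Whole-column arithmetic that only ever holds one buffer of rows at a time."""

    def __init__(self, column, buffer_rows=4):
        self.column = column            # stands in for a column stored on disk
        self.buffer_rows = buffer_rows
        self.max_resident = 0           # largest buffer read so far (for illustration)

    def _window(self, start):
        window = self.column[start:start + self.buffer_rows]
        self.max_resident = max(self.max_resident, len(window))
        return window

    def __add__(self, other):
        if len(self.column) != len(other.column):
            raise ValueError("table vectors must have the same number of rows")
        result = ToyTableVector([], self.buffer_rows)
        # Slide a buffer through both columns; the caller just writes a + b.
        for start in range(0, len(self.column), self.buffer_rows):
            left = self._window(start)
            right = other._window(start)
            result.column.extend(x + y for x, y in zip(left, right))
        return result


a = ToyTableVector(list(range(10)))
b = ToyTableVector([10] * 10)
total = a + b
```

The user-visible expression is simply `a + b`; the windowing is an internal detail, as the text requires.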

Data Management

**Figure 7:** How columns are attached to Data Managers.

Data is mapped to and from a table interface via data managers. A data manager fundamentally maps ``get'' and ``put'' requests to the implementation data structures (or functions, for virtual columns). Multiple columns are bound to a data manager, and a table may have one or more data managers attached to it. This is an important part of the design: it allows a single table to have multiple types of underlying I/O (presumably tuned for data dependencies) or virtual columns attached. The classes which are involved in attaching columns to data managers are shown in figure 7.
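The binding of columns to data managers can be sketched as a simple dispatch table. All class and column names below are hypothetical; the sketch only shows that one table can mix a storing manager with a computing (virtual) one, and that several columns can share a manager.

```python
class MemoryManager:
    """Stores values; stands in for a manager backed by real storage."""

    def __init__(self):
        self.cells = {}

    def get(self, column, row):
        return self.cells[(column, row)]

    def put(self, column, row, value):
        self.cells[(column, row)] = value


class ComputedManager:
    """Virtual column: values come from a function rather than from storage."""

    def __init__(self, function):
        self.function = function

    def get(self, column, row):
        return self.function(row)

    def put(self, column, row, value):
        raise TypeError(f"column {column!r} is computed and cannot be written")


class BoundTable:
    """Maps each ``get'' and ``put'' to whichever manager the column is bound to."""

    def __init__(self):
        self.managers = {}

    def bind(self, column, manager):
        self.managers[column] = manager

    def get(self, column, row):
        return self.managers[column].get(column, row)

    def put(self, column, row, value):
        self.managers[column].put(column, row, value)


table = BoundTable()
stored = MemoryManager()
table.bind("TIME", stored)        # two columns bound to one manager
table.bind("FLAG", stored)
table.bind("ROW_SQUARED", ComputedManager(lambda row: row * row))
table.put("TIME", 0, 12.5)
```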

While the data management layer is below the level at which table users are required to be knowledgeable, it is a level of which developers need to be aware, particularly those who need to add additional types of virtual columns.

The creator of a table may also need to be aware of the different types of storage and data managers so he can choose the ones which optimize the access that he foresees.

**Figure 8:** Relationships among DataManager classes

Storage management

Data managers which physically store and retrieve values from a storage device are known as storage managers. Besides staging data to and from disk¹⁶, they are responsible for canonicalizing it (in particular, to IEEE Big Endian) so that computers with different word formats can access the data.
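As a small illustration of canonicalization, the sketch below writes double-precision values in IEEE big-endian order regardless of the host's native byte order, using Python's standard `struct` module. The function names are invented for this example and say nothing about how the AIPS++ storage managers are actually written.

```python
import struct


def to_canonical(values):
    """Encode floats as IEEE 754 big-endian doubles, whatever the host byte order."""
    return struct.pack(f">{len(values)}d", *values)


def from_canonical(data):
    """Decode big-endian doubles back into native Python floats."""
    count = len(data) // 8
    return list(struct.unpack(f">{count}d", data))


encoded = to_canonical([1.0, -2.5])
decoded = from_canonical(encoded)
```

Because the on-disk form is fixed, a machine with a different word format decodes the same bytes to the same values.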

There will be several different types of storage managers in AIPS++, each with different properties. The ones which are either presently implemented or which are being implemented are:

AipsIO: AipsIO is a simple I/O system which is used to store object values (described in the class reference manual). AipsIO has no partial buffering: an object is either in memory or on disk. When used as a storage manager, all active columns are stored entirely in memory. Thus AipsIO is most useful for columns which are small enough to be memory resident. An exception to this is columns of indirect arrays, which are only read on demand. Moreover, if only a section of the indirect array is desired, AipsIO will only read in that section. Thus, AipsIO is appropriate for small columns and columns of indirect arrays.
Karma: Karma¹⁷ is a storage manager which is optimized for dealing with data which can be organized as hypercubes, for example interferometric visibilities with a constant number of Stokes parameters, channels, and baselines per timestamp. It is optimized for taking slices in any direction (for example, all visibilities for a range of channels for a given baseline). It achieves these optimizations through use of ``tiling,'' which is a technique in which the data is broken into chunks which may be readily assembled in any direction.
Miriad-like: When data in some columns varies slowly, there is a risk that the total size of the table will become bloated as the slowly varying column becomes replicated over many rows. Another risk is that data which logically belongs together will be split apart for this implementation reason. In AIPS++, we have defined a storage manager based on the ideas in the Miriad software system ([STWss]) to alleviate this problem. With the Miriad storage manager, values are only written when they change. That is, the underlying implementation is list-like. (However, indices are layered on top to make random access reasonably efficient.) (The need for this is also described in section 3.1.1.)
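The benefit of tiling for slices in any direction can be seen by counting the tiles a slice overlaps. The sketch below is a generic illustration of tiling, not a description of how the Karma storage manager is implemented; the function name, array shape, and tile shape are assumptions.

```python
from itertools import product


def tiles_touched(start, stop, tile_shape):
    """Tile coordinates overlapped by the half-open slice [start, stop) on each axis."""
    per_axis = [range(lo // size, (hi - 1) // size + 1)
                for lo, hi, size in zip(start, stop, tile_shape)]
    return list(product(*per_axis))


# 64 channels x 64 baselines stored as 8 x 8 tiles (64 tiles in total).
tile_shape = (8, 8)
all_channels_one_baseline = tiles_touched((0, 5), (64, 6), tile_shape)
one_channel_all_baselines = tiles_touched((5, 0), (6, 64), tile_shape)
```

Both slices overlap the same number of tiles, so neither direction is penalized, whereas a purely row-ordered layout would favour one direction over the other.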

It should be clear that all of the above have different performance and access requirements.¹⁸ This lets the table creator choose tradeoffs that he feels are appropriate. No software to automatically migrate from one storage manager to another exists yet (short of a physical copy of the table).
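The write-on-change idea behind the Miriad-like storage manager, together with an index for random access, can be sketched as follows. `ChangeOnlyColumn` is a hypothetical name and the binary-search index is one plausible choice made for this illustration; the source does not say what kind of index is used.

```python
from bisect import bisect_right


class ChangeOnlyColumn:
    """Stores a value only when it differs from the value of the previous row."""

    def __init__(self):
        self.start_rows = []  # first row at which each stored value applies
        self.values = []      # the stored (changed) values, list-like
        self.nrows = 0

    def append(self, value):
        if not self.values or self.values[-1] != value:
            self.start_rows.append(self.nrows)
            self.values.append(value)
        self.nrows += 1

    def get(self, row):
        if not 0 <= row < self.nrows:
            raise IndexError(row)
        # Binary search over the start rows gives reasonably efficient random access.
        return self.values[bisect_right(self.start_rows, row) - 1]


source = ChangeOnlyColumn()
for name in ["A", "A", "A", "B", "B", "A"]:
    source.append(name)
```

Six rows are represented by three stored values, yet any row can still be read directly.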

Virtual columns

Within this framework, virtual columns may be readily constructed. The only thing that is required is the creation of a so-called VirtualColumnEngine, which is merely a protocol for storing and returning values given a column and a row number.

The first virtual column engine which has been implemented is one in which values of one type are scaled to values of another type via a simple new = old × scale + offset calculation. For relatively low signal-to-noise data, it can make sense to ``compress'' floating point data down to short integers (for example). However, this compression is an optimization that the consumer of the data does not need to be aware of; he just computes normally on his floating point data.
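A scaling engine of this kind can be sketched as a pair of get and put functions keyed by row number. The class name, the 16-bit range check, and the rounding rule are assumptions made for illustration; only the new = old × scale + offset relation comes from the text.

```python
class ScaledColumnEngine:
    """Presents floating point values that are stored as short integers."""

    def __init__(self, scale, offset):
        self.scale = scale
        self.offset = offset
        self.stored = {}  # row number -> integer that would fit in 16 bits

    def put(self, row, value):
        packed = round((value - self.offset) / self.scale)
        if not -32768 <= packed <= 32767:
            raise OverflowError("value does not fit in a 16-bit integer at this scale")
        self.stored[row] = packed

    def get(self, row):
        # new = old * scale + offset
        return self.stored[row] * self.scale + self.offset


engine = ScaledColumnEngine(scale=0.01, offset=5.0)
engine.put(0, 5.25)
engine.put(1, 7.123)
```

The caller reads and writes floating point numbers throughout; the value read back differs from the one written by at most half of `scale`, which is the precision given up for the smaller storage.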

Virtual columns have one capability that filled columns do not: they may contain any type, not just the scalars, arrays of scalars, and tables which may be stored directly.

Status

The design of the AIPS++ Table Data System was initially formulated by A. Farris of the Space Telescope Science Institute, and implemented by G. van Diepen of the NFRA (who also kindly supplied the figures in this section).

The classes and functionality described in this section are entirely implemented with the following exceptions:

Adding a column to an existing table
Tables in columns (tables in keywords are supported).
Integrating the TableVector class with the Lattice class (section 5.1).

The table system saw a complete overhaul in 1994 in response to suggestions from clients of the previous version of the table classes.

Besides finishing the above items, future work will involve such things as I/O optimizations, and improving the ability of end users to directly manipulate tables.
