Version 1.9 Build 1367




Basic Data Values and Objects

Data Values

From the point of view of the mathematics, there are six basic data types that are potential elements of all data objects:

numeric (ordinary numbers, stored as integer, real, or double)
logical (TRUE or FALSE with associated logical operators)
complex (complex number numeric values)
strings (sequences of characters)
NA (a Not Available indicator for data values that plays a general role in mathematics to represent missing data, like the blank pixel concept in astronomy, IEEE NaNs, or the results of indeterminate computations like 0/0)
NULL (a concrete return type for any data value; useful for positively returning no values or for establishing whether values were returned)

The first four types of data values (numeric, complex, logical, and strings) represent four "modes". Additional modes can be defined for higher level constructs. The use of fuzzy numbers that is being investigated for error propagation in AIPS++ will use constructs based upon vectors of numeric or complex type, with special rules for their arithmetic, but will not be discussed further in this document.

The use of NA and NULL at the basic level of data values seems to allow added flexibility in mathematical algorithms, with some methods using, or allowing, these data values, and some methods requiring that NA or NULL not be included in the data objects.
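The distinction between methods that allow NA and methods that forbid it can be sketched in a few lines of Python. This is only an illustration, not AIPS++ code: it assumes NA is modelled by an IEEE NaN and NULL by Python's None, and the names is_na, mean_allow_na, and mean_strict are hypothetical.

```python
import math

NA = float("nan")  # assumption: NA is modelled as an IEEE NaN


def is_na(x):
    """Return True when x is the NA marker."""
    return isinstance(x, float) and math.isnan(x)


def mean_allow_na(values):
    """A method that allows NA: missing values are skipped."""
    kept = [v for v in values if not is_na(v)]
    if not kept:
        return None  # NULL: positively reports that no value was produced
    return sum(kept) / len(kept)


def mean_strict(values):
    """A method that requires the data object to be free of NA."""
    if any(is_na(v) for v in values):
        raise ValueError("NA is not permitted by this method")
    return sum(values) / len(values)


data = [1.0, NA, 3.0]
result = mean_allow_na(data)     # NA is ignored
empty = mean_allow_na([NA, NA])  # nothing left to average, so NULL
```

Returning None rather than NaN in the empty case lets a caller test whether any value was returned at all, which is the role the text gives to NULL.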

Now we can define a data object as an atomic or non-atomic collection of the data values described above. In atomic data objects all data values are of the same mode (numeric, complex, logical, or string), whereas non-atomic data objects are mixtures of atomic data objects with different modes.

Data Objects

The basic data objects are defined by their attributes, with length, mode, dim (dimension), dimname, and class being most fundamental. The following table summarizes the basic mathematical data objects, their attributes, and their role.

                      Table 1 - Data Objects
Class   Atomic          Attributes                 Role
Mvector  T      Length  Mode   Dim  Dimname        Most basic data object
Mmatrix  T      Length  Mode   Dims Dimnames       Rows/columns of vectors
Marray   T      Length  Mode   Dims Dimnames       N-dimensional array
Mlist    F      Length  Mode   Names               Ordered collection
                                                   of data objects
Mtable   F      Length  Mode  Names  Row.Names     Generalized table with
                                                   columns of numeric, logical
                                                   or character data values
Factor   F      Length  Mode   Names  Levels       Qualitative identification
                                                   and labeling of data
Grid     F      Length  Mode  Dims Dimnames Coords N-dimensional array
                                                   with even axis intervals

The first three data objects in Table 1 are augmentations of the classes already developed for AIPS++, with the specific addition of Dimname attributes for each dimension. These are named mvector, mmatrix, and marray to distinguish them from the AIPS++ classes that have already been implemented. Dimnames allow simple assignment expressions between these data objects and mtables, and allow all data objects to have vectorized selection/logic based upon keywords.
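As a rough illustration of keyword-based selection through dimnames, the following Python sketch attaches a name to each element of a vector. The class name MVector and its select method are hypothetical stand-ins, not the AIPS++ interface.

```python
class MVector:
    """Hypothetical vector whose elements carry dimnames (keywords)."""

    def __init__(self, data, dimnames):
        if len(data) != len(dimnames):
            raise ValueError("one dimname is required per element")
        self.data = list(data)
        self.dimnames = list(dimnames)

    def __getitem__(self, key):
        # A string selects by keyword; anything else is a positional index.
        if isinstance(key, str):
            return self.data[self.dimnames.index(key)]
        return self.data[key]

    def select(self, keys):
        """Return a new vector holding the elements named in keys."""
        return MVector([self[k] for k in keys], keys)


header = MVector([4.0, 1950.0, 2.0e-4], ["NAXIS", "EPOCH", "SCALE"])
epoch = header["EPOCH"]                    # selection by keyword
subset = header.select(["SCALE", "NAXIS"])  # keyword-based sub-vector
```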

The mtable data object is a specific view, or form, of an AIPS++ table that allows one to easily compose and decompose it from/to other data objects using methods related to mvector, mmatrix, marray, and other mtable data objects. As with an AIPS++ table, an mtable can contain columns of any of the other atomic data objects.

While many of these data objects will be small enough to fit into memory, in many cases of interest that will not be true. The use of buffered I/O, however, is a prime example of an implementation detail that should be hidden as part of the general data base manager for objects of all kinds.

The mlist data objects are associations of other data object components that are formed by an mlist(o1, o2, ..., oN) method, and which have a syntax allowing mathematical operations on component and sub-component data objects. The mlist data object allows simple association of related data objects resulting from methods or more complicated multi-object algorithms, without requiring construction of new kinds of data objects, since they are just different mlists of standard data objects. For example, one can map an AIPS FITS image into an image object with the method mlist(labels=labelvector, values=valuevector, axes=axesmatrix, pixels=imagearray), where labels is a vector of string-like header information, values is a vector of the global numeric information for the image, axes is a matrix of numbers describing the values (and state) of the image coordinates, and pixels is the array of numbers which contains the image values. Because of the dimnames attributes of vectors and arrays, the keyword=value syntax of FITS images maps directly into vector, matrix, and array data objects. The data object components of observations, measurement sets, telescope models, etc., can be formed, referred to, and operated on with a combination of the syntax of mlist and mtable methods. The following is an example of a listing of contents of an mlist image data object derived from AIPS/FITS:

labels:
                    NAME
OBJECT          "NCYG92"
TELESCOP           "VLA"
INSTRUME           "VLA"
OBSERVER "R.M.HJELLMING"
   UNITS       "JY/BEAM"

values:
          VALUE
 NAXIS 4.00E+00
 EPOCH 1.95E+03
 SCALE 2.00E-04
OFFSET 0.00E+00
 BLANK 0.00E+00

axes:
         RA.SIN.DEG  DEC.SIN.DEG      FREQ.HZ
  DIM  1.750000E+02 1.750000E+02  1.00000E+00
CRBLC  1.600000E+02 1.600000E+02  1.00000E+00
CRVAL  3.072800E+02 5.246208E+01  2.24851E+10
CRINC -6.944444E-07 6.944444E-07 -5.00000E+07
CRREF -2.830000E+00 0.000000E+00  0.00000E+00

pixels:

[matrix of numbers]

Flexible mlist construction may be more of a UI-related operation because of the difficulties of implementation in C++.
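The association described above can be approximated in Python with an ordered mapping of named components. This is a sketch under stated assumptions: the mlist function here is a hypothetical stand-in built on dict, only a few entries of the listing are copied, and the pixel array is a placeholder rather than real image data.

```python
def mlist(**components):
    """Hypothetical mlist: an ordered association of named data objects."""
    return dict(components)


# A few entries taken from the listing above, keyed by their dimnames.
labels = {"OBJECT": "NCYG92", "TELESCOP": "VLA", "UNITS": "JY/BEAM"}
values = {"NAXIS": 4.0, "EPOCH": 1950.0, "SCALE": 2.0e-4}
axes = {
    "RA.SIN.DEG": {"DIM": 175.0, "CRVAL": 307.28, "CRINC": -6.944444e-07},
    "DEC.SIN.DEG": {"DIM": 175.0, "CRVAL": 52.46208, "CRINC": 6.944444e-07},
}
pixels = [[0.0, 0.0], [0.0, 0.0]]  # placeholder, not real image values

image = mlist(labels=labels, values=values, axes=axes, pixels=pixels)

# Multi-level "subscript" access to a sub-component of the image object.
ra_increment = image["axes"]["RA.SIN.DEG"]["CRINC"]
```

No new kind of object is needed for the image; it is simply one particular mlist of standard components, which is the point made in the text.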

Factor data objects allow useful identification of qualitative descriptions of data that can be utilized by logical operations in array-oriented or vectorizable algorithms. Each is essentially a vector of integers identifying levels, with an associated vector of names for each level. Constructs like this can be used for many things, e.g. data quality identification, weights, source/field identification, and so on. Factors are concrete classes that are a bridge between numeric arrays and keyword identification used in vectorized operations. An example of a factor object is the following, where the data in the object is a vector of integers identifying different "levels" in the object, and levels is a vector of strings indicating that each level identifies a source name:

values: 3,3,3,3,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1

levels: "2005+403", "NCYG92", "3C286"

This vector matches, say, a row of source data, so one can operate on the source data using selection methods attached to the factor data object.
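A minimal Python sketch of such a factor follows. It assumes that the integer codes are 1-based indices into the levels vector (the text does not state the base), uses a shortened code vector, and pairs it with invented flux values purely for illustration; the class and method names are hypothetical.

```python
class Factor:
    """Hypothetical factor: integer codes plus a vector of level names."""

    def __init__(self, codes, levels):
        self.codes = list(codes)    # assumption: 1-based indices into levels
        self.levels = list(levels)

    def names(self):
        """Expand the codes into their level names."""
        return [self.levels[c - 1] for c in self.codes]

    def select(self, level):
        """Logical vector that is True where an element has this level."""
        code = self.levels.index(level) + 1
        return [c == code for c in self.codes]


sources = Factor([3, 3, 1, 1, 2, 2, 2], ["2005+403", "NCYG92", "3C286"])
flux = [7.1, 7.3, 2.0, 2.1, 0.4, 0.5, 0.6]  # invented data, one per code

mask = sources.select("NCYG92")
ncyg_flux = [f for f, keep in zip(flux, mask) if keep]
```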

The grid data object is one of the most interesting, and a major component of many applications, while retaining entirely mathematical characteristics and methods. In the astronomical context it has the mathematical essence of a time series, a spectrum, a gridded u-v array, and a regular image. In AIPS++ the work on GridTool and FFTtool has developed some of the methods for grid data objects. As a higher level component of the data objects in Table 1, grid is an mvector, mmatrix, or marray, with the added attribute of coord = (begin, end, interval, units), and methods for extracting selected coordinate information and scaling coordinate values. Methods of the grid data objects will use linear equations involving the elements of the coord attribute and the range of indices for each axis.
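The linear relation between indices and coordinates can be sketched as follows for a single axis. This is an illustration only: it assumes 0-based indices and the relation coordinate = begin + index * interval, and the class name GridAxis is hypothetical.

```python
class GridAxis:
    """Hypothetical evenly spaced axis carrying a coord attribute."""

    def __init__(self, begin, interval, length, units):
        self.begin = begin
        self.interval = interval
        self.length = length
        self.units = units
        # The end value follows from the other elements of coord.
        self.end = begin + (length - 1) * interval

    def coord(self, index):
        """Coordinate value of an element (assumes 0-based indices)."""
        if not 0 <= index < self.length:
            raise IndexError("index outside the axis range")
        return self.begin + index * self.interval

    def index(self, value):
        """Nearest element index for a coordinate value."""
        return round((value - self.begin) / self.interval)


time_axis = GridAxis(begin=10.0, interval=0.5, length=5, units="s")
last = time_axis.end            # coordinate of the final element
middle = time_axis.coord(3)     # index to coordinate
where = time_axis.index(11.5)   # coordinate back to index
```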

It is possible that units for pixels should be additional grid attributes, since they affect the scales of stored numeric values, but it is clear that other astronomical content, like representation, reference frame, measure, etc., belongs to higher level classes. The dimnames attribute of grid objects allows keyword identification of each dimension. This may be sufficient for including units since one can use dimnames with labels like "u.nanosecond". Later we will discuss a more extensive list of methods for grid data objects.

The evolution of the design and implementation of the framework of classes for images is where further isolation of a possible grid class should be examined, ensuring that it has methods that are generically mathematical and independent of the image-handling problem.

Reinforcing the view that a mathematically defined grid data object is a powerful construct for algorithms, it is possible to use FFT and related methods in S/S+ to produce, and operate on, their matrix and time series objects to represent, and make transformations between, gridded u-v data and images. The emphasis on optimizing for mathematical operations wherever possible leads to methods like ifelse(logicalexpr, expr1, expr2), where a logical operation on an atomic data object results in a choice between two expressions, expr1 and expr2, for the element-by-element operation involving that object, depending upon the result of the logical operation for each element. This type of mathematical operation (and the analogous if, switch, all, and any methods, cf. Table 2) is part of the reason for both factor data objects and the association of dimnames with vectors, matrices, and arrays, since names as keywords can then be used in logical operations. The current work on masked arrays is related to the development of these sorts of vectorized logic operations.
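The element-by-element choice made by ifelse can be sketched in Python as below. The function is a hypothetical stand-in operating on plain lists; it assumes that the logical vector and both alternatives have the same length.

```python
def ifelse(test, yes, no):
    """For each element choose from yes where test is true, else from no."""
    if not (len(test) == len(yes) == len(no)):
        raise ValueError("all three arguments must have the same length")
    return [y if t else n for t, y, n in zip(test, yes, no)]


flux = [0.2, -0.1, 0.5, -0.3]
is_positive = [v > 0 for v in flux]   # logical data object

# Keep positive values and replace the others with zero.
clipped = ifelse(is_positive, flux, [0.0] * len(flux))

every = all(is_positive)  # TRUE only if every element is TRUE
some = any(is_positive)   # TRUE if at least one element is TRUE
```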

Formation, Testing, and Coercion of Data Objects

Formation of more complicated data objects from simpler ones, and vice versa, should be possible with simple syntax that hides the vector, matrix, etc., nature of the objects. Vector data objects can be formed by a sequence method, a repetition method, or a combine(o1, ..., oN) method. Vectors should be formable into matrices with methods like rbind(o1, ..., oN) (for rows) and cbind(o1, ..., oN) (for columns). All data objects should be formable into mlist data objects by the mlist(o1, ..., oN) method, and atomic data objects formed into mtable objects with an mtable(o1, ..., oN) method. When implemented in C++ the first argument for each method will be the number of elements in each object list. The reverse extraction of simpler data objects from more complicated ones is a question of extraction based upon some multi-level "subscript" notation.
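The formation methods named above might look like the following in Python. This is a sketch with plain lists standing in for mvector and mmatrix objects; the signatures (for example the step argument of sequence and the times argument of rep) are assumptions, not the proposed AIPS++ interface.

```python
def combine(*objects):
    """Flatten scalars and vectors into a single vector."""
    out = []
    for obj in objects:
        if isinstance(obj, (list, tuple)):
            out.extend(obj)
        else:
            out.append(obj)
    return out


def sequence(start, stop, step=1):
    """Vector running from start to stop (inclusive) in increments of step."""
    count = int((stop - start) // step) + 1
    return [start + i * step for i in range(max(count, 0))]


def rep(obj, times):
    """Vector that repeats the elements of obj the given number of times."""
    return list(obj) * times


def rbind(*vectors):
    """Matrix in which each input vector becomes a row."""
    if len({len(v) for v in vectors}) > 1:
        raise ValueError("all vectors must have the same length")
    return [list(v) for v in vectors]


def cbind(*vectors):
    """Matrix in which each input vector becomes a column."""
    return [list(row) for row in zip(*rbind(*vectors))]


v1 = sequence(1, 3)
v2 = rep([0], 3)
by_rows = rbind(v1, v2)
by_cols = cbind(v1, v2)
joined = combine(1, [2, 3], 4)
```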

Testing and coercion are useful concepts for mathematical handling of different, but relatable, data objects. Each data object can have an is.objecttype(object) method that tests for what is needed in some mathematical expression or algorithm, returning TRUE or FALSE, and an as.objecttype(object) method that returns a different type of data object that can be formed from the input object. The testing method is useful in algorithms. The coercion method aids extraction of one type of data object (e.g. mmatrix) from others (e.g. mtable).
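A small Python sketch of the testing and coercion pair follows. It assumes, for illustration only, that an mtable is a mapping from column names to equal-length columns and that an mmatrix is a list of equal-length numeric rows; is_mmatrix and as_mmatrix are hypothetical names mirroring is.objecttype and as.objecttype.

```python
def is_mmatrix(obj):
    """Test whether obj is a non-empty list of equal-length numeric rows."""
    return (
        isinstance(obj, list)
        and len(obj) > 0
        and all(isinstance(row, list) for row in obj)
        and len({len(row) for row in obj}) == 1
        and all(isinstance(x, (int, float)) for row in obj for x in row)
    )


def as_mmatrix(table):
    """Coerce a table (mapping of column name to column) into a matrix."""
    columns = list(table.values())
    return [list(row) for row in zip(*columns)]


table = {"time": [0.0, 1.0, 2.0], "flux": [2.5, 2.7, 2.6]}

before = is_mmatrix(table)   # a table is not yet a matrix
matrix = as_mmatrix(table)   # one matrix row per table row
after = is_mmatrix(matrix)
```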

The Formula Class

The need for using arbitrary formulas or equations in the fitting or modeling of data is obvious, and a construct that could be useful for supplying formulas to methods, producing strings in the form of formulas in labels, and doing some symbolic algebra and symbolic evaluation, is the formula class. As a data object it takes as input a string of characters identifying arithmetic operations, variable names, and parameters to be determined. It has basic symbolic algebra methods like substitute, parse, expression, derivative, evaluate, etc., that allow mathematical expression, decomposition, manipulation, and use in evaluation of symbolic expressions to return values for the quantities modeled by the formula(s).
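To make the idea concrete, here is a minimal Python sketch of a formula object built on the standard ast module. It is not the proposed formula class: the class and method names are hypothetical, the formula is written in Python expression syntax, and the derivative is a numerical central difference swapped in for the symbolic derivative the text has in mind.

```python
import ast


class Formula:
    """Hypothetical formula object built from a string expression."""

    def __init__(self, text):
        self.text = text
        self.tree = ast.parse(text, mode="eval")          # parse
        self.code = compile(self.tree, "<formula>", "eval")

    def names(self):
        """Variable and parameter names that appear in the formula."""
        return sorted({node.id for node in ast.walk(self.tree)
                       if isinstance(node, ast.Name)})

    def evaluate(self, **bindings):
        """Evaluate with the given values; builtins are not exposed."""
        return eval(self.code, {"__builtins__": {}}, dict(bindings))

    def derivative(self, name, step=1e-6, **bindings):
        """Numerical (not symbolic) derivative with respect to name."""
        up = dict(bindings)
        down = dict(bindings)
        up[name] += step
        down[name] -= step
        return (self.evaluate(**up) - self.evaluate(**down)) / (2 * step)


model = Formula("a * x + b")
symbols = model.names()
value = model.evaluate(a=2.0, x=3.0, b=1.0)
slope = model.derivative("x", a=2.0, x=3.0, b=1.0)
```

Evaluating strings with eval is acceptable for a sketch but should not be applied to untrusted input.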

Methods for Atomic Data Objects

All the ordinary operator-like operations involving vectors and matrices are assumed to be present, with * used for element by element multiplication as done with the AIPS++ Array class. Using a crossprod method for M x V and M x M, with M and V (or M) as arguments, is reasonable. However, in a mathematical system there are distinctions based upon whether the vector or matrix is a transpose or not. These can be checked at run time based upon a transpose or non-transpose identification, or left as a potential programmer error. Probably it is best to expect the application programmer to write tran(V) or tran(M) when mathematically required.
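The difference between element by element multiplication and the M x V product, and the role of an explicit tran, can be sketched in Python as follows. Plain lists stand in for the data objects, the function names follow the text but their signatures are assumptions, and the shape check illustrates the run-time option mentioned above.

```python
def elementwise(a, b):
    """Element by element product of two vectors (the role of *)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [x * y for x, y in zip(a, b)]


def tran(m):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def crossprod(m, v):
    """M x V product; shapes are checked at run time."""
    if any(len(row) != len(v) for row in m):
        raise ValueError("matrix columns must match vector length")
    return [sum(x * y for x, y in zip(row, v)) for row in m]


m = [[1, 2, 3],
     [4, 5, 6]]                        # 2 x 3 matrix
v = [1, 0, 1]
w = [1, 1]

product = crossprod(m, v)              # (2 x 3) times a 3-vector
transposed = crossprod(tran(m), w)     # explicit tran makes shapes agree
squares = elementwise(v, v)
```

Calling crossprod(m, w) without tran raises an error here, which is one way of catching what the text calls a potential programmer error.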

Table 2 lists methods for various atomic data objects, with V, M, A, and G indicating whether they apply to mvector, mmatrix, marray, and/or grid data objects, and N, C, and/or L indicating whether they are applicable to numeric, character, and/or logical data values.

                           Table 2
             Methods of Atomic, Numeric Data Objects
Method      Obj  Val Description
combine     V    NC  combine mlist of numbers into a vector
rep         V    NC  form vector replicating mlist of numbers
sequence    V    N   form vector from vstart to vstop using optional step
                     or length parameters
tran        VMAG N   transpose
diag        V    N   form diagonal matrix with input vector on
                     diagonal
rbind       V    NCL form matrix from mlist of vector objects with
                     each vector becoming a row
cbind       V    NCL form matrix from mlist of vector objects with
                     each vector becoming a column
sort        V    NC  sort vector on elements 
reverse     V    NC  reverse elements of vector (often after sorting)
order       V    NC  return integer vector containing the permutation
                     that will sort the input into ascending order
rank        V    N   returns a vector with ranks of the input vector
diff        VMAG N   returns a VMAG with the differences between
                     adjacent elements of the input data object
unique      V    NC  returns an object like the input but with 
                     repeated values deleted
duplicated  V    NC  returns a vector of logical values for an input object
                     indicating whether elements are duplicated or not
sum         VMAG N   returns the sum of all elements of input object
prod        V    N   returns the product of all elements of input object
max         VMAG N   returns largest value in input object
min         VMAG N   returns smallest value in input object
range       VMAG N   returns vector of smallest and largest values
all         VMAG L   returns TRUE if all elements of input logical 
                     expression(s) are TRUE, returns FALSE otherwise
any         VMAG L   evaluates to TRUE if any elements of input
                     logical expression are TRUE, returns FALSE otherwise
if          VMAG L   evaluate an expression for each element if a
                     logical expression for each element is true
ifelse      VMAG L   depending upon logical expression evaluation for
                     each element, performs one of two operations on
                     or with each element
switch      VMAG N   depending upon the integer returned by an
                     expression one of a series of expressions is
                     used to return a value for each element
                     of the input data object
apply       VMAG N   Apply a function defined by a formula object
                     to all elements of the input data object
outer       VMAG N   Apply a function defined by a formula object
                     to two input data objects with the same shape
mean        VMAG N   returns mean of all elements of data object,
                     optional trim parameter specifying range of
                     values to be averaged
median      VMAG N   returns median of all elements of data object,
                     optional trim parameter specifying range of
                     values to be considered
quantile    VMAG N   returns vector of desired probability levels
                     for a data object, as determined by optional input
                     vector of desired probabilities
var         VM   N   returns variance of data object (for optionally
                     specified range of values); if a matrix, columns
                     represent variables and rows represent measurements
cor         M    N   return correlation matrix for optional range of values
cov         M    N   return covariance matrix for optional range of values
round       VMAG N   return for each element the integer above or
                     below value + 0.5
signif      VMAG N   return for each element a value with rounding in
                     the specified significant figure
cumsum      VMAG N   returns an object for which each element
                     is the sum of all elements to that point
cumprod     VMAG N   returns an object for which each element
                     is the product of all elements to that point
distrib     VMAG N   returns for each element a value of a named
                     probability distribution over an optional range of values
fft         VMAG N   transform a real or complex data object by a
                     direct or inverse FFT 
autocorr    VMAG N   return autocorrelation function of data object
lag         VMAG N   return same object with data lagged by specified
                     intervals in one or more dimensions
                     (mainly for case of a time-like dimension)
convolve    VMAG N   convolve a function specified by a formula object
                     with a specified span producing a smoothed version
                     of the original data object
aggregate   VMAG N   convolve, average, or smooth one or more
                     dimensions to a data object with a reduced 
                     number of data points spanning the same range
subset      VMAG N   return a subsection of a data object based upon
                     a range specification for each dimension
coord       G    N   return coordinate values for specified elements
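A few of the vector methods in Table 2 can be sketched in Python to show the intended behaviour. Plain lists stand in for mvector objects, and two conventions are assumptions rather than statements of the design: order returns 0-based indices, and rank breaks ties by position.

```python
from itertools import accumulate


def order(v):
    """Permutation (0-based indices) that sorts v into ascending order."""
    return sorted(range(len(v)), key=lambda i: v[i])


def rank(v):
    """Rank of each element, starting at 1; ties are broken by position."""
    ranks = [0] * len(v)
    for position, i in enumerate(order(v)):
        ranks[i] = position + 1
    return ranks


def diff(v):
    """Differences between adjacent elements."""
    return [b - a for a, b in zip(v, v[1:])]


def cumsum(v):
    """Running sum of the elements."""
    return list(accumulate(v))


def duplicated(v):
    """Logical vector marking elements already seen earlier in v."""
    seen = set()
    flags = []
    for x in v:
        flags.append(x in seen)
        seen.add(x)
    return flags


def unique(v):
    """Elements of v with repeated values deleted."""
    return [x for x, dup in zip(v, duplicated(v)) if not dup]


v = [3, 1, 2, 1]
permutation = order(v)
sorted_v = [v[i] for i in permutation]
```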

Many of these methods reflect a considerable emphasis on vectorizable operations, so that one can express mathematics with operations that are accomplished as efficiently as possible by internal mechanisms hidden from the programmer.

The basic constructor methods (mvector, mmatrix, grid, etc.) have obvious use and syntax, and some of the operations in Table 2 reflect other ways of constructing these objects. The is.class and as.class methods for testing and coercing classes are important for specific use and decomposition of mathematical objects. Methods like assigndim and assigndimnames are needed as part of the composition of data objects.
