IncrementalStMan.h
Classes
- IncrementalStMan -- The Incremental Storage Manager (full description)
Interface
- Public Members
- explicit IncrementalStMan (uInt bucketSize = 0, Bool checkBucketSize = True, uInt cacheSize = 1)
- explicit IncrementalStMan (const String& dataManagerName, uInt bucketSize = 0, Bool checkBucketSize = True, uInt cacheSize = 1)
- ~IncrementalStMan()
- Private Members
- IncrementalStMan (const IncrementalStMan& that)
- IncrementalStMan& operator= (const IncrementalStMan& that)
Review Status
- Reviewed By:
- UNKNOWN
- Date Reviewed:
- before 2004/08/25
- Programs:
- Tests:
Prerequisite
Etymology
IncrementalStMan is the data manager storing values in an incremental way
(similar to an incremental backup). A value is only stored when it
differs from the previous value.
Synopsis
IncrementalStMan stores the data in a way that a value is only stored
when it is different from the value in the previous row. This storage
manager is very well suited for columns with slowly changing values,
because the resulting file can be much smaller. It is not suited at
all for columns with continuously changing data.
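The store-on-change idea can be illustrated with a toy model. The sketch below is not casacore code: ToyIncrementalColumn and its members are hypothetical names, and it handles only one column of doubles that is filled sequentially. It keeps one entry per change and finds the value of a row with a binary search over the start rows.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model of one column (hypothetical; not the casacore implementation).
// Rows are appended sequentially; a value is stored only when it differs
// from the value of the previous row.
class ToyIncrementalColumn {
public:
    // Append one row holding 'value'.
    void append(double value) {
        if (starts_.empty() || values_.back() != value) {
            starts_.push_back(nrows_);   // first row that has this value
            values_.push_back(value);
        }
        ++nrows_;
    }

    // Value of 'row': the last stored entry whose start row is <= row.
    double get(std::size_t row) const {
        assert(row < nrows_);
        auto it = std::upper_bound(starts_.begin(), starts_.end(), row);
        return values_[static_cast<std::size_t>(it - starts_.begin()) - 1];
    }

    std::size_t nrows() const { return nrows_; }
    std::size_t nstored() const { return values_.size(); }

private:
    std::vector<std::size_t> starts_;  // start row of each stored value
    std::vector<double> values_;       // the stored values
    std::size_t nrows_ = 0;
};
```

In this model a column of 10 rows whose value changes once, after row 4, needs only 2 stored values instead of 10.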
In general it can be advantageous to use this storage manager when
a value changes no more often than once every 4 rows (although it
depends on the length of the data values themselves). The following
simple example shows the approximate savings that can be achieved when
storing a column with double values changing every CH rows.
#rows    CH    normal length    ISM length    compress ratio
50000      5         4000000       1606000               2.5
50000     50         4000000        164000              24.5
50000    500         4000000         32800             122
There is a special test program nISMBucket in the Tables module
doing a simple, but usually adequate, simulation of the amount of
storage needed for a scenario.
IncrementalStMan stores the values (and associated indices) in
fixed-length buckets. A BucketCache object is used to read/write
the buckets. The default cache size is 1 bucket (which is fine for
sequential access), but for random access it can make sense to
increase the size of the cache. This can be done using the class
ROIncrementalStManAccessor.
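The effect of the cache size can be illustrated with a small simulation. The sketch below does not use the BucketCache API: countBucketReads is a hypothetical helper, every bucket is assumed to hold the same number of rows, and a least-recently-used replacement policy is assumed, which the text does not specify. It counts how often a bucket has to be read for a given sequence of row accesses.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

// Count how many bucket reads a sequence of row accesses needs when at most
// 'cacheSize' buckets are kept in memory (least recently used is evicted).
// Hypothetical model: every bucket holds 'rowsPerBucket' consecutive rows.
std::size_t countBucketReads(const std::vector<std::size_t>& rows,
                             std::size_t rowsPerBucket,
                             std::size_t cacheSize) {
    assert(rowsPerBucket > 0 && cacheSize > 0);
    std::list<std::size_t> cache;  // most recently used bucket at the front
    std::size_t reads = 0;
    for (std::size_t row : rows) {
        const std::size_t bucket = row / rowsPerBucket;
        auto it = std::find(cache.begin(), cache.end(), bucket);
        if (it != cache.end()) {
            cache.erase(it);       // hit: re-inserted at the front below
        } else {
            ++reads;               // miss: the bucket must be read
            if (cache.size() == cacheSize) cache.pop_back();
        }
        cache.push_front(bucket);
    }
    return reads;
}
```

With 100 rows per bucket, reading rows 0 to 399 in order needs 4 reads with a cache of 1 bucket. Alternating between rows 0 and 100 ten times needs 10 reads with a cache of 1 bucket, but only 2 reads with a cache of 2 buckets.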
The IncrementalStMan can hold values of any standard data type (thus
from Bool to String). It can handle scalars, direct and indirect
arrays. It can support an arbitrary number of columns. The values in
each column can change at their own rate.
A bucket contains the values of several consecutive rows.
At the beginning of a bucket the values of the starting row of all
columns for this storage manager are repeated. In this way the value
of a cell can always be found in the bucket and no references
to previous buckets are needed.
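Why repeating the starting values makes a bucket self-contained can be sketched as follows. ToyBucket and valueInBucket are hypothetical names, and the layout is an illustration only, not the actual on-disk format. A lookup starts from the repeated starting value of the column and applies the changes stored in the same bucket up to the requested row.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Toy bucket (hypothetical layout, not the casacore on-disk format).
struct ToyBucket {
    std::size_t startRow = 0;   // first row covered by this bucket
    std::size_t nrows = 0;      // number of consecutive rows covered
    // Value of every column at startRow, repeated even when it is
    // unchanged with respect to the previous bucket.
    std::vector<double> startValues;
    // Per column: (row, new value) for each later change, in row order.
    std::vector<std::vector<std::pair<std::size_t, double>>> changes;
};

// Find the value of (column, row) using this bucket only.
double valueInBucket(const ToyBucket& b, std::size_t column, std::size_t row) {
    assert(column < b.startValues.size());
    assert(row >= b.startRow && row < b.startRow + b.nrows);
    double value = b.startValues[column];
    if (column < b.changes.size()) {
        for (const auto& change : b.changes[column]) {
            if (change.first > row) break;   // change lies after 'row'
            value = change.second;
        }
    }
    return value;
}
```

Because the starting values are always present, the lookup never has to open an earlier bucket, even for a column that did not change anywhere in this bucket.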
A bucket should be big enough to hold all starting values and
a reasonable number of other values. As a rule of thumb it should be
big enough to hold at least 100 values of each column. In general the
default bucket size will do. Only in special cases (e.g. when storing
large variable length strings) should the bucket size be set explicitly.
Giving a zero bucket size means that a suitable default bucket size
will be calculated.
When a table is filled sequentially each bucket can be filled as
much as possible. When writing in a random way, buckets can contain
some unused space, because a bucket in the middle of the file
has to be split when a new value has to be put in it.
Each column in the IncrementalStMan has the following properties to
achieve the "store-different-values-only" behaviour.
- When a row is not explicitly put, it has the same value as the
previous row.
The first row gets the standard undefined values when not put.
The order of puts and addRows is not important.
E.g. when a table has N rows and row N and the following M rows
have the same value, the following two schematic sequences have the
same effect:
add 1 row; put value in row N; add M rows;
add M+1 rows; put value in row N;
- When putting a scalar or direct array, it is tested if it matches
the previous row. If so, it is not stored again.
This test is not done for indirect arrays, because those can
be (very) big and it would be too time-consuming. So the only
way to save space for indirect arrays is by not putting them
as explained in the previous item.
- For indirect arrays the buckets contain a pointer only. The
arrays themselves are stored in a separate file.
- When a value of an existing row is updated, only that one row is
updated. The next row(s) keep their value, even if it was
shared with the row being updated.
For scalars and direct arrays it will be tested if the
new value matches the value in the previous and/or next row.
If so, those rows will be combined to save storage.
- The IncrementalStMan is optimized for sequential access to a table.
- A bucket is accessed only once, because a bucket contains
consecutive rows.
- For each column a copy is kept of the last value read.
So the value for the next rows (with that same value)
is immediately available.
For random access the performance can be improved by setting
the cache size using class ROIncrementalStManAccessor.
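The put and update rules listed above can be modelled for a single column of doubles. The sketch below is not the casacore implementation: ToyIsmColumn and its members are hypothetical names, and 0 is used as the undefined value. The text does not spell out how "a row that is not put follows the previous row" is reconciled with "an update changes only that one row". The sketch assumes that a put in a row beyond the last row ever put also applies to all following rows, while a put in an earlier row changes that row only.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>

// Toy model of the put/update rules for one column (hypothetical).
class ToyIsmColumn {
public:
    ToyIsmColumn() { runs_[0] = 0.0; }  // row 0 defaults to "undefined" (0 here)

    void addRows(std::size_t n) { nrows_ += n; }

    // A row has the value of the last run starting at or before it.
    double get(std::size_t row) const {
        assert(row < nrows_);
        return std::prev(runs_.upper_bound(row))->second;
    }

    void put(std::size_t row, double value) {
        assert(row < nrows_);
        // Assumption: rows at or beyond 'nextUnput_' were never put.
        const bool beyondLastPut = row >= nextUnput_;
        if (beyondLastPut) nextUnput_ = row + 1;
        const double old = get(row);
        if (old == value) return;        // matches: nothing new is stored
        // Updating an earlier row: the following rows keep their value.
        if (!beyondLastPut && row + 1 < nrows_) runs_.emplace(row + 1, old);
        runs_[row] = value;
        // Combine with the next row(s) when they hold the same value.
        auto next = runs_.find(row + 1);
        if (next != runs_.end() && next->second == value) runs_.erase(next);
        // Combine with the previous row(s) when they hold the same value.
        if (row > 0 && get(row - 1) == value) runs_.erase(row);
    }

    std::size_t nstored() const { return runs_.size(); }

private:
    std::map<std::size_t, double> runs_;  // start row -> value
    std::size_t nrows_ = 0;
    std::size_t nextUnput_ = 0;           // one past the last row ever put
};
```

In this model the two schematic sequences from the first item give the same column contents, and updating a row in the middle of a run of equal values leaves the rows after it unchanged.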
This class contains many public functions which are only used
by other ISM classes. The only useful function for the user is the
constructor.
Motivation
IncrementalStMan can save a lot of storage space.
Unlike the old StManMirAIO it stores the values directly in the
file to save on memory usage.
Example
This example shows how to create a table and how to attach
the storage manager to some columns.
// tableDesc is a table description that has been set up beforehand
SetupNewTable newtab("name.data", tableDesc, Table::New);
IncrementalStMan stman;                  // define the storage manager
newtab.bindColumn ("column1", stman);    // bind column1 to the storage manager
newtab.bindColumn ("column2", stman);    // bind column2 to the storage manager
Table tab(newtab);                       // actually create the table
Member Description
explicit IncrementalStMan (uInt bucketSize = 0, Bool checkBucketSize = True, uInt cacheSize = 1)
explicit IncrementalStMan (const String& dataManagerName, uInt bucketSize = 0, Bool checkBucketSize = True, uInt cacheSize = 1)
Create an incremental storage manager with the given name.
If no name is given, it is set to an empty string.
The name can be used to construct a ROIncrementalStManAccessor
object (e.g. to set the cache size).
The bucket size has to be given in bytes and the cache size in buckets.
Bucket size 0 means that the storage manager will set the bucket
size such that it can contain about 100 rows
(with a minimum size of 32768 bytes). However, if that results
in a very large bucket size (>327680 bytes), a smaller size is used.
Note that it assumes a size of 32 bytes for variable length strings,
so this heuristic may fail when a column contains large strings.
When checkBucketSize is set and the bucket size is > 0,
the storage manager throws an exception
when the size is too small to hold the values of at least 2 rows.
For this check it uses 0 for the length of variable length strings.
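The bucket size rules can be written down as a small sketch. Both functions are hypothetical helpers, not part of the class interface. The sketch assumes that an oversized default is capped at 327680 bytes (the text does not say which smaller size is chosen) and it ignores any per-bucket overhead. The check returns a flag where the storage manager itself throws an exception.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Default bucket size for bucketSize == 0 (hypothetical helper).
// 'rowBytes' is the size in bytes of one row of all bound columns,
// counting 32 bytes for each variable length string.
// Assumption: an oversized result is capped at 327680 bytes.
std::size_t sketchDefaultBucketSize(std::size_t rowBytes) {
    assert(rowBytes > 0);
    const std::size_t minSize = 32768;
    const std::size_t maxSize = 327680;
    return std::min(std::max(100 * rowBytes, minSize), maxSize);
}

// Check applied to an explicitly given bucket size when checkBucketSize
// is set (hypothetical helper). 'rowBytesNoStrings' is the size of one row
// counting 0 bytes for variable length strings.
// Returns true where the storage manager would throw an exception.
bool bucketSizeTooSmall(std::size_t bucketSize, std::size_t rowBytesNoStrings) {
    return bucketSize > 0 && bucketSize < 2 * rowBytesNoStrings;
}
```

Under these assumptions a row of 8 bytes gives the minimum of 32768 bytes, a row of 1000 bytes gives 100000 bytes, and a row of 10000 bytes is limited to 327680 bytes.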
IncrementalStMan (const IncrementalStMan& that)
Copy constructor cannot be used.
IncrementalStMan& operator= (const IncrementalStMan& that)
Assignment cannot be used.