Hardware Recommendations

THIS PAGE IS OUT OF DATE! The CASA Team does not have the appropriate resources to maintain recommendations regarding hardware. Please proceed at your discretion.

The following document applies to CASA versions up to 5.6 (and planned 6.1 release) and is specific to VLA and ALMA data sets. Data from other instruments may require different configurations, particularly with respect to memory footprint during imaging.

For the current version of CASA the following three areas of system configuration need to be balanced to provide the most efficient performance versus cost:

file I/O
memory
processing power (CPU)

Note that the recommendations below consider both serial and parallel CASA use. This includes parallelizing individual tasks, improving serial performance of individual tasks, and converting some I/O routines to an asynchronous form. The pipeline does not use the full parallelization options that CASA currently offers.

File I/O

For parallel CASA, the imaging task will likely be I/O limited during gridding if data is stored on a single hard drive. New 6TB single hard drives can typically provide data at ~200MB/sec. This data rate is not sufficient to keep a dual processor 16 core machine supplied with data for some tasks. A good rule of thumb for the I/O subsystem is roughly 25 to 50MB/s per core.
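The rule of thumb above can be turned into a quick sizing check. The following is a minimal sketch, not part of CASA; the function names are hypothetical, and the default values (25 to 50MB/s per core, ~200MB/s for a single drive) are taken directly from the paragraph above.

```python
def required_io_mb_per_s(cores, per_core_low=25.0, per_core_high=50.0):
    """Return the (low, high) aggregate throughput in MB/s suggested by
    the 25 to 50 MB/s per core rule of thumb."""
    return cores * per_core_low, cores * per_core_high


def is_io_limited(cores, storage_mb_per_s=200.0):
    """Return True if the storage cannot meet even the low end of the rule.

    The 200 MB/s default is the single-drive figure quoted in the text.
    """
    low, _high = required_io_mb_per_s(cores)
    return storage_mb_per_s < low


if __name__ == "__main__":
    low, high = required_io_mb_per_s(16)
    print(f"16 cores need roughly {low:.0f} to {high:.0f} MB/s")
    print("single drive is I/O limited:", is_io_limited(16))
```

For the dual processor, 16 core machine mentioned above, this gives a target of 400 to 800MB/s, which is why a single ~200MB/s drive falls short.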

For relatively small (< 100GB) data sets, Solid State Disks (SSD) may prove a cost effective solution. As SSD capacities continue to rise and prices continue to drop, they are becoming increasingly attractive alternatives to spindle based drives.

For some data sets, particularly those larger than 300GB, a small RAID array of 3 or 4 hard drives is still potentially the most cost effective. The RAID can either be created in hardware via an onboard controller or simply done in software utilizing excess processing capacity within the existing CPU.

Intel has released a new model of NVMe (Non-Volatile Memory Express) flash based storage devices. These new devices support capacities up to 8TB, have random access times tens to hundreds of times faster than those of SSD or spindle drives, and deliver aggregate throughput in the GB/s range. A 4TB Intel P3608 disk could effectively double the cost of a workstation but could improve performance of some CASA tasks by a factor of 2 or more.

CASA's parallelization approach requires that all nodes and cores have access to the complete Measurement Set (MS). For the workstation cases, this is trivially achieved by interacting with a local filesystem. For the cluster case, all systems must share access to a single filesystem. The NRAO has opted to use the distributed parallel filesystem Lustre. Any distributed filesystem will work, but performance can vary greatly between such simple approaches as NFS and more sophisticated approaches such as Lustre, Gluster or GPFS. Specifying a high-speed, parallel filesystem suitable for an 8-node, 16-node, or larger cluster is beyond the scope of this document.

Memory

There are two areas of memory consumption typically associated with CASA: the caching of raw visibilities and the storing of intermediate and final images.

The first is implicit consumption in the form of filesystem cache. If a system has more memory than the data set size, the data will be cached in memory and I/O rates are greatly improved for subsequent accesses. If the system has less memory than the data set size, then subsequent accesses will result in a cache miss (data will be evicted from cache in favor of more recently accessed data) and revert to disk access. It's typically more cost effective to invest in a faster I/O subsystem than sufficient memory to cache large (100GB+) data sets.

The second area is explicit consumption in the form of images. There must be sufficient memory to store all intermediate images, or swapping will occur, which drastically reduces performance. Wide-band, wide-field and newer imaging algorithms, like A- and W-projection, will consume a larger portion of memory.

We are still examining memory requirements as these algorithms are still in development. For spectral line and narrow field, narrow band continuum imaging, we currently recommend a minimum of 8GB of memory per utilized processor core (e.g. fully utilizing dual processors with 8 cores each would require 128GB of system memory). For more memory intensive algorithms, 16GB or 32GB per core may be required.

The memory footprint for CASA proper is quite small: 500MB per parallel instance should be sufficient. For an 8 core node this results in a 4GB footprint plus whatever is needed for imaging.
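The memory guidance above reduces to simple arithmetic. The sketch below is illustrative only; the function name is hypothetical, and it assumes one parallel CASA instance per utilized core, which the text implies for the 8 core example but does not state as a general rule.

```python
def memory_estimate_gb(cores, gb_per_core=8.0, casa_gb_per_instance=0.5):
    """Return (imaging_gb, casa_footprint_gb) for a node.

    gb_per_core: 8 for spectral line and narrow field, narrow band
        continuum imaging; 16 or 32 for more memory intensive algorithms.
    casa_gb_per_instance: the ~500MB per parallel instance quoted above.
    Assumption: one parallel instance runs per utilized core.
    """
    imaging_gb = cores * gb_per_core
    casa_footprint_gb = cores * casa_gb_per_instance
    return imaging_gb, casa_footprint_gb


if __name__ == "__main__":
    for cores, per_core in ((8, 8.0), (16, 8.0), (16, 32.0)):
        imaging, footprint = memory_estimate_gb(cores, per_core)
        print(f"{cores} cores at {per_core:.0f}GB/core: "
              f"{imaging:.0f}GB imaging, {footprint:.0f}GB CASA footprint")
```

With the defaults, 16 cores yields the 128GB figure and 8 cores yields the 4GB CASA footprint quoted in the text.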

Processing

We continue to examine the most effective balance of number of cores, number of nodes and speed of cores for a given cost and science case. Assuming the use of parallel CASA and sufficient I/O to keep all cores supplied with data, then in general spectral line imaging and most continuum imaging will benefit more from a larger number of cores than faster clock speed. Complex continuum imaging cases currently benefit more from faster clocks than number of cores due to serial portions in the imaging algorithm, but that may change in future CASA releases.

If the proposed system will be interacting with a single disk, then faster processor clock speed provides more benefit than total number of cores.

If the proposed system will have faster file access via Solid State Disks (SSD), local RAID arrays or high speed networked file systems, it may be worthwhile to invest in CPUs that are both faster and have a higher core count (e.g., an Intel Xeon Gold 6136 3.0GHz 12-core CPU).

Even if a system will initially have a single processor, make certain the motherboard and processor type can support dual processors for future expansion. Typically the CPU represents only 25% to 50% of the cost of a system, so adding a second processor is more cost effective than adding another machine to a workgroup. For currently available serial versions of CASA, clock speed is the most important factor; for future parallel versions, total core count becomes more critical for a given cost.

Future updates to this document are planned with the release of each new version of CASA. These updates will more clearly show the line where faster processors or higher core count systems have decreasing merit as a function of I/O subsystem and imaging case. They will also further clarify processor options as a function of science case and data size.

CASA Hardware Recommendations

Current Recommendation

Some example configurations are listed below in a variety of price ranges. All configurations assume eventual parallel CASA use. The configurations also assume 8GB of memory per core and a RAID array to keep the processors supplied with data. They are intended as guidelines for how all three subsystems (I/O, memory and CPU speed plus core count) should be scaled together.

A comprehensive listing of the current generation of Intel Skylake processors can be found here. In general, the desktop variants i7-7800 and i9-7900 are single socket only. The workstation class E3-1220v5 and the server class Xeon Gold 6136 models are dual socket. NRAO has tested only one quad socket system with four older Sandy Bridge E5-4640 8 core processors. The non-linear increase in price of the base system and extreme difficulty supplying all 32 cores with data made 4 processor and higher systems an unattractive option.

RAID 5 is recommended over single disks. RAID-0 may have higher performance, but we do not recommend it due to the significantly higher risk of data loss.
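When comparing array layouts, it helps to work out the usable capacity for each RAID level. The sketch below uses the standard capacity rules (RAID-0 has no redundancy, RAID-5 gives up one drive's worth of space to parity, RAID-6 gives up two); the function name is hypothetical and the code is not tied to any particular RAID tool.

```python
def raid_usable_tb(level, drives, drive_tb):
    """Return usable capacity in TB for equal-sized drives.

    RAID-0: all drives hold data, no redundancy.
    RAID-5: one drive's worth of parity, needs at least 3 drives.
    RAID-6: two drives' worth of parity, needs at least 4 drives.
    """
    if level == 0:
        if drives < 2:
            raise ValueError("RAID-0 needs at least 2 drives")
        return drives * drive_tb
    if level == 5:
        if drives < 3:
            raise ValueError("RAID-5 needs at least 3 drives")
        return (drives - 1) * drive_tb
    if level == 6:
        if drives < 4:
            raise ValueError("RAID-6 needs at least 4 drives")
        return (drives - 2) * drive_tb
    raise ValueError(f"unsupported RAID level: {level}")


if __name__ == "__main__":
    print(raid_usable_tb(5, 3, 3))  # three 3TB drives in RAID-5
    print(raid_usable_tb(5, 4, 8))  # four 8TB drives in RAID-5
```

Three 3TB drives in RAID-5 give 6TB usable, and four 8TB drives give 24TB usable; the same drives in RAID-0 would give 9TB and 32TB but would lose everything if a single drive failed.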

**Suggested low-end workstation ~$3,000 USD**
Hardware	Notes
Single Intel i7-9700K 3.6GHz 8-core processor	Optionally consider single higher core count processor with a plan to add a second
64 GB RAM as four 16GB DIMMs	Best if system can accept at least 8 DIMMs to allow for future expansion
Three 3TB HGST 7200 RPM SATA Drives	Configured as 6TB Software RAID-5 (2+1) array

**Suggested medium workstation ~$6,000 USD**
Hardware	Notes
Single Intel Xeon Silver 5217 3.0GHz 8-core processor (8 cores total)
128 GB RAM as eight 16GB DIMMs or four 32GB DIMMs
Three 8TB HGST 7200 RPM SATA Drives	Configured as 16TB Software RAID-5 (2+1) array

Due to Covid-19 supply chain complications, worldwide memory prices are currently 30 to 60% higher than normal. The above configuration will likely cost $7,000 USD.

**Suggested high-end workstation ~$9,000 USD**
Hardware	Notes
Dual Intel Xeon Silver 5217 3.0GHz 8-core processors (16 cores total)
256GB RAM as eight 16GB DIMMs or four 32GB DIMMs	Best to leave some DIMM slots empty to allow for future expansion
Four 8TB HGST 7200 RPM SATA Drives	Configured as 24TB Software RAID-5 (3+1) array

Due to Covid-19 supply chain complications, worldwide memory prices are currently 30 to 60% higher than normal. The above configuration will likely cost $11,500 USD.

**Optional High Speed NVMe storage $1,300 to $3,700 USD**
Hardware	Notes
Confirm your system has the necessary PCIe slot before purchasing
Intel Optane SSD 905P 1.5TB PCIe NVMe	$1,300 USD up to 2.6GB/s read and 2.2GB/s write
Intel DC P4618 6.4TB PCIe NVMe	$3,700 USD up to 6.6GB/s read and 5.3GB/s write

A cluster is simply 2 or more nodes connected through a high speed switch (10Gbit Ethernet or 40Gbit InfiniBand) to a central parallel filesystem. A central filesystem removes the need to repeatedly distribute/gather/distribute data to/from disks on individual nodes.

The NRAO EVLA and ALMA post-processing clusters consist of multiple 1U rack-mounted servers identical to the recommended high-end workstations, with the exception that instead of local 4+1 RAID5 arrays each node connects via 40Gbit InfiniBand to a Lustre filesystem consisting of multiple 4+2 RAID6 arrays. There is roughly one 4+2 RAID6 array within the Lustre filesystem per cluster compute node.

Various parallel filesystem options exist. Documenting and recommending a specific filesystem is beyond the scope of this document. All require more labor and administration to maintain properly. The likely options are Lustre (which NRAO uses), GlusterFS, GPFS (IBM commercial) or GFS2 (Red Hat SAN based filesystem). For small 4 or 8 node clusters, a 10Gbit network, high speed RAID array and simple NFS may be sufficient. Discussion with local IT support is highly advised before considering a cluster and parallel filesystem option.

Additional ALMA-specific information can be found in the data processing section of the ALMA science web portal (login required).

Tuning considerations

Hyperthreading

In NRAO's experience CASA does not benefit from hyperthreading; in fact tests show a detrimental effect with hyperthreading turned on. We recommend hyperthreading be turned off in the BIOS.

Virtual Memory

Beginning with RHEL6, Red Hat introduced "transparent hugepage support". This feature interacts badly with CASA and some newer Sandy Bridge motherboards. On affected machines, processes will periodically spike to 100% system CPU utilization as they very rapidly do nothing. The following command disables this feature under RHEL6: echo "never" > /sys/kernel/mm/redhat_transparent_hugepage/enabled. This workaround is no longer needed under RHEL7.
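Before changing the setting, it can be useful to check which mode is currently active. The sketch below is one possible way to do that and rests on two assumptions not stated in the text: that the sysfs file lists the available modes with the active one in square brackets (for example "always madvise [never]"), and that kernels other than RHEL6 expose the file at /sys/kernel/mm/transparent_hugepage/enabled. The function names are hypothetical.

```python
from pathlib import Path

THP_PATHS = (
    # RHEL6 location, as given in the text above.
    "/sys/kernel/mm/redhat_transparent_hugepage/enabled",
    # Assumed location on other kernels; not taken from the text.
    "/sys/kernel/mm/transparent_hugepage/enabled",
)


def active_thp_mode(contents):
    """Return the bracketed (active) mode from the sysfs file contents,
    or None if no bracketed entry is present."""
    for token in contents.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_thp_mode(paths=THP_PATHS):
    """Return the active mode from the first path that exists, else None."""
    for candidate in paths:
        path = Path(candidate)
        if path.is_file():
            return active_thp_mode(path.read_text())
    return None


if __name__ == "__main__":
    mode = read_thp_mode()
    print("transparent hugepage mode:", mode if mode else "not found")
```

A result of "never" would indicate the feature is already disabled; None means neither file was found or the contents did not match the assumed format.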

Disk I/O

Local disks and disk arrays may benefit from properly tuned block level read-ahead. This causes the disk subsystem to pre-fetch data from the disk when reading, on the assumption that the next I/O request will be a read for the next block(s). This can be set via /sbin/blockdev --setra {N sectors} /dev/sdX where N sectors is the number of 512 byte sectors to prefetch and /dev/sdX is the device. To set the read-ahead to 4MBytes for the first partition of device sdb, that would be /sbin/blockdev --setra 8192 /dev/sdb1. The value can be read with /sbin/blockdev --getra /dev/sdb1. The optimal value can vary as a function of file fragmentation and parallelization breadth (competing reads). In general, a 1MB to 4MB read-ahead is better than the default value for CASA.
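Because blockdev takes the read-ahead in 512 byte sectors rather than in megabytes, a small conversion helper avoids arithmetic slips. This is a minimal sketch with hypothetical function names; it only builds the command string and does not run it, and it treats 1MB as 1024 x 1024 bytes, which matches the 4MBytes = 8192 sectors example above.

```python
SECTOR_BYTES = 512  # blockdev --setra counts 512 byte sectors


def readahead_sectors(megabytes):
    """Convert a read-ahead size in MB (1MB = 1024 * 1024 bytes) to sectors."""
    return megabytes * 1024 * 1024 // SECTOR_BYTES


def setra_command(device, megabytes):
    """Build the blockdev command line that sets the read-ahead for device."""
    return f"/sbin/blockdev --setra {readahead_sectors(megabytes)} {device}"


if __name__ == "__main__":
    for size_mb in (1, 2, 4):
        print(setra_command("/dev/sdb1", size_mb))
```

The 1MB to 4MB range suggested above corresponds to 2048 to 8192 sectors.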

NVMe setup

To fully utilize all the capabilities of the new NVMe drives, it is necessary to upgrade to a 4.x kernel. Stock RHEL 6.x systems can still make good use of their features, however. Intel has a nice set of documentation for configuring the drives.