CASA Hardware Considerations

The following document applies to CASA versions up to 4.6.0 and is specific to VLA and ALMA data sets. Data from other instruments may require different configurations, particularly with respect to memory footprint during imaging.

For the current version of CASA the following three areas of system configuration need to be balanced to provide the most efficient performance versus cost:

  1. file I/O
  2. memory
  3. processing power (CPU)

Note that parallel CASA is still in development; this involves parallelizing individual pipeline tasks, improving serial performance of individual tasks and converting some I/O routines to an asynchronous form. The recommendations below consider both serial and future parallel CASA use.

File I/O

The CASA team is working to improve I/O efficiency by coalescing reads into larger bulk requests and reducing the number of iterations through an MS. The team is also investigating methods to enable asynchronous I/O calls and to improve explicit caching of frequently accessed table data.

For parallel CASA, even after the above improvements are completed, the imaging task will likely be I/O limited during gridding if data is stored on a single hard drive. New 6TB single hard drives can typically provide data at ~200MB/sec. This data rate is not sufficient to keep a dual processor 16 core machine supplied with data for some tasks. A good rule of thumb for the I/O subsystem is roughly 25 to 50MB/s per core.
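The rule of thumb above translates directly into an aggregate throughput target. The following is a minimal sketch of that arithmetic; the function name io_target_mbps is hypothetical, and the 25 and 50 MB/s bounds are simply the rule-of-thumb figures quoted above rather than measured values.

```shell
# Aggregate I/O target (MB/s) for a given core count, using the
# 25 to 50 MB/s per core rule of thumb quoted in the text.
# io_target_mbps is a hypothetical helper name, not part of CASA.
io_target_mbps() {
    cores=$1
    echo "$((cores * 25)) $((cores * 50))"
}

# Example: a dual processor, 16 core machine.
io_target_mbps 16    # prints "400 800"
```

For 16 cores the target works out to 400 to 800 MB/s, roughly two to four times the ~200MB/sec a single drive can supply.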

For relatively small (< 100GB) data sets Solid State Disks (SSD) may prove a cost effective solution. As SSD capacities continue to rise and prices continue to drop they are becoming increasingly attractive alternatives to spindle based drives.

For some data sets, particularly those larger than 300GB, a small RAID array of 3 or 4 hard drives is still potentially the most cost effective. The RAID can either be created in hardware via an onboard controller or simply done in software utilizing excess processing capacity within the existing CPU.

Intel has released a new model of NVMe (Non-Volatile Memory Express) flash based storage devices. These new devices support capacities up to 4TB, offer random access performance 10 to 100s of times that of SSD or spindle drives, and provide aggregate throughput in the GB/s range. A 4TB Intel P3608 disk could effectively double the cost of a workstation but could improve CASA performance by a factor of 5 to 10.

CASA's parallelization approach requires that all nodes and cores have access to the complete Measurement Set (MS). For the workstation cases this is trivially achieved by interacting with a local filesystem. For the cluster case all systems must share access to a single filesystem. The NRAO has opted to use the distributed parallel filesystem Lustre. Any distributed filesystem will work but performance can vary greatly between such simple approaches as NFS and more sophisticated approaches such as Lustre, Gluster or GPFS. Specifying a high speed parallel filesystem suitable for an 8 node, 16 node or larger cluster is beyond the scope of this document.

Memory

There are two areas of memory consumption typically associated with CASA: the caching of raw visibilities and the storing of intermediate and final images.

The first is implicit consumption in the form of filesystem cache. If a system has more memory than the data set size the data will be cached in memory and I/O rates are greatly improved for subsequent accesses. If the system has less memory than the data set size then subsequent accesses will result in a cache miss (data will be evicted from cache in favor of more recently accessed data) and revert to disk access. It's typically more cost effective to invest in a faster I/O subsystem than sufficient memory to cache large (100GB+) data sets.
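Whether a given data set is likely to stay in the filesystem cache can be estimated by comparing its size on disk with the installed memory. The sketch below assumes a Linux system; fits_in_cache is a hypothetical helper name and my.ms is a placeholder path. It ignores memory used by CASA itself and by images, so a "yes" should be read as an optimistic estimate.

```shell
# Report whether a data set of a given size could fit in the filesystem
# cache of a machine with a given amount of memory (both in kB).
# fits_in_cache is a hypothetical helper; it ignores memory consumed by
# CASA and by images, so "yes" is an optimistic estimate.
fits_in_cache() {
    dataset_kb=$1
    mem_kb=$2
    if [ "$dataset_kb" -lt "$mem_kb" ]; then
        echo yes
    else
        echo no
    fi
}

# Typical use on Linux (my.ms is a placeholder path):
#   fits_in_cache "$(du -sk my.ms | cut -f1)" \
#                 "$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)"
```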

The second area is explicit consumption in the form of images. There must be sufficient memory to store all intermediate images or swapping will occur which drastically reduces performance. Wide-band, wide-field and newer imaging algorithms like A and W-projection will consume a larger portion of memory.

We are still examining memory requirements as these algorithms are still in development. For spectral line and narrow field, narrow band continuum imaging we currently recommend a minimum of 4GB of memory per utilized processor core (e.g. fully utilizing dual processors with 8 cores each would require 64GB of system memory). For more memory intensive algorithms 8GB or 16GB per core may be required.

The memory footprint for CASA proper is quite small: 500MB per parallel instance should be sufficient. For an 8 core node this results in a 4GB footprint plus whatever is needed for imaging.
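The two figures above (memory per core for imaging and ~500MB per parallel CASA instance) can be worked out for any node size. This is a minimal sketch under the assumption of one CASA instance per core; node_memory_gb is a hypothetical helper name, and the per-core value passed in is the recommendation from the text (4, 8 or 16).

```shell
# Print two numbers for a node: the per-core memory recommendation
# multiplied out (GB), and the CASA footprint at ~500MB per parallel
# instance, one instance per core, rounded up to whole GB.
# node_memory_gb is a hypothetical helper name.
node_memory_gb() {
    cores=$1
    gb_per_core=$2
    imaging_gb=$((cores * gb_per_core))
    casa_gb=$(( (cores * 500 + 999) / 1000 ))
    echo "$imaging_gb $casa_gb"
}

node_memory_gb 8 4     # prints "32 4"
node_memory_gb 16 4    # prints "64 8"
```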

Processing

We are still examining the most effective balance of number of cores, number of nodes and speed of cores for a given cost and science case. Assuming the use of parallel CASA and sufficient I/O to keep all cores supplied with data, then in general spectral line imaging and most continuum imaging will benefit more from a larger number of cores than faster clock speed. Complex continuum imaging cases currently benefit more from faster clocks than number of cores due to serial portions in the imaging algorithm but that may change in future CASA releases.

If the proposed system will be interacting with a single disk, then faster processor clock speed provides more benefit than total number of cores.

If the proposed system will have faster file access via Solid State Disks (SSD), local RAID arrays or high speed networked file systems, it may be worthwhile to invest in a CPU with both a faster clock and a higher core count (e.g. an Intel 2.6GHz E5-2640 v3 8 core CPU).

Even if a system will initially have a single processor, make certain the motherboard and processor type can support dual processors for future expansion. Typically the CPU represents only 25% to 50% of the cost of a system, so adding a second processor is more cost effective than adding another machine to a workgroup. For currently available serial versions of CASA, clock speed is the most important factor; for future parallel versions, total core count becomes more critical for a given cost.

Future updates to this document are planned with the release of each new version of CASA. These updates will more clearly show the line where faster processors or higher core count systems have decreasing merit as a function of I/O subsystem and imaging case. They will also further clarify processor options as a function of science case and data size.


CASA Hardware Recommendations

Current Recommendation

Some example configurations are listed below in a variety of price ranges. All configurations assume eventual parallel CASA use and therefore include dual processors. The configurations also assume ~4GB of memory per physical core and a RAID array to keep the processors supplied with data. They are intended as guidelines for how all 3 subsystems (I/O, memory, and CPU speed plus physical core count) should be scaled together. 128GB of memory in a low end system, 8GB of memory in a high end system, or single disks in any of them is probably not cost effective.

A comprehensive listing of the current generation of Intel Haswell processors can be found here. In general the desktop variants i5-4100 to i5-4600 and i7-4700 to i7-5900 are single socket only. The server class E5-1600v3 and E5-2600v3 models are dual socket. NRAO has tested only one quad socket system with four older Sandy Bridge E5-4640 8 core processors. The non-linear base system price increase and extreme difficulty supplying all 32 cores with data made 4 processor and higher systems an unattractive option.

RAID 5 is recommended over single disks. RAID-0 may have higher performance, but we do not recommend it due to the significantly higher risk of data loss.
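The N+1 notation used for the RAID-5 arrays in this document means N drives' worth of data plus one drive's worth of parity. The sketch below shows the resulting usable capacity; raid5_usable_tb is a hypothetical helper name, and the calculation ignores filesystem overhead and the difference between decimal and binary terabytes.

```shell
# Usable capacity (TB) of a single-parity RAID-5 array: one drive's
# worth of space holds parity, so (drives - 1) * drive size remains.
# raid5_usable_tb is a hypothetical helper name.
raid5_usable_tb() {
    drives=$1
    drive_tb=$2
    echo "$(( (drives - 1) * drive_tb ))"
}

raid5_usable_tb 3 3    # 2+1 array of 3TB drives: prints 6
raid5_usable_tb 4 6    # 3+1 array of 6TB drives: prints 18
```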

Recommended low end workstation ~4K USD

  Component                                                       Notes, limitations
  Single Intel E5-1620v3 3.5GHz 4 core processor                  Optionally consider single higher core count processor with a plan to add a second
  32 GB RAM as four 8GB DIMMS                                     Best if system can accept 8 DIMMS to allow for future expansion
  Three 3TB Western Digital RE 7200 RPM SATA Drives               Configured as 6TB Software 2+1 RAID-5 array

Recommended medium workstation ~6K USD

  Component                                                       Notes, limitations
  Dual Intel E5-2620v3 2.4GHz 6 core processors (12 cores total)
  32 GB RAM as four 8GB DIMMS                                     May need 64GB RAM depending on imaging case
  Four 4TB Western Digital RE 7200 RPM SATA Drives                Configured as 12TB Software 3+1 RAID-5 array

Recommended high end workstation ~9K USD

  Component                                                       Notes, limitations
  Dual Intel E5-2640v3 2.6GHz 8 core processors (16 cores total)
  64 GB RAM as four 16GB DIMMS                                    Best if system can accept 8 DIMMS to allow for future expansion
  Four 6TB Seagate 7200 RPM SATA Drives                           Configured as 18TB Software 3+1 RAID-5 array

Optional High Speed NVMe storage 3.6K to 7.5K USD

Confirm your system has the necessary PCIe slot before purchasing.

  Component                                                       Price, performance
  Intel 1.6TB P3608 SSDPECME016T401 NVMe PCIe SSD                 3.6K USD, ~1.6GB/s sequential read/write
  Intel 4TB P3608 SSDPECME040T401 NVMe PCIe SSD                   7.5K USD, ~4.5GB/s sequential read/write

A cluster is simply 2 or more nodes connected through a high speed switch (10Gbit Ethernet or 40Gbit InfiniBand) to a central parallel filesystem. A central filesystem removes the need to repeatedly distribute/gather/distribute data to/from disks on individual nodes.

The NRAO EVLA and ALMA post processing clusters consist of multiple 1U rack mounted servers identical to the recommended high end workstations, with the exception that instead of local 4+1 RAID5 arrays each node connects via 40Gbit InfiniBand to a Lustre filesystem consisting of multiple 4+2 RAID 6 arrays. There is roughly one 4+2 RAID6 array within the Lustre filesystem per cluster compute node.

Various parallel filesystem options exist. Documenting and recommending a specific filesystem is beyond the scope of this document. All require more labor and administration to properly maintain them. The likely options are Lustre (which NRAO uses), GlusterFS, GPFS (IBM commercial) or GFS2 (Red Hat SAN based filesystem). For small 4 or 8 node clusters a 10Gbit network, high speed RAID array and simple NFS may be sufficient. Discussion with local IT support is highly advised before considering a cluster and parallel filesystem option.

Additional ALMA specific information can be found in the data processing section of the ALMA science web portal.


External RAID configurations

In some cases it may be more advisable to use an external RAID array to retrofit an existing system or to avoid disk capacity limits in smaller workstation cases.

Assuming the system has a spare PCI-E slot, an eSATA-II (external SATA) or eSAS controller is needed to connect the RAID array to the host computer.

For the disk array we recommend this 5 disk external RAID enclosure.

For 2TB to 4TB hard drives we recommend the Western Digital RE (RAID Enterprise version 4) model drives. With roughly 1000 drives comprising NRAO Lustre and data archive arrays we have good experience with their performance and reliability. The 2TB WD2000FYYZ, 3TB WD3000FYYZ and 4TB WD4000FYYZ drives are the best candidates. The 4TB are roughly 50% faster than the 2TB models due to newer microcontrollers and higher areal data density. For 6TB drives the NRAO has adopted the 15K RPM Seagate ST9300653SS SAS drives and the 7.2K RPM ST6000NM0125 SATA drives.


Tuning considerations

Hyperthreading

In NRAO's experience CASA does not benefit from hyperthreading; in fact tests show a detrimental effect with hyperthreading turned on. We recommend hyperthreading be turned off in the BIOS.


Virtual Memory

Beginning with RHEL6, Red Hat introduced transparent hugepage support. This feature interacts badly with CASA and some newer Sandy Bridge motherboards. On affected machines, processes will periodically spike to 100% system CPU utilization as they very rapidly do nothing. The following command will disable this feature: echo "never" > /sys/kernel/mm/redhat_transparent_hugepage/enabled
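Before and after changing the setting it can be useful to confirm which mode is active. The control file lists the available modes with the active one in square brackets, for example "[always] madvise never". The sketch below reads that value; thp_mode is a hypothetical helper name, and the second path is the one used by upstream kernels, included here as an assumption about non-RHEL6 systems.

```shell
# Print the active transparent hugepage mode from a sysfs-style file,
# whose content looks like "[always] madvise never" with the active
# mode in brackets. thp_mode is a hypothetical helper name.
thp_mode() {
    sed -n 's/.*\[\(.*\)\].*/\1/p' "$1"
}

# Assumed paths: the RHEL6 one from the text, then the upstream kernel one.
for f in /sys/kernel/mm/redhat_transparent_hugepage/enabled \
         /sys/kernel/mm/transparent_hugepage/enabled; do
    if [ -r "$f" ]; then
        echo "$f: $(thp_mode "$f")"
    fi
done
```

Writing to the control file requires root, and the setting does not persist across reboots unless it is reapplied at boot time.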


Disk I/O

Local disks and disk arrays may benefit from properly tuned block level read-ahead. This causes the disk subsystem to pre-fetch data from the disk when reading, on the assumption that the next I/O request will be a read for the next block(s). This can be set via /sbin/blockdev --setra {N sectors} /dev/sdX where N sectors is the number of 512 byte sectors to prefetch and /dev/sdX is the device. To set the read ahead to 4MBytes for the first partition of device sdb, the command would be /sbin/blockdev --setra 8192 /dev/sdb1. The value can be read back with /sbin/blockdev --getra /dev/sdb1. The optimal value can vary as a function of file fragmentation and parallelization breadth (competing reads). For CASA, a read ahead of 1MB to 4MB is generally better than the default value.
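Because blockdev takes the read-ahead in 512 byte sectors rather than bytes, a small conversion helps avoid mistakes when trying different values. This is a minimal sketch; ra_sectors is a hypothetical helper name and it treats 1MB as 2^20 bytes, which matches the 4MBytes to 8192 sectors example above.

```shell
# Convert a read-ahead size in MB (2^20 bytes) to the count of
# 512 byte sectors expected by blockdev --setra.
# ra_sectors is a hypothetical helper name.
ra_sectors() {
    mb=$1
    echo "$(( mb * 1024 * 1024 / 512 ))"
}

ra_sectors 1    # prints 2048
ra_sectors 4    # prints 8192

# Applying it (requires root; /dev/sdb1 is the example device from the text):
#   /sbin/blockdev --setra "$(ra_sectors 4)" /dev/sdb1
#   /sbin/blockdev --getra /dev/sdb1
```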

NVMe setup

To fully utilize all the capabilities of the new NVMe drives it's necessary to upgrade to a 4.x kernel, but stock RHEL 6.x systems can still make good use of their features. Intel has a nice set of documentation for configuring the drives.

Last modified by James Robnett September 21st, 2016
