By default, Linux clients mount NFS connections to storage servers using TCP as the underlying transport protocol. This default mount configuration sets up a single TCP connection between the client port and a single storage port. This mode of NFS client mounting is the simplest to use, requiring no additional installations on top of any Linux distribution. However, the performance of the connection is limited by the use of a single TCP socket on the client side and by the use of a single storage port on a single CNode on the cluster side. Typically, we see a bandwidth of around 2 GBps on NFS/TCP mount points.
Several features build on NFS client connections to provide high-speed connections for heavier compute workloads. All of these features are supported with NFSv3, and some are also supported with NFSv4.1. They can be used in any combination to increase the performance of an NFS client mount point. Native client support varies by feature and by Linux distribution. VAST provides an NFS software stack, VAST-NFS, which enables you to build a driver on your client to support the features you want to use on top of your Linux distribution. Read on for a brief synopsis of each feature's advantages and guidelines for setup and usage.
Supported for both NFSv3 and NFSv4.1, nconnect is a mount option that tells the client to open multiple connections to the destination. Up to 16 TCP connections can be created between the client and a single storage port (a single VIP on the cluster). Using nconnect over TCP was found to increase the bandwidth on a single mount to as much as 7 GBps. IOPS remain unchanged because a single CNode is still in use, as in the standard TCP mode.
nconnect is available out of the box in relatively recent Linux distributions, starting with RHEL 8.3 and Ubuntu 20.04.
On older Linux distributions, such as RHEL 7.X, the VAST-NFS client driver is required to add nconnect support. For installation instructions, see the VAST-NFS documentation.
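The nconnect option described above can be sketched as a mount command. This is a minimal illustration, not a definitive procedure: the VIP, export path, and mount point are hypothetical placeholders, and the helper function nconnect_mount_cmd is introduced here only for the example. The script prints the command rather than running it, so that it can be reviewed first.

```shell
#!/bin/sh
# Build (but do not run) an NFSv3 mount command that uses nconnect.
# All values below are hypothetical placeholders, not values from a real cluster.
VIP="192.0.2.10"          # a single VIP on the cluster (hypothetical)
EXPORT_PATH="/"           # exported path (hypothetical)
MOUNT_POINT="/mnt/vast"   # local mount point (hypothetical)

nconnect_mount_cmd() {
    # $1 = number of connections, $2 = VIP, $3 = export path, $4 = mount point
    # Up to 16 TCP connections can be opened to a single storage port.
    if [ "$1" -lt 1 ] || [ "$1" -gt 16 ]; then
        echo "nconnect must be between 1 and 16 for a single storage port" >&2
        return 1
    fi
    printf 'mount -t nfs -o vers=3,proto=tcp,nconnect=%s %s:%s %s\n' "$1" "$2" "$3" "$4"
}

# Print the command; run the printed line as root to perform the mount.
nconnect_mount_cmd 16 "$VIP" "$EXPORT_PATH" "$MOUNT_POINT"
```

After mounting, the number of established TCP connections to the VIP can be checked with a socket listing tool on the client.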
Remote Direct Memory Access (RDMA) is a protocol that allows for a client system to copy data from a storage server’s memory directly into that client’s own memory. This allows the client system to bypass many of the buffering layers inherent in TCP. This direct communication improves storage throughput and reduces latency in moving data between server and client. It also reduces CPU utilization on both the client and the storage server.
Mounting with RDMA as the transport protocol, instead of TCP, increases throughput by bypassing the TCP socket limitations. Mounting over RDMA with a single connection between one client port and one storage port can raise the bandwidth to as much as 9 GBps.
RDMA is highly recommended for InfiniBand networks, where TCP traffic is costly in interrupts, and therefore in CPU.
NFSoRDMA is available in VAST-NFS for RHEL 7.X and is highly recommended over the built-in version, as it contains the most recent bug fixes and optimizations. It is also available and known to work well for RHEL 8.X.
NFSoRDMA is supported for NFSv3. From VAST Cluster 4.3, NFSoRDMA is also supported for NFSv4.1.
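As a rough sketch of what an NFSoRDMA mount can look like, the script below builds a mount command that selects RDMA as the transport. The VIP, export path, and mount point are hypothetical, the helper rdma_mount_cmd exists only for this example, and port 20049 is the port conventionally used for NFS over RDMA on Linux; it is assumed here that the cluster listens on it.

```shell
#!/bin/sh
# Build (but do not run) an NFS mount command that uses RDMA as the transport.
# Addresses and paths are hypothetical placeholders.

rdma_mount_cmd() {
    # $1 = NFS version (3 or 4.1), $2 = VIP, $3 = export path, $4 = mount point
    case "$1" in
        3|4.1) ;;   # the versions named in the text above
        *) echo "unsupported NFS version: $1" >&2; return 1 ;;
    esac
    # port=20049 is an assumption: the conventional NFS/RDMA port on Linux.
    printf 'mount -t nfs -o vers=%s,proto=rdma,port=20049 %s:%s %s\n' "$1" "$2" "$3" "$4"
}

# Print the command; run the printed line as root to perform the mount.
rdma_mount_cmd 3 192.0.2.10 / /mnt/vast
```

Passing 4.1 as the version is only meaningful on clusters that support NFSoRDMA for NFSv4.1, as noted above.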
An RDMA-capable NIC on the client host, such as Mellanox ConnectX series.
InfiniBand or RDMA over Converged Ethernet (RoCE).
With RoCE, a lossless Ethernet network, in order to reach performance characteristics similar to those found with RDMA on InfiniBand networks. We recommend using a flow control mechanism and Explicit Congestion Notification (ECN) when running RoCE. On lossy networks, it might be best to stick with TCP.
Recommended: Mellanox OpenFabrics Enterprise Distribution (MLNX_OFED) driver. This may not be mandatory for modern distributions.
In some distributions, MLNX_OFED disables NFSoRDMA. In such cases, VAST-NFS is required to reenable NFSoRDMA.
Follow the instructions at How To Install MLNX_OFED Driver to install the latest version of MLNX_OFED.
Make sure to include the following command line options:
# ./mlnxofedinstall --add-kernel-support --force
Verify that MLNX_OFED is loaded successfully before proceeding to install VAST-NFS.
See the VAST-NFS documentation for instructions.
The following flow control mechanisms are supported:
PFC, for clusters with dual NIC CNodes, deployed with one NIC in use for internal traffic and the other in use for external traffic.
Global pause, which is available with both single NIC and dual NIC CNodes.
Both require a lossless configuration on the switch. PFC must be configured on the client and also on the cluster's CNodes.
To configure PFC on the CNodes, run the configure_network.py script with the
Multipath was developed to enable unlimited performance for NFS clients. Multipath enables a client to open multiple connections from multiple ports to multiple addresses. It is supported for NFSv3; support for NFSv4.1 is planned for future versions.
Multipath differs greatly from NFSoRDMA and nconnect in scale: instead of creating one-to-one connections between clients and CNodes, multipath creates several connections across multiple CNodes and yields good load distribution in the cluster.
Multipath was demonstrated with NVIDIA DGX platforms reading a whopping 162 GBps (that's gigabytes, not gigabits), which is the maximum a DGX can transfer on its PCI switches.
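To make the gigabyte versus gigabit distinction concrete, the short sketch below converts the bandwidth figures quoted in this article from GBps to Gbps by multiplying by 8. The helper name gbytes_to_gbits is introduced only for this illustration.

```shell
#!/bin/sh
# Convert a bandwidth figure from gigabytes per second to gigabits per second.
gbytes_to_gbits() {
    # $1 = bandwidth in GBps (whole number); 1 byte = 8 bits
    echo $(( $1 * 8 ))
}

# Figures quoted in this article, in GBps:
# single TCP connection, nconnect over TCP, single RDMA connection, multipath on DGX.
for gbps in 2 7 9 162; do
    echo "${gbps} GBps = $(gbytes_to_gbits "$gbps") Gbps"
done
```

The last line of output shows that 162 GBps corresponds to 1296 Gbps, well beyond what a single 100 Gbps port can carry.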
Multipath requires installing the VAST-NFS client. See the VAST-NFS User Guide for instructions.
To mount a view with multipath, pass the following parameters in the mount command:
remoteports=<START_IP>-<END_IP>. This parameter specifies a range of IPv4 addresses on the remote cluster to use for outbound connections. Specify IP addresses in a VIP pool configured on the cluster.
nconnect=N. This parameter determines how many connections the client creates across the range of valid addresses specified by remoteports. N is the number of connections. The minimum value is 1.
To saturate a 100 Gbps NIC, we recommend nconnect=4 with RDMA or
Specify the entire range of virtual IPs with remoteports in order to achieve even load distribution across all CNodes.
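Putting the two parameters together, a multipath mount might be sketched as follows. This is an illustration under stated assumptions: the VIP range, export path, and mount point are hypothetical, the helper multipath_mount_cmd exists only for this example, using the first VIP of the range as the mount target is an assumption, and port 20049 is the conventional NFS/RDMA port on Linux rather than a value taken from this article. The remoteports option requires the VAST-NFS client.

```shell
#!/bin/sh
# Build (but do not run) an NFSv3 multipath mount command over RDMA.
# Addresses and paths are hypothetical placeholders.

multipath_mount_cmd() {
    # $1 = first VIP in the pool, $2 = last VIP in the pool,
    # $3 = nconnect value, $4 = export path, $5 = mount point
    if [ "$3" -lt 1 ]; then
        echo "nconnect must be at least 1" >&2
        return 1
    fi
    # The first VIP doubles as the mount target (an assumption for this sketch).
    printf 'mount -t nfs -o vers=3,proto=rdma,port=20049,nconnect=%s,remoteports=%s-%s %s:%s %s\n' \
        "$3" "$1" "$2" "$1" "$4" "$5"
}

# nconnect=4 with RDMA, spanning the whole (hypothetical) VIP pool.
multipath_mount_cmd 192.0.2.1 192.0.2.16 4 / /mnt/vast
```

Covering the full VIP pool in remoteports, as in the call above, is what spreads the connections across all CNodes.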