Introduction
VAST provides a high performance file server implementation supporting the NFS, S3, and SMB protocols. Unlike some file systems, VAST does not require a client-side driver; for all protocols we support the use of native operating system client drivers. For NFS, VAST does provide an optional high performance driver, but its use is not required. Whether you use the native NFS client or VAST's optional high performance NFS client, VAST is largely self tuning - requiring very little in the way of server side configuration. As such, most of the performance tuning of a VAST system is directly related to the tuning and configuration of the file system client.
In this article we are going to focus on how to tune Linux NFS clients although other platforms will have similar options. We focus here on client behaviors and assume that the VAST system itself has sufficient total throughput and IOPS. VAST itself is linearly scalable and can support as much throughput and IOPS as needed by adding more storage enclosures (D-boxes) and/or protocol servers (C-nodes).
As you consider performance tuning of the NFS mounts, it is important to keep two things in mind. First, in most general purpose cases with a mixed workload, no tuning is required. Secondly, most NFS tuning choices result in tradeoffs. That means that tuning mounts for one application could easily make another run slower. If this type of precise tuning is deemed necessary for some applications, consider creating separate mounts for those applications to avoid affecting other applications.
When using VAST the most common areas of performance tuning are the following:
- Client to C-node balancing
- Per client NFS throughput limitations
- NFS attribute caching
- NFS write buffering
- NFS Read Ahead and Read Size
Important Note:
Ubuntu 20.04 and RHEL 8.3 introduced a kernel default change which reduces NFS readahead from 15MB to 128KB. This results in much lower sequential read performance. Please see the end of this doc for a script to check and set it (per mount point), or you can automate this with udev rules.
Client to C-node Balancing
Refer to Client to Protocol Server (C-node) Balancing for an explanation of how to balance client traffic to C-nodes.
Per Client NFS Throughput Limitations
NFS internally uses a standard TCP socket to connect from the client host to VAST. The TCP protocol stack is rather complicated and involves movement of data between kernel space and user space, which introduces overhead. As a result, NFS client traffic from a single host over a single NFS mount is typically bottlenecked well below what a high speed network link can support. The maximum throughput is environment specific, but we typically see a maximum of around 2GB/sec read and 1GB/sec write for a single mount. If the network links are nowhere near saturation and more throughput is desired from a single client system, there are ways to improve it, since the bottleneck is not the operating system or CPU - it's the protocol stack.
There are two ways to break this bottleneck: additional NFS mounts and NFS over RDMA (NFSoRDMA).
Additional NFS Mounts
Typically the primary bottleneck in the NFS client is the TCP socket, not the client itself. Thus one way to improve throughput from a single client is simply to increase the number of mount points to VAST. VAST's test team uses this technique frequently internally for throughput testing to simulate more clients without needing to deploy a large number of Linux systems.
It's easy enough to create multiple mounts to VAST, and of course you'll need to consider the Client to C-node Balancing mentioned earlier. The trickier part is getting your clients to use multiple mounts. There is no simple answer to that, but here are some techniques that VAST customers use (see the example mounts after this list):
- Each application uses its own mount point. That way if multiple applications are deployed to the same node, you end up implicitly with multiple mounts.
- Containerized applications that use a storage plugin, such as VAST's VAST with Kubernetes plugin, each get their own unique mount point, resulting in multiple mounts per host.
- Applications can be developed to support multiple paths to directories for higher performance. This is actually fairly common. Several popular load generation tools support this. For example:
- Elbencho can take multiple directories as parameters and has a parameter to create a number of subdirs per thread:
elbencho /mnt/vast/mydir{1..4} -n NUM_SUBDIRS ...
- The FIO directory parameter can take multiple directories, for example:
fio ... --directory /mnt/vast1/fiodir:/mnt/vast2/fiodir:/mnt/vast3/fiodir:/mnt/vast4/fiodir
- MDTEST when launched via the mpi framework can be used with multiple directories to spread load, for example:
mpirun ... -d /v/mount1/1@/v/mount2/2@/v/mount3/3@/v/mount4/4
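As a simple illustration, here is a hypothetical set of mounts to four different VIPs, so each mount gets its own TCP connection. The addresses and view path are placeholders for your environment:
#Hypothetical example: mount the same VAST view four times, once per VIP
sudo mkdir -p /mnt/vast{1..4}
for i in 1 2 3 4; do
  sudo mount -t nfs -o vers=3 172.16.0.$i:/myview /mnt/vast$i
done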
NFSoRDMA
RDMA is short for Remote Direct Memory Access, which essentially means that most of the TCP protocol stack is bypassed and memory is transferred "directly" between client and server. The result is that RDMA in general, and NFSoRDMA in particular, can provide dramatically higher throughput. In our testing we find that throughput gets much closer to the theoretical network limits - for example, 8GB/sec read over a 100GigE network link is achievable. Write throughput also improves, although there are additional bottlenecks - for example, using a 50GigE link, read throughput is about 4GB/sec while write throughput is about 2GB/sec.
NFSoRDMA is a standard part of Linux although we generally recommend the use of the Mellanox NFSoRDMA support as described in Configuring Linux Server Machines as NFSoRDMA Clients.
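As a sketch, an NFSoRDMA mount typically selects the RDMA transport and uses port 20049, the conventional NFSoRDMA port. The server address and view below are placeholders:
#Hypothetical example: mount over RDMA instead of TCP
sudo mount -t nfs -o vers=3,proto=rdma,port=20049 172.16.0.1:/myview /mnt/vast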
VAST Multipath and nconnect
VAST has backported the nconnect feature to Linux clients starting with RHEL 7.5, Ubuntu 18.04, and SLES 12. We have also enhanced the NFS client with "Multipath", which increases the number of NFS connections not just to one VIP but to many, and also supports multiple client-local interfaces. It can easily saturate network interfaces and works with or without RDMA. Contact VAST support for the client driver.
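For reference, here is a hedged example using the stock nconnect mount option (the VIP and view are placeholders; the VAST multipath driver adds options beyond this, documented with the driver itself):
#Hypothetical example: open 8 TCP connections for a single mount
sudo mount -t nfs -o vers=3,nconnect=8 172.16.0.1:/myview /mnt/vast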
NFS Client Attribute Caching
NFS clients cache attributes that they have read from servers. This means that operations that check attribute values (file size, ownership, modification time, etc.), such as stat calls (used by 'ls -l' for example), can return out-of-date information. Typically, well designed applications are insensitive to these types of issues, but some applications require a high degree of synchronization between clients in terms of their common view of the file system. In that case, the standard NFS attribute caching behaviors can be tuned.
Examples of the Problem
Status File Coordination
One example of an application that may require NFS tuning is one that creates lock or status files which various clients then stat at regular intervals. If the directory cache lifetime is too long, the client may time out before the directory cache expires and the new file becomes visible. Consider the example of a client that checks every 10 seconds for a file and then times out after 30 seconds. By default the directory cache is held for 60 seconds. The following could happen:
- client 1 checks the directory, causing the directory to be cached by the operating system
- client 1 notices that the file is not there
- some other client (perhaps client 2) on another host creates the file
- client 1 waits 10 seconds and tries again; the cache still shows no file
- client 1 repeats this many times before the file shows up; if the timeout is 30 seconds, the file will not be discovered in time
Another scenario that is similarly impacted is when a coordination file is deleted by one client, and then another client looks for that file. A file stat will say the file exists (the directory is cached), but if the client goes to read the file, it will fail since the file has actually been deleted. Most likely the client will see an ESTALE (NFS Stale File Handle) error.
Both of these scenarios can be prevented, thus avoiding the costly actimeo, acdirmax, and acregmax mount options described later in this doc. By simply creating, locking, unlocking, and removing a file (with any name) in the directory you care about, you will flush the attribute cache for that directory, and the next stat of your control file will come from the file server. C code example here: dirsync_nfs.c. The same trick can be implemented in Perl, Python, Tcl, or bash, using flock or fcntl locking.
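For illustration, here is a rough bash sketch of the same idea (the directory path is a placeholder; the linked C example is the canonical version):
#Rough sketch: create, lock, unlock, and remove a scratch file to flush
#the attribute cache for the directory holding the control file.
DIR=/mnt/vast/statusdir   #placeholder: directory being watched
TMP="$DIR/.flush.$$"
touch "$TMP"
flock "$TMP" true         #take and immediately release an advisory lock
rm -f "$TMP"
#the next stat of the control file in $DIR comes fresh from the server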
File Watching
Some tools, such as 'tail -f', look at file attributes to determine if a file has changed. They might look at size or time. By default, NFS clients also cache file attributes, so tools like this may exhibit odd delays and lower performance if not designed properly for a clustered environment. Here again, NFS client attribute caching can be tuned. Just keep in mind that the content of a write can also be buffered, as discussed in the NFS Client Write Buffering section.
Tuning Attribute Caching
It is best to change applications to work better with a distributed file system, but we can also tune the NFS caching behaviors as described in this section. Just keep in mind that reducing caching (which is a mount level setting) can have negative impacts on client performance. Since VAST is an all flash storage system with low latency, the impact will be far less than with a traditional system, but there is still impact. It's best to tune and test these values to find what works best for your environment.
The full reference on NFS attribute caching can be found in the standard Linux documentation, for example: https://linux.die.net/man/5/nfs. Here we highlight the more interesting parameters used by our customers.
We won't cover it here, but these are standard mount options that can be added to the mount command used to mount the VAST file system on each client host.
Directory Attributes
acdirmin and acdirmax control how long directory attributes are cached. The defaults are typically 30 and 60 seconds respectively. If clients depend on file creation and deletion showing up quickly on other nodes, these values can be reduced.
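For example, a hypothetical mount that caps directory attribute caching at 5 seconds (server address and view are placeholders):
sudo mount -t nfs -o vers=3,acdirmin=1,acdirmax=5 172.16.0.1:/myview /mnt/vast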
File Existence Caching
One special case of directory attribute caching is file non-existence caching. The NFS client by default maintains a "negative entry cache," introduced because some applications try to access non-existing files again and again. This means that once an application checks whether a file exists, future checks will continue to report that it doesn't exist until the directory cache entry expires, even if the file has since been created!
Consider this example:
node01$ cat /mnt/nfs/myfile
cat: /mnt/nfs/myfile: No such file or directory
#on a different node create the file
node02$ echo "Hello World" > /mnt/nfs/myfile
#back to original node
node01$ cat /mnt/nfs/myfile
cat: /mnt/nfs/myfile: No such file or directory
What just happened? We checked whether a file existed on the shared NFS mount on node01. It did not - so far so good. Then we created it on node02 and tried to open it again on node01, which again reported that the file does not exist even though it does. You probably already guessed the reason: the negative entry cache of node01 reported that the file does not exist.
To avoid this problem, simply use the "-o lookupcache=positive" mount option for NFS, so that the NFS client only caches existing files.
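A full mount line using this option might look like the following (server address and view are placeholders):
sudo mount -t nfs -o vers=3,lookupcache=positive 172.16.0.1:/myview /mnt/vast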
Just keep in mind that this means every check for the existence of a file will trigger a call to the NFS server. If the applications are badly behaved, with rapid checks, that could be bad as well. If that happens, tuning the acdirmin/acdirmax values may be more sensible. Or better yet, use the dirsync_nfs.c approach described above.
File Attributes
acregmin and acregmax control how long attributes for a file are cached. The defaults are typically 3 and 60 seconds respectively. If clients that check file contents exhibit odd delays (for example 'tail -f'), this caching may be impacting them. To make file attributes (modification time, size, etc.) visible more quickly, reduce both of these values, although this puts more metadata load on the NFS servers. It is much more beneficial to identify the section of application code which needs these "realtime stat" functions and code around the issue with the solution described earlier, which flushes the inode cache for the current directory so that subsequent stats reflect the most recent change on the file server.
noac is NOT recommended, but can be used to completely disable file attribute caching. This obviously ensures maximum synchronization but can have severe performance impacts since even simple commands such as 'ls -l' will be forced to make many round trip calls to get file information. noac also implicitly disables any write buffering in the clients.
actimeo controls the timeout values for directory and file attributes. Setting actimeo sets acdirmin, acdirmax, acregmin, and acregmax to the same value in seconds.
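For example, a hypothetical mount that caps all four attribute-cache timers at 3 seconds (server address and view are placeholders):
sudo mount -t nfs -o vers=3,actimeo=3 172.16.0.1:/myview /mnt/vast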
NFS Client Write Buffering
Unlike many NFS accessible file systems, VAST does not buffer. All received writes are sent immediately to persistent storage - specifically Optane. One interesting side effect of this is that the NFS hint for synchronous writes to stable storage has no meaning. VAST writes are always stable.
Even though VAST does not buffer, that does not mean that there is no buffering. The client operating system may well buffer writes. Most Linux systems and applications enjoy buffered IO via the standard Linux VFS buffer cache. NFS writes are buffered and gathered to combine small IOs into larger IOs. This generally improves performance by reducing network round trips and reducing the number of small writes to disk.
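To see the difference, an application can opt out of client-side buffering on a per-file basis with direct IO. Here is an illustrative comparison using dd (the paths are placeholders):
#Buffered write: lands in the client page cache and is flushed to VAST later
dd if=/dev/zero of=/mnt/vast/buffered.dat bs=1M count=1024
#Direct write: oflag=direct bypasses the client page cache entirely
dd if=/dev/zero of=/mnt/vast/direct.dat bs=1M count=1024 oflag=direct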
Buffering Has a Downside
Sometimes client side buffering can cause problems. In addition to the obvious risk of data loss should the client system fail, buffering introduces additional issues:
- write buffering makes cross client parallel file operations difficult to keep consistent - one client can write data and another can't yet read it because the data is still buffered. Applications that are used to parallel file systems, with their consistent cross client semantics, may want to disable buffering, or change their code to deal with the asynchronous semantics of NFS caching.
- write buffering will also confuse clients and applications when files or data are being accessed (read or write) from other clients. Application authors can implement techniques such as NFS advisory locking to prevent concurrent updates, and/or to flush an inode cache to enable accurate stat results of a file.
File Attribute Caching and High Latency
The downsides of write caching and how it interacts with attribute caching require a bit of explanation. In Linux, when a stat request is made against a file or directory, the operating system ensures it returns current and accurate information - this includes modification times and sizes. The operating system relies on getting this information from the NFS server. The problem is that if the operating system has buffered writes to the files or directories in question, it will hold the stat call until the current write buffers for those file system objects are pushed to the NFS server. And since write ordering must be preserved, the system will block the stat call until all current buffered writes from this client have been pushed.
If the amount of buffering is large, the amount of pending IO is large, and the VAST cluster is overloaded, it can take quite a long time to push all of the pending writes. This can result in very long stat times (impacting tools such as 'ls -l'). While unlikely, an 'ls -l' of a directory with many files being actively written can take seconds or even minutes while the buffer is pushed. You can see this happening if you collect network traces: the NFS RPCs for stat will not be sent over the network until long after the stat operation was initiated by the client.
What's really odd about this behavior is that it impacts only the clients doing the writing, since only those clients know about the pending writes. Another client executing the stat call will get a response quickly (with stale information).
A possible workaround is here: ls -l from a different VIP
This behavior, while probably surprising, is not a bug in VAST or Linux. It is by design in Linux. Here are a few helpful references:
- https://access.redhat.com/solutions/21581
- https://access.redhat.com/articles/2252871
- https://access.redhat.com/solutions/2249321 (a kernel fix that makes ls -l handling more intelligent when there is a lot of pending buffered IO)
- https://bugzilla.redhat.com/show_bug.cgi?id=469848
- http://www.pocketnix.org/posts/Bursty%20NFS%20Transfers
- https://lwn.net/Articles/682582/
The way to prevent these long hangs under very high load is to reduce NFS write buffering.
Operating System Buffer Tuning
You might consider reducing the foreground or background dirty page buffering. We aren't going to go into detail on these items, but know that you can set the following Linux kernel parameters as documented in this sysctl document:
- dirty_background_bytes or dirty_background_ratio - the ratio is the amount of dirty bytes allowed as a percentage of memory, while bytes is a specific value. If one is set, the other is ignored and implicitly set to 0.
- dirty_bytes or dirty_ratio - same as above
Reducing the values here from the default values on your system will reduce the amount of pending IO that can build up before the operating system flushes data to VAST. There is no hard and fast recommendation. It's best to start with the current values and your workload and adjust them. Just remember that there are tradeoffs. Some types of IO will get faster, others will get slower.
Here are example settings that we've used successfully for some benchmarking - the best values for you will require your testing:
#limit background buffering to 3% of available memory.
#obviously the best ratio depends heavily on the amount of memory available. In our case we had 128GB
#and we found that a 4GB buffer worked better than the default of 10% (about 13GB).
sudo sysctl vm.dirty_background_ratio=3
#limit foreground bytes buffered to 512MB
sudo sysctl vm.dirty_bytes=536870912
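These sysctl settings do not survive a reboot. One way to make them persistent is a sysctl drop-in file (the file name below is arbitrary):
#persist the settings across reboots
printf 'vm.dirty_background_ratio = 3\nvm.dirty_bytes = 536870912\n' | sudo tee /etc/sysctl.d/90-nfs-dirty.conf
sudo sysctl --system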
NFS Read Ahead / Prefetch / Read Size
NFS has a few optimizations to improve read performance that relate to anticipating what will be read. NFS has a maximum read size (rsize in the NFS mount options, which VAST sets to 1MB by default) and a prefetch or readahead size, whose default varies depending on the kernel and version - we've seen it as low as 128KB and as high as 15MB. High values give better performance if the applications ultimately read most of the prefetched data. On the other hand, low values work better if the client lacks the needed buffering, or the IO is highly random and prefetched data is not likely to be read. You need to understand your workloads to properly tune the prefetch behavior.
Important note: RHEL 8.3 and Ubuntu 20.04 default to a much smaller NFS readahead, which limits performance. More details here: https://access.redhat.com/solutions/5953561
Below is a script to check and set NFS readahead. All cases are different, but 15MB (the old default) is much better than 128KB (the new default) for sequential reads. This can be set automatically with udev rules.
Prefetch is too Large
A sure sign that unnecessary prefetching is occurring is a mismatch between the throughput reported by the client application and the throughput reported by VAST. For example, if IOR is being run and it reports a throughput much lower than the VAST reported throughput, that indicates that the NFS client is fetching data not needed by the client application (IOR in this example). Another example might be an application that reads only a few bytes of a file and then looks at a different file.
If you suspect NFS prefetch or the read size is excessive, they can be reduced. The read size (rsize) is an NFS mount option which VAST defaults to 1MB, but you can set a smaller value if that's appropriate for your application.
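For example, a hypothetical mount that lowers rsize to 256KB (server address and view are placeholders):
sudo mount -t nfs -o vers=3,rsize=262144 172.16.0.1:/myview /mnt/vast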
Keep in mind that reducing the prefetch will help for Random IO workloads but it will hurt sequential IO workloads.
Prefetch is too Small
Of course, you can have the opposite problem. If you notice your applications are stalled waiting for IO (even flash over fast networks isn't infinitely fast) and you know their IO patterns are sequential, increasing the prefetch value can be beneficial. Since newer Linux versions have reduced the default prefetch size to 128KB, we've found that tuning prefetch to a larger value is often very useful.
Tuning Prefetch
Prefetch can also be tuned. There is a system tunable that changes the amount of NFS read ahead. This is well explained in Tuning NFS Client Read Ahead. That article is for SUSE but it applies to CentOS as of this writing. Here is an example script that supports changing the tunable values as well as printing the current value.
#!/bin/sh
# usage: To display current value: set-ra.sh </mount/point>
#        To set a new value:       set-ra.sh </mount/point> <new_value>
case $# in
  1) cat /sys/class/bdi/0:`stat -c '%d' "$1"`/read_ahead_kb ;;
  2) echo $2 > /sys/class/bdi/0:`stat -c '%d' "$1"`/read_ahead_kb ;;
esac
Changing Prefetch Defaults
How to persistently set read-ahead for NFS mounts using nfs.conf
On RHEL 8.7 and above (nfs-utils 2.6.2 and later), you can use /etc/nfs.conf to set the readahead. See https://man7.org/linux/man-pages/man5/nfsrahead.5.html.
[nfsrahead]
nfs=15000      # readahead of 15000 for NFSv3 mounts
nfs4=16000     # readahead of 16000 for NFSv4 mounts
default=128    # default is 128
How to persistently set read-ahead for NFS mounts using udev (for RHEL 8.6, Ubuntu 22.04, and older):
# create /etc/udev/rules.d/99-nfs.rules with the following content:
SUBSYSTEM=="bdi", ACTION=="add", PROGRAM="/bin/awk -v bdi=$kernel 'BEGIN{ret=1} {if ($4 == bdi) {ret=0}} END{exit ret}' /proc/fs/nfsfs/volumes", ATTR{read_ahead_kb}="15380"
# apply the udev rule:
udevadm control --reload