The Vast Probe is software that scans existing datasets and provides insight into the data reduction that would apply to those datasets if they were ingested onto a VAST cluster. To learn how the probe works in detail, see Running the Probe. To run the probe quickly, follow the instructions in Vast Probe Quickstart.
The purpose of this document is to help you understand the probe hardware and software requirements. These are intended for customers that are running the probe on their existing systems. Vast also provides the option of shipping a pre-built server which is optimized for the Probe.
Very briefly: the software-only probe ships as a Docker image, so the only hard requirements for running it are Docker and a file system to be scanned. We do, however, recommend installing python3, which enables the easy-to-use probe launcher described in Vast Probe Quickstart.
The actual hardware requirements depend on the amount of data to be scanned. However, we strongly recommend the following minimums for a system that is running the probe. This will enable you to scan reasonable amounts of data reasonably quickly.
- At least 16 CPU cores
- 128GB or more RAM
- 10GbE or faster networking
- DAS or NVMe-attached SSD(s), or more RAM
- Space for the database the probe builds. This can be in memory or on disk, so the requirement depends on your configuration; the examples that follow walk through it.
  - The space must be equivalent to 0.6% of the data to be scanned
  - Disk storage (if used) must sustain very high IOPS
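The core and RAM minimums above can be checked before launching the probe. The following is a minimal sketch, not part of the probe itself: the function names are hypothetical, "128GB" is read as decimal gigabytes (an assumption), and the `sysconf` names used to read physical RAM are Linux-specific.

```python
import os

MIN_CORES = 16
MIN_RAM_BYTES = 128 * 10**9  # assumption: "128GB" read as decimal gigabytes


def check_host(cores, ram_bytes):
    """Return a list of shortfalls; an empty list means the minimums are met."""
    problems = []
    if cores < MIN_CORES:
        problems.append(f"only {cores} CPU cores, {MIN_CORES} recommended")
    if ram_bytes < MIN_RAM_BYTES:
        problems.append(f"only {ram_bytes / 10**9:.0f}GB RAM, 128GB recommended")
    return problems


def detect_host():
    """Read the core count and physical RAM from the OS (Linux sysconf names)."""
    cores = os.cpu_count() or 0
    ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    return cores, ram_bytes
```

On a Linux host, `check_host(*detect_host())` returns the list of shortfalls for the current machine. Networking and storage speed are not covered by this sketch.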
Let's say you have a server with 768GB of RAM:
- 154GB (20%) is reserved for the operating system (leaving 614GB)
- 50 bytes are taken per file name. If you have 100 million files, that will occupy ~5GB of RAM (leaving 609GB)
- The remaining 609GB is available for the RAM index (--ram-index-size-gb). Based on the 0.6% rule (to accommodate similarity and dedup hashes), you can scan up to 99TB of data using just RAM.
- Using a disk index, you can scan far more data, and the file count could exceed 10 billion (500GB file name cache)
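The arithmetic in the example above can be sketched as a small function. This is an illustration under stated assumptions, not the probe's own code: the function name is hypothetical, the file name cache is converted to GB using decimal units, and the final GB-to-TB step divides by 1024 and rounds down, which is inferred from the published figures rather than stated in the text.

```python
OS_RESERVE = 0.20        # fraction of RAM left to the operating system
BYTES_PER_FILENAME = 50  # RAM taken by the file name cache, per file
INDEX_FRACTION = 0.006   # the 0.6% rule


def ram_only_capacity_tb(ram_gb, n_files):
    """Estimate how many TB can be scanned using a RAM index only."""
    usable_gb = ram_gb * (1 - OS_RESERVE)
    filename_cache_gb = n_files * BYTES_PER_FILENAME / 1e9
    index_gb = max(0.0, usable_gb - filename_cache_gb)
    scannable_gb = index_gb / INDEX_FRACTION
    # Assumption: 1024 GB per TB, rounded down, to match the figures in the text.
    return int(scannable_gb / 1024)
```

With 768GB of RAM and 100 million files this gives 99; the same function with 128GB of RAM gives 15.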
Let's say you have a server with 128GB of RAM and plenty of local SSD:
- 26GB (20%) is reserved for the operating system (leaving 102GB)
- 50 bytes are taken per file name. If you have 100 million files, that will occupy ~5GB of RAM (leaving 97GB)
- The remaining 97GB is available for the RAM index (--ram-index-size-gb). Based on the 0.6% rule (to accommodate similarity and dedup hashes), you can scan up to 15TB of data using just RAM.
- Using a disk index, you can scan far more data, and the file count could be as high as 2 billion (100GB file name cache)
- 15TB of data requires 90GB of fast disk
- 100TB of data requires 600GB of fast disk
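The disk index and file name cache figures above follow from the same two constants. A minimal sketch, with hypothetical function names and assuming decimal units (1TB = 1000GB), which is what the 90GB and 600GB figures imply:

```python
INDEX_FRACTION = 0.006   # the 0.6% rule
BYTES_PER_FILENAME = 50  # RAM taken by the file name cache, per file


def disk_index_gb(data_tb):
    """Fast disk (in GB) needed to index data_tb terabytes of data."""
    return data_tb * 1000 * INDEX_FRACTION


def max_files_for_cache(cache_gb):
    """Number of file names that fit in a file name cache of cache_gb gigabytes."""
    return cache_gb * 10**9 // BYTES_PER_FILENAME
```

For example, `disk_index_gb(15)` is roughly 90 and `disk_index_gb(100)` is roughly 600, while `max_files_for_cache(500)` and `max_files_for_cache(100)` give 10 billion and 2 billion files respectively.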
Here's pseudo-code which helps to explain how these calculations are done:
    available_ram_bytes = (avail_b * 0.8) - (n_files * 50)
    ram_index_size = args.ram_index_size_gb * GB
    disk_index_size = args.disk_index_size_gb * GB
    if disk_index_size == 0:
        if ram_index_size == 0 and available_ram_bytes > index_size:
            ram_index_size = index_size
        if ram_index_size == 0:
            disk_index_size = index_size
    if 0 < ram_index_size < GB:
        ram_index_size = GB
    if 0 < disk_index_size < GB:
        disk_index_size = GB
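For readers who want to experiment with the logic, the pseudo-code can be wrapped in a runnable function. This is a sketch under assumptions, not the probe's actual source: the function name and signature are hypothetical, `index_size` (in bytes) is taken as an input because the pseudo-code does not show how it is derived, `GB` is assumed to be decimal, and the second inner `if` is assumed to be nested under the `disk_index_size == 0` check.

```python
GB = 10**9  # assumption: the source does not say whether GB is decimal or binary


def choose_index_sizes(avail_b, n_files, index_size,
                       ram_index_size_gb=0, disk_index_size_gb=0):
    """Return (ram_index_size, disk_index_size) in bytes."""
    # 80% of RAM, minus 50 bytes per file name, is usable for a RAM index.
    available_ram_bytes = (avail_b * 0.8) - (n_files * 50)
    ram_index_size = ram_index_size_gb * GB
    disk_index_size = disk_index_size_gb * GB
    if disk_index_size == 0:
        # No explicit sizes given: prefer RAM when the whole index fits.
        if ram_index_size == 0 and available_ram_bytes > index_size:
            ram_index_size = index_size
        # Otherwise fall back to a disk index.
        if ram_index_size == 0:
            disk_index_size = index_size
    # Any non-zero index is rounded up to at least 1 GB.
    if 0 < ram_index_size < GB:
        ram_index_size = GB
    if 0 < disk_index_size < GB:
        disk_index_size = GB
    return ram_index_size, disk_index_size
```

Under these assumptions, a 768GB host with 100 million files and a 600GB index keeps the index in RAM, while a 128GB host with the same index falls back to disk.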
OS & Software
We've tested the following, but most modern Linux distributions should be fine:
- Ubuntu 17.10
- CentOS/RHEL 7.4, 7.5, 7.6
Docker version 17.05 or later (older revisions may also work). Running within a Singularity container has also been found to work, although it may require additional testing. On CentOS, the Docker client version is reported like this:
Docker version 1.13.1, build 7f2769b/1.13.1
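If you want to check the installed version from a script, the output of `docker --version` can be parsed as sketched below. The function names are hypothetical, and the regular expression assumes the "Docker version X.Y.Z" output format shown above. Because older revisions may also work, a lower version is better treated as a warning than as a hard failure.

```python
import re
import subprocess


def parse_docker_version(output):
    """Extract (major, minor, patch) from `docker --version` output."""
    match = re.search(r"Docker version (\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError(f"unrecognized docker version output: {output!r}")
    return tuple(int(part) for part in match.groups())


def installed_docker_version():
    """Run `docker --version` and return the parsed version tuple."""
    result = subprocess.run(["docker", "--version"],
                            capture_output=True, text=True, check=True)
    return parse_docker_version(result.stdout)
```

Version tuples compare naturally, so `installed_docker_version() >= (17, 5)` tells you whether the host is at 17.05 or later.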
Ensure that the probe host/IP is given read-only (RO) root access, so that it can read every file. For Lustre/GPFS, ensure that the container can read as the root user.