Problem:
An alert indicates that one or more nodes do not have enough space left on one of the local filesystems.
The local filesystems that are commonly referenced are:
- /vast
- /vast/data
- /userdata
An example of the alert typically received:
[2021-05-27 23:59:36,784: WARNING/ForkPoolWorker-106684/11277] Alarm (CRITICAL): LOGDOCKER - dnode-101 (10.100.100.100) [dnode001] - 2021-05-27T23:59:12.125913+00:00 ALERT[P9:E69:S25:F15 time="2021-05-27 23:59:12.125372629"]: trace dumper env_id=74: not enough space left on device /vast/data/traces/env (available 4280463360)
Other common error messages may include:
- Not enough space left on device /vast/data/traces/env
- /userdata partition available space dropped below 15G
We recommend running the procedures in the Solution section below to free up space.
Summary:
Check to see if the space usage matches what is expected:
#check "/" partition, validate we have at least 10% free:
clush -a 'df -h / |grep -v Avail' |sort -h -k 5
#check "/userdata" partition, validate we have at least 15 GB available
clush -a 'df -h /userdata |grep -v Avail' |sort -h -k 5
# check "/vast" partition, validate we have at least 10% free:
clush -a 'df -h /vast |grep -v Avail' |sort -h -k 5
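The checks above can also be expressed as a pass/fail test on a single node. The following is a minimal sketch, not part of the documented procedure: the function names `check_pct_free` and `check_gb_avail` are hypothetical, the thresholds (10% free, 15 GB available) are taken from the comments above, and 1 GB is assumed to mean 1024 * 1024 KB. It relies only on the POSIX `df -P` output format.

```shell
# Hypothetical helpers; the names are illustrative and the thresholds
# (10% free, 15 GB available) come from the checks above.

# Succeeds if at least $2 percent of the filesystem holding $1 is free.
# Uses the rounded "Use%" column, so the result is approximate.
check_pct_free() {
    used_pct=$(df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }')
    case "$used_pct" in
        ''|*[!0-9]*) echo "UNKNOWN $1: could not read usage"; return 2 ;;
    esac
    free_pct=$((100 - used_pct))
    if [ "$free_pct" -ge "$2" ]; then
        echo "OK $1: ${free_pct}% free"
    else
        echo "LOW $1: ${free_pct}% free (want >= $2%)"
        return 1
    fi
}

# Succeeds if at least $2 GB are available on the filesystem holding $1.
check_gb_avail() {
    avail_kb=$(df -Pk "$1" | awk 'NR==2 { print $4 }')
    case "$avail_kb" in
        ''|*[!0-9]*) echo "UNKNOWN $1: could not read free space"; return 2 ;;
    esac
    min_kb=$(($2 * 1024 * 1024))
    if [ "$avail_kb" -ge "$min_kb" ]; then
        echo "OK $1: $((avail_kb / 1024 / 1024)) GB available"
    else
        echo "LOW $1: $((avail_kb / 1024 / 1024)) GB available (want >= $2 GB)"
        return 1
    fi
}

# Example calls on a node (commented out because the paths are node-specific):
# check_pct_free / 10
# check_gb_avail /userdata 15
# check_pct_free /vast 10
```

A non-zero exit status from either function indicates that the partition is below its threshold, or that the usage could not be read.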
Solution:
To address the alarm, we suggest cleaning up old trace and metrics data that may no longer be required.
To remove files older than a given number of days (here, 30 days):
clush -g dnodes 'sudo find /vast/data/metrics/ -type f -mtime +30 -delete'
clush -g dnodes 'sudo find /vast/data/traces/env/ -type f -mtime +30 -delete'
To remove files last modified at or before a specific date and time (e.g. 2022-11-22 at 11:00:00):
clush -g dnodes 'sudo find /vast/data/metrics/ -type f -not -newermt "2022-11-22 11:00:00" -delete'
clush -g dnodes 'sudo find /vast/data/traces/env/ -type f -not -newermt "2022-11-22 11:00:00" -delete'
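Because `-delete` cannot be undone, it can help to preview what a cleanup would remove before running it. The sketch below is a suggestion rather than a documented step: it applies the same `find` tests as the 30-day commands above, but replaces the `-delete` action with a size listing and reports the file count and total size. The function name `preview_older_than` is hypothetical.

```shell
# Hypothetical helper: matches the same files as
#   find DIR -type f -mtime +DAYS -delete
# but only reports what would be removed. Nothing is deleted.
preview_older_than() {
    # $1 = directory to scan, $2 = age in days
    find "$1" -type f -mtime +"$2" -exec du -k {} + |
        awk '{ kb += $1; n++ } END { printf "%d files, %d KB\n", n, kb }'
}

# Example call on a node (commented out because the path is node-specific):
# preview_older_than /vast/data/traces/env/ 30
#
# A simpler cluster-wide count, using only the find tests shown above:
# clush -g dnodes 'sudo find /vast/data/traces/env/ -type f -mtime +30 | wc -l'
```

If the reported count or size looks larger than expected, adjust the age before running the `-delete` variant.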
An example workflow of clearing out space on the DNode and addressing these low space alerts would look like this:
- Start by checking the following to assess the amount of space:
clush -g dnodes "df -h /vast | grep -v Mounted"
clush -g dnodes "df -h /userdata | grep -v Mounted"
- If the problem is in /userdata, the cause is generally old bundles or install files, for example:
/userdata/bundles/bundle-xxxxxx
/userdata/release-*
/userdata/bundles/upgrades/*
- Most often, it's the /vast partition. If that's the case, you can start with the following:
find /vast/data/metrics/ -type f -not -newermt '2022-10-01 00:00:00' -delete
(Note: You can adjust the date, but DNodes generally do not need metrics from further back than a week or two.)
- Next, you can run the following commands, one at a time, starting with the oldest data:
clush -g dnodes "rm /vast/data/traces/env/2020*"
clush -g dnodes "rm /vast/data/traces/env/2021*"
clush -g dnodes "rm /vast/data/traces/env/20220*"
clush -g dnodes "rm /vast/data/traces/env/202210*"
clush -g dnodes "rm /vast/data/traces/env/202211*"
clush -g dnodes "rm /vast/data/traces/env/202212*"
clush -g dnodes "rm /vast/data/traces/env/202301*"
- Run the following check between each of the commands above:
clush -g dnodes "df -h /vast | grep -v Mounted"
- When you have enough space, you can stop the process.
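The delete, check, and stop loop described in the workflow above can be sketched for a single node as follows. This is an illustration under stated assumptions, not a supported tool: the function names `avail_kb` and `cleanup_until` and the `DRY_RUN` variable are hypothetical, the target is given in KB, and the sketch uses `find -name` instead of `rm` with a shell glob so that very large directories do not hit the argument-length limit. By default it only prints what it would remove.

```shell
# Hypothetical sketch of the workflow above for one node.
# Defaults to a dry run; set DRY_RUN=0 to actually delete.

# Prints the available space, in KB, on the filesystem holding $1.
avail_kb() {
    df -Pk "$1" | awk 'NR==2 { print $4 }'
}

# $1 = mount point to watch, $2 = target available space in KB,
# $3 = directory to clean, remaining args = file name prefixes, oldest first.
cleanup_until() {
    mount=$1; target_kb=$2; dir=$3
    shift 3
    # Refuse to continue if free space cannot be read.
    if [ -z "$(avail_kb "$mount")" ]; then
        echo "cannot read free space for $mount" >&2
        return 2
    fi
    for prefix in "$@"; do
        # Check before each removal and stop as soon as there is enough space.
        if [ "$(avail_kb "$mount")" -ge "$target_kb" ]; then
            echo "enough space on $mount, stopping"
            return 0
        fi
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "would remove $dir/$prefix*"
        else
            find "$dir" -maxdepth 1 -type f -name "$prefix*" -delete
        fi
    done
    # Final check after the last prefix; non-zero status means still short.
    [ "$(avail_kb "$mount")" -ge "$target_kb" ]
}

# Example call on a node (commented out because the paths are node-specific);
# the target here is an illustrative 50 GB expressed in KB:
# cleanup_until /vast 52428800 /vast/data/traces/env 2020 2021 20220
```

Listing the prefixes oldest first mirrors the order of the `rm` commands above, so the most recent data is only touched if the older data did not free enough space.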
Please engage Support if you find you need to delete any trace or metric data less than 30 days old. Please also check with Support before deleting any files or folders in the /userdata directory.
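The name-prefix removals in the example workflow do not look at file age, so one way to respect the 30-day limit is to count recently modified files under a prefix before deleting it. The following is a minimal sketch; the function name `count_recent` is hypothetical.

```shell
# Hypothetical helper: counts regular files directly under a directory whose
# names start with a prefix and that were modified less than N days ago.
count_recent() {
    # $1 = directory, $2 = file name prefix, $3 = age limit in days
    find "$1" -maxdepth 1 -type f -name "$2*" -mtime -"$3" | wc -l | tr -d ' '
}

# Example call on a node (commented out because the path is node-specific):
# count_recent /vast/data/traces/env 202301 30
```

A result other than 0 means the prefix still covers data younger than the limit, which is the case where Support should be engaged first.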