Score:18

Ubuntu auto delete oldest file in directory when disk is above 90% capacity, repeat until capacity below 80%

bv flag

I have found a few similar cron job scripts, but nothing that does exactly what I need, and I do not know enough about Linux scripting to safely modify code for this sort of job, which could turn disastrous.

Essentially I have ip cameras that record to /home/ben/ftp/surveillance/ but I need to ensure there is always enough space on the disk to do so.

Would someone please be able to guide me on how I can set up a cron job to:

Check if /dev/sbd/ has reached 90% capacity. If so then delete the oldest file in(and files in sub folders) /home/ben/ftp/surveillance/ And repeat this until /dev/sbd/ capacity is below 80% Repeat every 10 minutes.

J... avatar
in flag
What if `/dev/sdb` (note : `sdb` != `sbd` - watch for typos in your scripts!) ever becomes >80% full regardless of the contents of `/home/ben/ftp/surveillance/`? This will always delete all of your recordings, leaving you with nothing. Better to have CCTV cameras recording to a dedicated volume that is not shared with (ie) your operating system or any other users. Ideally the cameras should manage this themselves, being aware of which files are theirs and overwriting their own oldest files when the dedicated drive space becomes full.
bv flag
Good point J... My ftp directory is actually an external drive mounted in my home dir and is now dedicated to just cam recordings. I haven't bothered to change the folder structure it had before becoming a dedicated disk. The cameras also upload a .jpg still of every recording, so I will set up another cron to delete all .jpg files periodically.
Score:33
in flag

Writing these kinds of scripts for people always makes me nervous because, in the event anything goes wrong, one of three things will happen:

  1. I'll kick myself for what's probably a n00b-level typo
  2. Death threats will come my way because someone blindly copy/pasted without:
    • making an effort to understand the script
    • testing the script
    • having a reasonable backup in place
  3. All of the above

So, to reduce the risk of all three, here is a starter kit for you:

#!/bin/sh
DIR=/home/ben/ftp/surveillance
ACT=90
df -k $DIR | grep -vE '^Filesystem' | awk '{ print $5 " " $1 }' | while read output;
do
  echo $output
  usep=$(echo $output | awk '{ print $1}' | cut -d'%' -f1  )
  partition=$(echo $output | awk '{ print $2 }' )
  if [ $usep -ge $ACT ]; then
    echo "Running out of space \"$partition ($usep%)\" on $(hostname) as on $(date)"
    oldfile=$(ls -dltr $DIR/*.gz|awk '{ print $9 }' | head -1)
    echo "Let's Delete \"$oldfile\" ..."
  fi
done

THINGS TO NOTE:

  1. This script deletes nothing

  2. DIR is the directory to work with

  3. ACT is the minimum percentage required to act

  4. Only one file – the oldest – is selected for "deletion"

  5. You will want to replace *.gz with the actual file type of your surveillance videos.
    DO NOT USE *.* OR * BY ITSELF!

  6. If the partition containing DIR is at a capacity greater than ACT, you will see a message like this:

    97% /dev/sda2
    Running out of space "/dev/sda2 (97%)" on ubuntu-vm as on Wed Jan 12 07:52:20 UTC 2022
    Let's Delete "/home/ben/ftp/surveillance/1999-12-31-video.gz" ...
    

    Again, this script will not delete anything.

  7. If you are satisfied with the output, then you can continue to modify the script to delete/move/archive as you see fit

Test often. Test well. And remember: When putting rm in a script, there is no undo.
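As a hedged sketch of how the starter kit above might grow into the full check-and-repeat behaviour — swapping `ls` parsing for `find` so sub-folders are covered — here is one possibility. The paths, thresholds, and `*.gz` pattern are illustrative, and deletion is still only an `echo`:

```shell
#!/bin/sh
# Sketch only: adjust DIR/ACT/GOAL and test thoroughly before replacing echo with rm.
DIR=${1:-.}          # e.g. /home/ben/ftp/surveillance
ACT=90               # usage % that triggers cleanup
GOAL=80              # usage % to reach before stopping
usep() { df --output=pcent "$DIR" | tail -n 1 | tr -dc '0-9'; }
if [ "$(usep)" -ge "$ACT" ]; then
    while [ "$(usep)" -ge "$GOAL" ]; do
        # Oldest file anywhere under DIR, by mtime (GNU find)
        oldfile=$(find "$DIR" -type f -name '*.gz' -printf '%T@ %p\n' \
            | sort -n | head -n 1 | cut -d' ' -f2-)
        [ -n "$oldfile" ] || break     # nothing left to delete
        echo "Would delete: $oldfile"  # replace with: rm -- "$oldfile"
        break                          # drop this line once rm is real
    done
fi
```

The same caveats apply: understand it, test it, and have a backup before any `rm` goes in.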

ao flag
It's silly to pipe `grep` to `awk`, you can just add `!/^Filesystem/` to the beginning of the awk command. Or you can tell `df` to only produce the output you want, and use `sed` to strip the header and percent sign: `usep=$(df --output=pcent $DIR | sed '1d;s/%//')`
Peter Cordes avatar
fr flag
`echo $output` should quote `"$output"` since you don't specifically need it to be word-split. It's probably fine, but even on normal systems, some removable media ends up mounted on paths with spaces in their name which could lead to error messages from this in practice. So rule of thumb, always quote variable expansions unless you specifically want word-splitting or some other effect that quoting blocks.
ua flag
Similar, but using the `firstaction` feature of `logrotate`: https://serverfault.com/questions/372809/free-space-driven-log-rotation-on-linux
Eric Duminil avatar
us flag
@NickMatteo If the corresponding code works fine and is readable, it's silly to argue that it's silly because it could be written differently with a 45-year-old-and-not-very-readable-language. UNIX pipes are awesome and readable, who cares if the command could be written with one or two fewer pipes?
ao flag
@EricDuminil: The somewhat common use of `grep 'PATTERN' | awk '{do stuff to matching lines}'` is objectively silly, since awk's whole purpose is to do stuff to matching lines, and you could have just written `awk '/PATTERN/ {do stuff}'`.
Eric Duminil avatar
us flag
@NickMatteo: You're free to use whatever tools you're familiar with. matigo seems to be happy with `grep ... | awk ...`, you'd be happy with `awk '/PATTERN/ {do stuff}`, and I'd be happy with a short Ruby or Python script. As long as the scripts are working and readable, nobody's wrong, and no command is silly, and surely not objectively silly.
cn flag
@EricDuminil They may not be objectively silly, but there is an argument that a Q&A site should teach people how commands work. Piping `awk` makes it look like the command can't handle it by itself. There's also value in not using multiple commands through pipe and slowing the process down. In short, there are criteria to argue that using `awk` as it is intended to be used is the better Answer.
bv flag
@matigo - I couldn't get `oldfile=$(ls -dltr $DIR/*.mp4|awk '{ print $9 }' | head -1)` to work, but I did get this to work: `oldfile=$(find $DIR -name "*.mp4" -type f | sort | head -n 1)` Is there any reason my method shouldn't be used?
in flag
@Beno – Do you receive an error with the first command? Either way, if `find` will give you what you need, feel free to use that. The script is just a "starter kit" to get you going with some of the preliminary stuff
Sorpigal avatar
za flag
Using the output of ls as input for anything besides human eyeballs is fundamentally broken. Do not do this.
Score:12
kz flag

I would use Python for such a task. It might lead to more code than a pure bash solution, but:

  • it's (IMO) easier to test; just use the pytest or unittest module
  • it's readable for non-Linux people (well, except the get_device function, which is Linux-specific...)
  • it's easier to get started (again, IMO)
  • What if you want to send some emails? Or trigger new actions? Scripts can be enriched easily with a programming language like Python.

Since Python 3.3, the shutil module has come with a function named disk_usage. It can be used to get the disk usage for a given directory.

The minor problem is that I don't know how to easily get the name of the disk, i.e., /dev/sdb, even though it's possible to get its disk usage (using any directory mounted on /dev/sdb, in my case $HOME for example). I wrote a function called get_device for this purpose.
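(As an aside, and assuming GNU coreutils is available: from a shell, `df` can report the backing device directly, which sidesteps the /etc/mtab parsing — a one-liner sketch:)

```shell
# Print the device backing the filesystem that contains a path (GNU df)
df --output=source / | tail -n 1
```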

#!/usr/bin/env python3
import argparse
from os.path import getmtime
from shutil import disk_usage, rmtree
from sys import exit
from pathlib import Path
from typing import Iterator, Tuple


def get_device(path: Path) -> str:
    """Find the mount for a given directory. This is needed only for logging purpose."""
    # Read /etc/mtab to learn about mount points
    mtab_entries = Path("/etc/mtab").read_text().splitlines()
    # Create a dict of mount points and devices
    mount_points = dict([list(reversed(line.split(" ")[:2])) for line in mtab_entries])
    # Find the mount point of given path
    while path.resolve(True).as_posix() not in mount_points:
        path = path.parent
    # Return device associated with mount point
    return mount_points[path.as_posix()]


def get_directory_and_device(path: str) -> Tuple[str, Path]:
    """Exit the process if directory does not exist."""
    fs_path = Path(path)
    # Path must exist
    if not fs_path.exists():
        print(f"ERROR: No such directory: {path}")
        exit(1)
    # And path must be a valid directory
    if not fs_path.is_dir():
        print(f"Path must be a directory and not a file: {path}")
        exit(1)
    # Get the device
    device = get_device(fs_path)

    return device, fs_path


def get_disk_usage(path: Path) -> float:
    # shutil.disk_usage supports Path-like objects, so no need to cast to string
    usage = disk_usage(path)
    # Get disk usage in percentage
    return usage.used / usage.total * 100


def remove_file_or_directory(path: Path) -> None:
    """Remove given path, which can be a directory or a file."""
    # Remove files
    if path.is_file():
        path.unlink()
    # Recursively delete directory trees
    if path.is_dir():
        rmtree(path)


def find_oldest_files(
    path: Path, pattern: str = "*", threshold: int = 80
) -> Iterator[Path]:
    """Iterate on the files or directories present in a directory which match given pattern."""
    # List the files in the directory received as argument and sort them by age
    files = sorted(path.glob(pattern), key=getmtime)
    # Yield file paths until usage is lower than threshold
    for file in files:
        usage = get_disk_usage(path)
        if usage < threshold:
            break
        yield file


def check_and_clean(
    path: str,
    threshold: int = 80,
    remove: bool = False,
) -> None:
    """Main function"""
    device, fspath = get_directory_and_device(path)
    # Get disk usage as a percentage (get_disk_usage wraps shutil.disk_usage)
    usage = get_disk_usage(fspath)
    # Take action if needed
    if usage > threshold:
        print(
            f"Disk usage is greater than threshold: {usage:.2f}% > {threshold}% ({device})"
        )
    # Iterate over files to remove
    for file in find_oldest_files(fspath, "*", threshold):
        print(f"Removing file {file}")
        if remove:
            remove_file_or_directory(file)


def main() -> None:

    parser = argparse.ArgumentParser(
        description="Purge old files when disk usage is above limit."
    )

    parser.add_argument(
        "path", help="Directory path where files should be purged", type=str
    )
    parser.add_argument(
        "--threshold",
        "-t",
        metavar="T",
        help="Usage threshold in percentage",
        type=int,
        default=80,
    )
    parser.add_argument(
        "--remove",
        "--rm",
        help="Files are not removed unless --removed or --rm option is specified",
        action="store_true",
        default=False,
    )

    args = parser.parse_args()

    check_and_clean(
        args.path,
        threshold=args.threshold,
        remove=args.remove,
    )


if __name__ == "__main__":
    main()

If you need to orchestrate many tasks using CRON, it might be worth putting together some Python code as a library, and reuse this code across many tasks.

EDIT: I finally added the CLI part in the script, I think I'll use it myself
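If scheduled from cron, an entry along these lines would run it every 10 minutes (the script path here is a placeholder; `--threshold` and `--rm` are the flags defined above):

```
*/10 * * * * /usr/bin/python3 /path/to/script.py /home/ben/ftp/surveillance --threshold 80 --rm
```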

qwr avatar
kr flag
fwiw you can send emails from command line.
gcharbon avatar
kz flag
I'm not saying that it cannot be done in CLI; I'm saying that the OP is not familiar with bash, and that he might find it easier to do it in Python.
bv flag
Thanks for the comprehensive post! I do like the more readable approach, but unfortunately I am not so familiar with Python either. You have made me realise I could do this with PHP though, which I am a lot more comfortable with.
Clumsy cat avatar
cn flag
+1 for something easier to test. I'd want unit tests for a script like this.
Score:1
tj flag

Check if /dev/sbd/ has reached 90% capacity. If so then delete the oldest file in(and files in sub folders) /home/ben/ftp/surveillance/ And repeat this until /dev/sbd/ capacity is below 80% Repeat every 10 minutes.

The script below will do exactly that (provided that you add it to your crontab to run in 10 minute intervals). Be extra sure this is what you really want to do, since this could easily erase all files in /home/ben/ftp/surveillance/ if your disk is filling up somewhere outside this directory.

#!/bin/sh
directory='/home/ben/ftp/surveillance'
max_usage=90
goal_usage=80
[ -d "$directory" ] || exit 1
[ "$max_usage" -gt "$goal_usage" ] || exit 1
[ "$( df --output=pcent $directory | \
    grep -Ewo '[0-9]+' )" -ge "$max_usage" ] || exit 0
dev_used="$( df -B 1K --output=used $directory | \
    grep -Ewo '[0-9]+' )"
goal_usage="$( printf "%.0f" \
    $( echo ".01 * $goal_usage * \
    $( df -B 1K --output=size $directory | \
        grep -Ewo '[0-9]+' )" | bc ) )"
echo "$( find $directory -type f -printf '%Ts,%k,\047%p\047\n' )" | \
    sort -k1 | \
        awk -F, -v goal="$(($dev_used-$goal_usage))" '\
            (sum+$2)>goal{printf "%s ",$3; exit} \
            (sum+$2)<=goal{printf "%s ",$3}; {sum+=$2}' | \
                xargs rm

How this script works:

The first 3 lines after the shebang are the variables per your parameters:

  • directory is the full path to the parent directory containing the files and subdirectories from which you want to remove old files (i.e., /home/ben/ftp/surveillance). The quotes around this value are not necessary unless the path contains spaces.
  • max_usage is the percent of disk capacity that will trigger the old file deletion actions (i.e., 90 percent).
  • goal_usage is the percent of disk capacity you want to achieve after deleting old files (i.e., 80 percent).

Note that the values of max_usage and goal_usage must be integers.

[ -d "$directory" ] || exit 1
  • Checks that directory exists, otherwise script ends and exits with status 1.
[ "$max_usage" -gt "$goal_usage" ] || exit 1
  • Checks that max_usage is greater than goal_usage, otherwise script ends and exits with status 1.
[ "$( df --output=pcent $directory | \
    grep -Ewo '[0-9]+' )" -ge "$max_usage" ] || exit 0
  • Gets the current disk capacity percent used and checks if it meets or exceeds the threshold set by max_usage. If not, further processing is not required so the script ends and exits with status 0.
dev_used="$( df -B 1K --output=used $directory | \
    grep -Ewo '[0-9]+' )"
  • Gets the disk capacity currently used, in kilobytes.
goal_usage="$( printf "%.0f" \
    $( echo ".01 * $goal_usage * \
    $( df -B 1K --output=size $directory | \
        grep -Ewo '[0-9]+' )" | bc ) )"
  • Converts the goal_usage variable to kilobytes (we'll need this value further down).
find $directory -type f -printf '%Ts,%k,\047%p\047\n'
  • Locates all files in directory (and in all its subdirectories) and makes a list of these files, one per line, formatted as timestamp, size in kilobytes, 'full/path/to/file'. Note that the 'full/path/to/file' is enclosed in single quotes so spaces in the names of files or directories will not cause problems later.
sort -k1
  • Sorts the previously echo'd list of files by timestamp (oldest first).
awk -F, -v goal="$(($dev_used-$goal_usage))"
  • awk creates an internal variable goal that is equal to the difference between dev_used and goal_usage - and this is the total kilobytes worth of files that must be removed in order to bring the disk capacity percent down to the goal_usage set at the start of the script.
(sum+$2)>goal{printf "%s ",$3; exit} \
(sum+$2)<=goal{printf "%s ",$3}; {sum+=$2}'
  • awk (continued) begins processing the list by keeping a running sum of field 2 values (size in kilobytes) and printing field 3 values ('full/path/to/file') to a space separated string until the sum of kilobytes from field 2 becomes greater than the goal, at which point awk stops processing additional lines.
xargs rm
  • The string of 'full/path/to/file' values from awk is piped to xargs which runs the rm command using the string as its arguments. This removes those files.
DanRan avatar
us flag
@fuzzydrawrings This script is great and essentially what I need. However, I have a 10TB hard drive, and specifying percentages without a 10th or a 100th decimal place means that just 1% is equal to 100GB of space. I don't want to clear that much space on delete. Can I specify max_usage/goal_usage as 90.01 or 90.001? Does max_usage and goal_usage allow for decimal points in the percentage in this script? If not, how can I adjust it?
Score:0
za flag

The existing answers have some problems.

There's one answer written in Python, but you shouldn't write shell scripts in Python (or, if you can help it, in anything other than shell). The Python script does shell-like things more awkwardly (and with much more code). There is a natural language for these operations, and that is shell. For some tasks you should avoid shell code; this is not one of them.

The highest-voted answer uses shell, but it does some things that are between stylistic problems and buggy behavior. Here are a few:

  • Upper case variable names should not be used so as to avoid colliding with any current or potential future POSIX-standardized variable names. Lower case variable names are explicitly reserved for use by applications such as this.
  • echo should not be used because it is not portable; printf is the canonical, portable alternative.
  • The output of ls is for human-eyeball-consumption only. The text it emits is not guaranteed to be equal to the real name of any file. In addition, it is not safe if there may be a crafted filename designed to do harm (harm here could be "delete any file on the system"). If you have GNU ls you can at least use --quoting-style=shell-escape to mitigate this somewhat. It's still better to simply avoid ls
  • Paths with whitespace in them are not handled correctly and won't work.

The other answer written in shell is much better, but still has some problems.

  • Uses /bin/sh but is still not portable, due to the use of non-portable shell commands and switches.
  • Still breaks on paths with whitespace in them.
  • Fails to quote all expansions, which is also vulnerable to crafted filename attacks.

These issues are more minor and could be fixed pretty easily, but I still don't like the approach.

Here is my alternative written in bash which should be safe and reliable.

#!/bin/bash
dir=${1:-/home/ben/ftp/surveillance/}
threshold=90

! [[ -d $dir ]] && {
    printf '%s: not a directory\n' "$dir" 1>&2
    exit 1
}

use_percent=$(df --output=pcent "$dir" | tail -n 1)
if (( ${use_percent%'%'} < threshold )); then
    exit
fi

recent-files () {
    [[ -z $1 ]] || [[ $1 == *[!0-9-]* ]] && return 1
    local number="$1" i=0 rev=(-r)
    shift
    if (( number < 0 )); then
        ((number*=-1))
        rev=()
    fi
    find "${@:-.}" -maxdepth 1 -type f -printf '%T@/%p\0' | \
    sort -zn "${rev[@]}" | cut -z -d/ -f2- | \
    while IFS= read -rd '' file; do
        printf -- '%s\0' "$file"
        if (( ++i == number )); then
            exit
        fi
    done
}

oldest=$(recent-files -1 "$dir")
size="$(stat -c %s "$oldest" | numfmt --to=iec-i)"
{
    printf 'Running out of space for %s\n' "$dir"
    printf 'Removing "%s"; %s freed.\n' "$oldest" "$size"
    rm -f -- "$oldest"
} | logger -s -t surveillance-monitor -p local0.warning

I make use of bash-specific features and assume GNU coreutils. A fully sh compatible version that uses only portable switches is theoretically possible, but it's much more work. Other solutions were in effect already tied to non-portable df switches, among other things, so little of value is lost. I am explicitly assuming GNU/Linux here, where everyone else was implicitly assuming it. If you have GNU/Linux you can write better scripts, and when you can be better you should be better.

The output messages from this are logged via syslog, so you don't need to check mail from crond if you don't want to. In fact I would suggest this cron job:

*/10 * * * * /path/to/this/script 1>/dev/null 2>&1

This assumes support for vixie cron syntax and will run the script every 10 minutes and discard all output. You can check syslog for the results. (The usual caveat about cron job PATH applies, but the default system PATH ought to include every utility I have referenced).

Shellcheck reports no issues for this script, which is something you should always check before running code you find on the internet.

But how does it work?

I'll explain the parts that I think are least obvious to the average observer, but not everything.

The `[[ -z $1 ]] || [[ $1 == *[!0-9-]* ]]` check makes sure that the first argument is a positive or negative number. The function requires such a first argument, so this is just a little safety.

The find ... -printf '%T@/%p\0' part produces a NULL-delimited output, because NULL is the only safe delimiter to use when filenames are involved. Each record produced this way has a prefix that ends in a / which contains the file's mtime described as the Unix epoch in seconds followed by . followed by fractional seconds. E.g. 1679796113.043092340

I chose / as the delimiter arbitrarily; any other delimiter character that is not [0-9\.] would work just as well. You could, and arguably should, also use \0 here.

Using sort -z tells sort to expect input records to have a NULL delimiter and to produce output records with the same delimiter. Using cut -z is the same thing for cut; the other cut switches remove the time portion--once the output is ordered by mtime we stop caring about the actual value.
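A tiny fabricated example of the NUL-delimited pipeline (the epoch prefixes and names are invented, and the trailing `tr` only exists to make the result printable on a terminal):

```shell
# Two "epoch/name" records: sort numerically by epoch, then strip the prefix
printf '%s\0' '1679796200.5/newer' '1679796100.1/older' |
    sort -zn |
    cut -z -d/ -f2- |
    tr '\0' '\n'
# prints: older
#         newer
```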

The while loop is used to limit the number of emitted files to the specified (in this case -1). A positive value gives the N most recent files, a negative value gives the N least recent files--which is what this situation calls for.

The numfmt invocation turns the raw number of bytes produced by stat into a human-readable version (using IEC binary notation). This is unnecessary, but friendly.
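For example (assuming GNU numfmt from coreutils), a raw byte count converts like so:

```shell
# 1048576 bytes is exactly one mebibyte in IEC binary notation
numfmt --to=iec-i 1048576
# prints: 1.0Mi
```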

The logger command may not be as widely known as it deserves, but it makes writing to syslog from scripts trivial.

I think everything else is sufficiently self-explanatory.
