Block device suddenly full; can't identify a single file as culprit and SMART shows no drive errors

Setup

  • Ubuntu 20.04
  • Dell PowerEdge R820
  • [PERC H710] 2x Virtual Drives (RAID-1 Boot, RAID-0 Work Drive)
  • Everything has been fine for 6 months
  • No preceding event; the drive just suddenly filled up

Details...

This machine is used for plotting Chia (cryptocurrency) - it's been working away for months without issue.

I noticed the plotting process (bladebit) had crashed - which is pretty uncommon, maybe once every 2 months - so I went to fire it back up and immediately started getting device-full errors.

I fired off a quick df -h to see what was going on, and got this:

Filesystem          Size  Used Avail Use% Mounted on
udev                252G     0  252G   0% /dev
tmpfs                51G  2.9M   51G   1% /run
/dev/sda2           549G  512G  8.7G  99% /
tmpfs               252G  4.0K  252G   1% /dev/shm
tmpfs               5.0M     0  5.0M   0% /run/lock
tmpfs               252G     0  252G   0% /sys/fs/cgroup
/dev/sda1           511M  5.3M  506M   2% /boot/efi
tmpfs                51G     0   51G   0% /run/user/1000
<... SNIP ...>

/dev/sda2 is the boot drive - it's actually a RAID-1 (2-disk) Virtual Disk handled by the H710 RAID card in the server, but I don't think that's terribly relevant.

NORMALLY this drive is about 3% full; it only has a bootable Ubuntu Server 20.04 install on it and nothing else.

I had to erase the tmp file in root and a few other garbage files to free up enough space to get things functioning again, but it's still sitting at dang near full.

I followed countless "find the biggest file on your server" tips from here and around the web (for example this one). The command `sudo du -a / 2>/dev/null | sort -n -r | head -n 20` returned:

$ sudo du -a / 2>/dev/null | sort -n -r | head -n 20
[sudo] password for user: 
1010830919685   /
1010823681740   /mnt
<...SNIP...>
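
For completeness: du without -x also descends into anything mounted under / (like the work drive at /mnt), so a single-filesystem variant (assuming GNU du's -x / --one-file-system flag) would look like this:

# Stay on the root filesystem only, so drives mounted under /mnt are not counted
$ sudo du -xa / 2>/dev/null | sort -n -r | head -n 20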

OK, so something huge is sitting in / apparently? A simple ls shows nothing of interest in there:

$ ls -lFa /
total 84
drwxr-xr-x   20 root root  4096 Jan 12 17:45 ./
drwxr-xr-x   20 root root  4096 Jan 12 17:45 ../
lrwxrwxrwx    1 root root     7 Aug 24 08:41 bin -> usr/bin/
drwxr-xr-x    4 root root  4096 Jan  6 06:22 boot/
drwxr-xr-x    2 root root  4096 Sep 28 14:04 cdrom/
drwxr-xr-x   21 root root  6920 Jan  5 16:05 dev/
drwxr-xr-x  105 root root  4096 Jan  5 01:54 etc/
drwxr-xr-x    3 root root  4096 Sep 28 14:18 home/
lrwxrwxrwx    1 root root     7 Aug 24 08:41 lib -> usr/lib/
lrwxrwxrwx    1 root root     9 Aug 24 08:41 lib32 -> usr/lib32/
lrwxrwxrwx    1 root root     9 Aug 24 08:41 lib64 -> usr/lib64/
lrwxrwxrwx    1 root root    10 Aug 24 08:41 libx32 -> usr/libx32/
drwx------    2 root root 16384 Sep 28 14:03 lost+found/
drwxr-xr-x    2 root root  4096 Aug 24 08:42 media/
-rw-r--r--    1 root root  6678 Jan  9 00:59 MegaSAS.log
drwxr-xr-x   64 root root  4096 Jan  5 01:48 mnt/
drwxr-xr-x    3 root root  4096 Nov 30 18:14 opt/
dr-xr-xr-x 1356 root root     0 Jan  3 04:40 proc/
drwx------    7 root root  4096 Nov 30 18:07 root/
drwxr-xr-x   34 root root  1100 Jan 12 08:04 run/
lrwxrwxrwx    1 root root     8 Aug 24 08:41 sbin -> usr/sbin/
drwxr-xr-x    9 root root  4096 Sep 28 22:06 snap/
drwxr-xr-x    2 root root  4096 Aug 24 08:42 srv/
dr-xr-xr-x   13 root root     0 Jan  3 04:40 sys/
drwxrwxrwt   13 root root  4096 Jan 12 17:15 tmp/
drwxr-xr-x   15 root root  4096 Aug 24 08:46 usr/
drwxr-xr-x   13 root root  4096 Aug 24 08:47 var/

Using `sudo ncdu -x /` (link) oddly shows nothing interesting either:

    2.4 GiB [##########] /usr
    1.5 GiB [######    ] /var
  732.5 MiB [##        ] /home
  202.8 MiB [          ] /boot
    5.5 MiB [          ] /opt
    5.4 MiB [          ] /etc
    1.9 MiB [          ] /root
  168.0 KiB [          ] /tmp
<...SNIP...>

Where is this ~510 GB of used space sitting?

Firing off a `sudo lsof | grep deleted` to see if there was some giant deleted file still being held open gave me this:

systemd-j    1134                               root   36u      REG                8,2 134217728    5246838 /var/log/journal/771d7f1addf64a7b930191976176149e/system@ae2f8b2397c441f7a286d25144be755f-0000000000315312-0005d4e51ab8f8e9.journal (deleted)
unattende    3932                               root    3w      REG                8,2       113    5246631 /var/log/unattended-upgrades/unattended-upgrades-shutdown.log.1 (deleted)
unattende    3932    3943 gmain                 root    3w      REG                8,2       113    5246631 /var/log/unattended-upgrades/unattended-upgrades-shutdown.log.1 (deleted)

OK, so it's holding onto a 134 MB journal file, but that still doesn't explain why 510 GB of the drive is suddenly being taken up.

I also tried some additional searches, like this one, and they turned up nothing helpful either.

I eventually used megacli to check the SMART data from the 2 drives in the RAID-0 array, and they report 0 errors, so it doesn't seem like the array got damaged.
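
Roughly what that check looked like (assuming the usual MegaCli64 install path under /opt/MegaRAID; adjust for your setup):

# Dump per-physical-disk info and pull out the error/SMART counters
$ sudo /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL | grep -iE 'slot|error count|predictive|s.m.a.r.t'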

Any ideas or additional digging tricks I might try to figure out what is sucking up that space?

UPDATE #1 - I noticed when I ran top that buff/cache was almost exactly the same size as the space being consumed on the root drive. I know cache isn't counted as used disk space, but I decided to fire off a quick:

sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"

which took about 3 minutes to run but eventually returned - top now shows buff/cache as < 1k, BUT df -h shows no change in disk usage.

I had hoped it was a mystery cache file on disk or something like that.
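
(For what it's worth, the form I usually see recommended runs sync first, since drop_caches only discards clean cache and won't touch dirty pages - probably not relevant here, but for completeness:)

# Flush dirty pages to disk first, then drop the page cache, dentries and inodes
sudo sh -c "sync; /usr/bin/echo 3 > /proc/sys/vm/drop_caches"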

UPDATE #2 - On the off chance I was "hiding" a massive file from myself by mounting on top of it, I did a `mount -o bind` of my root dir to `/tmp/fake-root` to take a peek at the underlying root and `/mnt` directories, just in case something was buried under there... didn't discover anything. This tip was from: https://unix.stackexchange.com/a/198543/509866
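
Roughly what that looked like (the du at the end is just how I peeked; /tmp/fake-root is simply the scratch mountpoint I used):

# Bind-mount the root filesystem to a scratch dir so anything hidden
# underneath active mountpoints (like /mnt) becomes visible
$ sudo mkdir -p /tmp/fake-root
$ sudo mount -o bind / /tmp/fake-root
$ sudo du -sh /tmp/fake-root/* 2>/dev/null | sort -h
$ sudo umount /tmp/fake-root
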
UPDATE #3 - Fired off a `sudo find / -type f -printf '%s %p\n' 2>&1 | grep -v 'Permission denied' | sort -nr | head -10` and, unfortunately, besides saying that `/proc/kcore` is like 100 EB, it didn't show me any big files I didn't already know about.
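
A variant restricted to the root filesystem (assuming GNU find's -xdev option), which would also skip pseudo-files like /proc/kcore since /proc is a separate filesystem, would be:

# Only walk the filesystem that / lives on, so /proc, /sys, /mnt, etc. are skipped
$ sudo find / -xdev -type f -printf '%s %p\n' 2>/dev/null | sort -nr | head -10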