Score:0

Linux server power outage - data integrity and OS corruption checks

What are the best-practice steps to check for, and recover from, potential Linux OS corruption after an unexpected power outage (or VM host failure)?

Of course it "depends" on the installation and setup, but I'm looking for common actions / checks to perform for the majority of Linux distributions (Debian, Ubuntu, Mint, other) and file systems (XFS, ZFS, EXT4, vfat).

This is not about preventing ungraceful shutdowns - it's about dealing with when they do happen and to try and ensure best case recovery.

I'm aware that OSes tend to detect when a filesystem wasn't unmounted (as it would be during a graceful shutdown) and therefore automatically perform checks on boot-up, but what are those checks and how do I perform them manually?

E.g. e2fsck -f is one such tool, but for the uninitiated, when can/should it be used and when not (or won't it work)?

As an example, in Windows you might do:

  • Check for NTFS file system corruption using the old chkdsk or its modern equivalent, Repair-Volume (in PowerShell)
  • Verify integrity of OS core system files using sfc.exe /scannow

Application-specific verification/recovery steps, such as for a MySQL database or LDAP directories, fall outside the scope of this question, unless it's a very common thing like an OS-level database, e.g. the apt or snap databases.

What do you do?

Score:1

Modern filesystems journal their metadata, meaning that a plain power outage should not pose any issue for the integrity of the filesystem itself: transactions that were fully recorded in the journal but not yet applied are replayed, while partial transactions are discarded.

In-flight or cached data, however, can be lost or partially written. After all, if an application hands some data to the OS for an async write (i.e. a normal write) but the machine loses power before the OS can write it back to permanent storage, that data will be lost.

For this very reason, critical applications such as databases (except those using MyISAM) implement their own journaling and write data with sync semantics, using fsync() and the like.
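
To illustrate what "sync semantics" means in practice, here is a minimal Python sketch of a durable write on a POSIX system. The function name `durable_write` and the write-to-temp-then-rename pattern are assumptions chosen for illustration, not how any particular database does it, and the guarantee only holds if the storage stack honors flush requests.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Hypothetical helper: replace `path` with `data`, fsync-ing along the way."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the final rename
    # stays on one filesystem and is therefore atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # move Python's buffer into the OS page cache
            os.fsync(tmp.fileno())  # ask the OS to flush the page cache to storage
        os.replace(tmp_path, path)  # atomically swap the new file into place
    except BaseException:
        # Do not leave a stray temp file behind if anything failed.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Sync the directory as well, so the rename itself is on stable storage.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

A plain `open(path, "wb").write(data)` without the `fsync()` calls is the "async write" case described above: the call returns as soon as the data is in the page cache, so a power cut shortly afterwards can lose it.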

In short: an unplanned shutdown generally does not require any filesystem repair (beyond automatic journal replay). Application checks depend on the application itself; most databases are unaffected by a sudden power loss, unless they use MyISAM, which requires running mysqlcheck.

Jaans avatar
Thank you for that. It seems generally accepted that this is the case with journaled filesystems, which even Microsoft's NTFS is (in the comparison above). Are there any Linux OS / package verification or even checksum functions available to validate that a file hasn't been corrupted (or maybe even modified by some rootkit / malware)?
shodanshok avatar
On Windows you can use `sfc /scannow` to check the integrity of system files, while you can use [FCIV](https://en.wikibooks.org/wiki/File_Checksum_Integrity_Verifier_(FCIV)_Examples) for other files. Please note that this method does not work for rapidly changing files (e.g. database files), simply because even normal operations *will* change the checksum.
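
The same checksum-comparison idea can be sketched on Linux in a few lines of Python. This is a generic illustration, not FCIV or any distribution tool: the function names (`sha256_of`, `build_manifest`, `compare`) are hypothetical, and it assumes a baseline manifest was taken before the outage and stored somewhere the affected machine could not alter. Debian-based systems also ship `dpkg --verify`, which compares installed package files against the package metadata.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each regular file under `root` (relative path) to its digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symlinks and special files; only hash regular files.
            if os.path.isfile(full) and not os.path.islink(full):
                manifest[os.path.relpath(full, root)] = sha256_of(full)
    return manifest


def compare(baseline, current):
    """Report files whose digest changed, that disappeared, or that are new."""
    return {
        "changed": sorted(p for p in baseline
                          if p in current and baseline[p] != current[p]),
        "missing": sorted(p for p in baseline if p not in current),
        "added": sorted(p for p in current if p not in baseline),
    }
```

Typical use would be `compare(saved_baseline, build_manifest("/some/dir"))` after the machine comes back up. As noted above, files that change during normal operation will always show up as "changed", so this only makes sense for static content such as binaries and libraries.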