Score:0

Automated way of testing network read/writes?


I have the following set up:

I have clients using software that "is written using standard Windows APIs for reads and writes", reading and writing files to a Windows server. The clients are all Windows 10 machines, and they all use vault software called PDM from Solidworks.

The server is a Windows 2016 server running the PDM server software.

The basic workflow is that the user works on a file locally. When they check the file into the vault, the file is transferred from their hard drive to the server software. The server renames the file and saves it to a folder. As I do not have access to the code, I am unable to determine exactly how this is done; I believe the rename is meant to keep users from "messing" with the file themselves, since it is stored under a cryptic folder and file naming structure.

We are seeing sporadic issues with files that turn out to be corrupt when loaded at some point in the future. All of these "corrupt" files can apparently be recovered using a tedious and lengthy hand-loading procedure. Since this issue affects my data vault, I want to track it down.

According to the vault support people, "95% of the time these issues are with the server or network and not with the vault server software".

Do you network admins know of a way to repeatedly read and write files to and from a client/server in order to test for issues with file transfers over a network? My thought is to run a client/server process that transfers files many, many times and checks hashes or something similar.

Score:0
frr

Checking file transfers over a network: use an MS file server and two clients (so that possible local caching at the client does not skew your results).

  • On client machine #1, generate files (with random contents), save them to the server, and at the same time calculate a checksum (say, md5sum) for each test file; maybe just append the sums, row after row, to a "sum handover index file" on the same server.

  • On client machine #2, load the files one by one from the server and calculate their checksums. You could simply run a checksumming tool, such as md5sum, against each file on a mapped network share (from the client #2 machine), then compare the checksums with what the source machine produced.

This is just a basic idea. You will need to do some scripting to automate the checks and let them run over a weekend or so; a rough sketch follows below. And if the bit rot doesn't happen immediately, but instead develops on the disk drives over time, this kind of quick check may not find anything.
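For illustration, here is a minimal sketch of what such scripting could look like (Python, standard library only; the share path \\server\testshare and the index file name are placeholders for whatever you actually map). Run it in "write" mode on client #1 and in "verify" mode on client #2:

    # Rough sketch only: "write" mode creates random test files on the share
    # and records their hashes; "verify" mode re-reads them from the share
    # and checks the hashes. Share and index paths are placeholders.
    import argparse
    import hashlib
    import os
    import secrets

    def sha256_of(path, chunk=1024 * 1024):
        # Hash a file in chunks so large files don't need to fit in memory.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(chunk), b""):
                h.update(block)
        return h.hexdigest()

    def write_files(share, index_file, count, size_bytes):
        # Client #1: generate random files, write them to the share, and
        # append "hash  filename" rows to the handover index file.
        with open(index_file, "a", encoding="utf-8") as idx:
            for i in range(count):
                name = f"testfile_{i:06d}.bin"
                data = secrets.token_bytes(size_bytes)
                with open(os.path.join(share, name), "wb") as f:
                    f.write(data)
                idx.write(f"{hashlib.sha256(data).hexdigest()}  {name}\n")

    def verify_files(share, index_file):
        # Client #2: re-read every file from the share and compare its hash
        # against what client #1 recorded.
        bad = 0
        with open(index_file, "r", encoding="utf-8") as idx:
            for line in idx:
                expected, name = line.strip().split(None, 1)
                if sha256_of(os.path.join(share, name)) != expected:
                    bad += 1
                    print(f"MISMATCH: {name}")
        print(f"done, {bad} mismatching file(s)")

    if __name__ == "__main__":
        p = argparse.ArgumentParser()
        p.add_argument("mode", choices=["write", "verify"])
        p.add_argument("--share", default=r"\\server\testshare")           # placeholder
        p.add_argument("--index", default=r"\\server\testshare\sums.txt")  # placeholder
        p.add_argument("--count", type=int, default=100)
        p.add_argument("--size", type=int, default=10 * 1024 * 1024)       # 10 MB per file
        args = p.parse_args()
        if args.mode == "write":
            write_files(args.share, args.index, args.count, args.size)
        else:
            verify_files(args.share, args.index)

SHA-256 is used here instead of md5 only because both ship with Python's hashlib anyway; for detecting accidental corruption, either is fine.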

If you have a particular data corruption incident, is there any way to get your hands on both the original file and the corrupt file that should have been identical? What format are those files? Are they machine-readable text (such as XML) or something binary? Perhaps the contents could be compared, to see exactly what the corruption looks like. The nature of the corruption could provide further hints as to where it is coming from.

Apart from the classic "diff", which works on ASCII text, I recall there are tools specifically for binary comparison.
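If you do get hold of both copies of a file, even a quick home-made byte-level comparison can be telling, e.g. whether the damage is a single flipped bit, a zeroed-out block, or a truncation. A rough sketch (Python, standard library only, not any particular off-the-shelf tool):

    # Sketch: report the byte offsets where two supposedly identical files
    # differ, to characterize the corruption (flipped bits, zeroed block,
    # truncation, ...).
    import sys

    def compare(path_a, path_b, chunk=1024 * 1024, max_reports=20):
        reported = 0
        offset = 0
        with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
            while True:
                a = fa.read(chunk)
                b = fb.read(chunk)
                if not a and not b:
                    break
                common = min(len(a), len(b))
                for i in range(common):
                    if a[i] != b[i]:
                        print(f"offset {offset + i}: {a[i]:#04x} != {b[i]:#04x}")
                        reported += 1
                        if reported >= max_reports:
                            print("... (further differences suppressed)")
                            return
                if len(a) != len(b):
                    print(f"one file is truncated at offset {offset + common}")
                    return
                offset += common
        if reported == 0:
            print("files are byte-for-byte identical")

    if __name__ == "__main__":
        compare(sys.argv[1], sys.argv[2])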

Also, does the storage server run backups? The details depend on the backup scheme, but the point is that if an old server-side backup contains a healthy copy and the current file on the server is broken, that considerably narrows down your problem. And to extend this topic of backups: if the setup has a problem with data corruption and doesn't have a bulletproof backup scheme in place, what more reason does your organization need to work on one?

Even if you no longer have the healthy original file: if the corrupt file ultimately gets rejected by a piece of software parsing it, it would be lovely to get a more detailed debug log and see where the parser grinds to a halt, i.e. where the file is no longer "well formed" - but I understand that with closed-source software working on a closed file format you typically don't get that opportunity.

People say that this is exactly the purpose of enterprise storage hardware that maintains "integrity metadata" all along the "signal chain", i.e. modern versions of SCSI and derived interconnect technologies (FC, SAS) and the corresponding class of spinning rust; I'm not sure whether there are SSDs with that capability. Rather than providing particular pointers, I suggest you ask Google about data integrity in Linux. Chances are that this is exactly what has bitten you, at the hardware level. How much do you know about the underlying storage subsystem in your server?

And, although this is a slim chance: if you generally have a problem opening old files, does the application software working with those files get updated? Could this be a case of an application update breaking compatibility with old data? If you don't keep backups of the whole files, could you perhaps arrange to have just a checksum noted somewhere, along with the filename, size and a timestamp, to see whether a file retrieved six months later is the same or different?
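That checksum-noting idea is cheap to put in place. As a sketch (Python, standard library only; both paths below are placeholders you would point at the vault's server-side archive folder and wherever you keep the snapshot), something like this could run on a schedule and write a record you can compare against months later:

    # Sketch: walk a folder tree and record path, size, mtime and checksum for
    # every file, so a copy retrieved months later can be compared against what
    # was originally stored. Both paths at the bottom are placeholders.
    import csv
    import hashlib
    import os
    import time

    def sha256_of(path, chunk=1024 * 1024):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(chunk), b""):
                h.update(block)
        return h.hexdigest()

    def snapshot(root, out_csv):
        with open(out_csv, "w", newline="", encoding="utf-8") as f:
            w = csv.writer(f)
            w.writerow(["path", "size", "mtime", "sha256"])
            for dirpath, _dirnames, filenames in os.walk(root):
                for name in filenames:
                    full = os.path.join(dirpath, name)
                    st = os.stat(full)
                    w.writerow([
                        os.path.relpath(full, root),
                        st.st_size,
                        time.strftime("%Y-%m-%dT%H:%M:%S", time.localtime(st.st_mtime)),
                        sha256_of(full),
                    ])

    if __name__ == "__main__":
        snapshot(r"D:\PDM_Archive", r"D:\pdm_archive_snapshot.csv")  # placeholders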

Saved by a "tedious hand-loading procedure" - exactly what is that? Some recovery algorithm working on the broken file, or just retrieving the healthy original from a correspondingly old backup? Either case has some potential to reveal more about the nature of the problem: either you'll be able to compare the broken file to a good original, or you may learn what the "recovery" process actually does to the "corrupt" file.
