Score:5

Theoretical Approaches to crack large files encrypted with AES

hk flag

I have a large file (> 200 Gb), that I encrypted a while ago with AES-256-CBC. The file itself is a tar which I ran through openssl. I've forgotten the exact password, but have a general idea of what it is.

Brute force is the easiest way to crack this from what I've seen (given that I have a general idea of what the password might be), but the hitch I've run into is the time it's taking me to actually try each combination. I have a script running on a server, which takes about 15 minutes per candidate before reporting that it's wrong.

I can't help but think there has to be a better way to solve this.

kelalaka avatar
in flag
Well, the easiest way is to decrypt the last block and check that it has valid PKCS#7 padding. Consider that you may even have forgotten that you zipped the archive after the tar. The last-block check is independent of the underlying file type in CBC mode.
Daniel S avatar
ru flag
@kelalaka PKCS#7 padding check has a false positive rate of slightly more than $2^{-8}$ which is not powerful enough for a large password corpus.
kelalaka avatar
in flag
@DanielS The OP already knows roughly what the password combinations are, and the padding check is a great and easy filter that is independent of the file format. Once a hit is reached, they can filter it further with another test. This was already used in the [RSA DES challenges](https://crypto.stackexchange.com/a/64866/18298)
Score:10
mx flag

It sounds like you're decrypting the entire file. That's going to be slow.

The easiest option is to just truncate the encrypted file, try the decryption, and check whether the result is a valid tar file with file -b decrypted_output. Openssl will complain that the file is truncated, but it still writes part of the plaintext, which is enough for file to recognize the format. Truncating the encrypted file to 2k gives a decrypt of 1.4K, which file seems happy with.

NOTE:CHECK THAT THIS WORKS WITH YOUR VERSION OF OPENSSL ON AN ENCRYPTED TAR TEST FILE FIRST!!!

I'm getting roughly 50 ms to decrypt a few KB; mind you, a few MB doesn't take much longer. Invoking file takes a few more ms, but that's pretty tiny.
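A rough sketch of that loop in Python, assuming openssl and file are on the PATH. The function names and the 2 KB head size are illustrative choices, and the extra options must match whatever was used at encryption time (for example -md md5 or -pbkdf2), which is an assumption you have to fill in yourself.

```python
import subprocess
from pathlib import Path

HEAD_BYTES = 2048  # the answer reports 2 KB of ciphertext is enough for `file`


def write_truncated_copy(encrypted: Path, head: Path, size: int = HEAD_BYTES) -> int:
    """Copy only the first `size` bytes of the ciphertext; returns bytes written."""
    with encrypted.open("rb") as src:
        data = src.read(size)
    head.write_bytes(data)
    return len(data)


def openssl_command(password: str, head: Path, out: Path, extra: tuple = ()) -> list:
    """Build the decryption command. `extra` holds the options used at
    encryption time, e.g. ("-md", "md5") or ("-pbkdf2",) (assumption)."""
    return ["openssl", "enc", "-d", "-aes-256-cbc",
            "-pass", f"pass:{password}",
            "-in", str(head), "-out", str(out), *extra]


def password_looks_right(password: str, head: Path, out: Path, extra: tuple = ()) -> bool:
    """Decrypt the truncated copy and ask `file -b` whether it is a tar archive."""
    out.unlink(missing_ok=True)  # never judge a stale output from a previous try
    # openssl exits non-zero on truncated input, so the return code is ignored
    # on purpose; only the partial plaintext it managed to write matters.
    subprocess.run(openssl_command(password, head, out, extra),
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, check=False)
    if not out.exists() or out.stat().st_size == 0:
        return False
    kind = subprocess.run(["file", "-b", str(out)],
                          capture_output=True, text=True, check=False)
    return "tar archive" in kind.stdout
```

The truncated copy is written once, then password_looks_right is called for each candidate. As the note above says, try it first on a test tar encrypted with a known password to confirm that your openssl version writes partial output.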

For other types of files, this approach also works. Entropy statistics could work for files of unknown formats.

Improving efficiency

Doing the key derivation on a GPU would speed things up. OpenSSL doesn't seem to do memory-hard key derivation, which means GPUs will be more efficient. The downside is needing to re-implement all the crypto, versus just truncating and using openssl, which requires close to zero effort.

There's also a bit of efficiency to be gained by looking for a magic constant value such as "ustar\0" (see https://en.wikipedia.org/wiki/Tar_(computing)) instead of invoking file, but decrypting and checking a few KB is negligible compared to the key derivation anyway.
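A minimal sketch of such a magic-value check in Python, assuming the archive uses the ustar (or GNU/pax) header layout, where the magic sits at offset 257 of the first 512-byte header. It also verifies the header checksum, which makes an accidental match on a wrong key practically impossible. The function name is illustrative, and very old v7 tar archives have no such magic.

```python
def looks_like_tar_header(block: bytes) -> bool:
    """Return True if `block` (the first 512 decrypted bytes or more) looks
    like a tar header: "ustar" magic at offset 257 and a matching checksum."""
    if len(block) < 512:
        return False
    header = block[:512]
    if header[257:262] != b"ustar":
        return False
    # The checksum field (offsets 148..155) stores an octal number. The sum is
    # taken over the header with that field counted as eight ASCII spaces.
    field = header[148:156].strip(b"\0 ")
    try:
        stored = int(field.decode("ascii"), 8)
    except (UnicodeDecodeError, ValueError):
        return False
    computed = sum(header[:148]) + 8 * 0x20 + sum(header[156:])
    return stored == computed
```

This can replace the call to file in the inner loop: feed it the first 512 bytes of each trial decryption.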

Score:3
ng flag

The question asks for a theoretical approach, so I'll suppose it is not about using existing tools as in this answer, which would be off-topic; rather, I'll assume custom code for the password recovery. The general sketch is to test candidate passwords, approximately from most to least likely, by

  1. Turning the tested password into a 256-bit AES key.
  2. Testing that key against a portion of the file corresponding to known plaintext.

Notice the large file size is immaterial.

The algorithm for step 1 depends on the version of openssl enc used for encryption, and on the settings used, if any. Older versions of openssl enc derive the key using MD5 and EVP_BytesToKey with the iteration count set to 1, which is a criminal mistake from a security standpoint. The hash changed to SHA-256, reportedly at OpenSSL 1.1.0c. Modern openssl enc can (if option -pbkdf2 or -iter is given) use the PBKDF2 algorithm, with a default iteration count of 10000 unless otherwise specified by the -iter command line option.

Notice that the password derivation is usually salted, in which case the encrypted file starts with the 8 bytes 53 61 6c 74 65 64 5f 5f (Salted__ in ASCII), followed by the 8 bytes of salt, which must be supplied to whatever password-to-key derivation function is used.

PBKDF2 is not memory-hard and thus obsolete for new designs aiming at being secure, and PBKDF2-HMAC-SHA-256 with 10000 iterations gives little protection against GPU, FPGA or ASIC-based password crackers. It is still considerably less unsafe against a CPU-based attack than an iteration count of 1, because of the larger iteration count. The new and old derivations are before and after this else statement (at time of writing).
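Both derivations can be sketched with Python's hashlib. This is an illustration under assumptions, not a copy of the OpenSSL source: the function names are made up, the legacy routine follows the usual description of EVP_BytesToKey with count 1, and the PBKDF2 routine assumes that key and IV are taken from a single 48-byte PBKDF2 output, which is worth verifying against a test file made with your own openssl version.

```python
import hashlib

MAGIC = b"Salted__"
KEY_LEN, IV_LEN = 32, 16  # AES-256 key and CBC IV sizes in bytes


def read_salt(header: bytes):
    """Return the 8-byte salt if the file starts with 'Salted__', else None."""
    if len(header) >= 16 and header[:8] == MAGIC:
        return header[8:16]
    return None


def evp_bytes_to_key(password: bytes, salt: bytes, digest: str = "md5"):
    """Legacy derivation with iteration count 1:
    D_1 = H(password || salt), D_i = H(D_{i-1} || password || salt),
    concatenated until there are enough bytes for key and IV."""
    material, block = b"", b""
    while len(material) < KEY_LEN + IV_LEN:
        block = hashlib.new(digest, block + password + salt).digest()
        material += block
    return material[:KEY_LEN], material[KEY_LEN:KEY_LEN + IV_LEN]


def pbkdf2_key_iv(password: bytes, salt: bytes,
                  iterations: int = 10000, digest: str = "sha256"):
    """PBKDF2 derivation; key and IV are assumed to be split from one output."""
    material = hashlib.pbkdf2_hmac(digest, password, salt,
                                   iterations, KEY_LEN + IV_LEN)
    return material[:KEY_LEN], material[KEY_LEN:]
```

For an old file one would try evp_bytes_to_key with "md5" and then "sha256"; for an unsalted file, pass an empty salt. Only the key is needed for the known-plaintext test below, since the blocks tested are not the first one.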

In step 2, we need to find known plaintext. In the case of a TAR file we have two options

  • Every tar file has a size that is a multiple of 512 bytes. PKCS#7 padding is used by openssl enc, thus the CBC-encrypted file will have a size modulo 512 of either 32 if salt was used, or 16 if not (which we can detect as above); and the last block of the padded plaintext is 16 times the byte 0x10.
  • Typically, a tar file has 16 times the byte 0x00 at offsets 80…95 (because that's zero-padding for a file name). These bytes will be at offsets 96…111 of the encrypted file if salt was used, or 80…95 if not (which we can detect as above).
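The offsets in these two options can be computed mechanically. The helper below is a small sketch with made-up names: it takes the encrypted file's size and its first 16 bytes, reports whether the size is consistent with a padded tar, and lists where each known plaintext block sits in the ciphertext.

```python
BLOCK = 16  # AES block size in bytes


def known_plaintext_targets(file_size: int, first_16: bytes) -> dict:
    """Locate the ciphertext blocks whose plaintext is known for a tar file
    encrypted with `openssl enc` in CBC mode with PKCS#7 padding."""
    salted = first_16[:8] == b"Salted__"
    header_len = 16 if salted else 0   # "Salted__" plus 8 bytes of salt
    expected_mod = 32 if salted else 16
    return {
        "salted": salted,
        # tar size is a multiple of 512; add 16 of padding, plus 16 if salted
        "size_consistent": file_size % 512 == expected_mod,
        "targets": [
            # (label, offset of the ciphertext block, expected plaintext)
            ("padding", file_size - BLOCK, bytes([0x10]) * BLOCK),
            ("name_zeros", header_len + 80, bytes(BLOCK)),
        ],
    }
```

If size_consistent comes out False, the plaintext is probably not a bare tar (it may have been compressed first), and the padding target should not be trusted.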

That known plaintext is easily tested since CBC is used: we can decipher with AES-256 (the block cipher) and candidate key the 16-byte ciphertext block that corresponds to the known plaintext block, XOR with the previous block, and compare to the known plaintext block. If there's no match, the key was wrong, thus the password was wrong. False positives are so improbable that they can be ruled out.
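A sketch of that single-block test in Python follows. Python's standard library has no AES, so decrypt_block here is a placeholder parameter: any callable that decrypts one 16-byte block under the candidate key (for instance single-block AES-ECB decryption from a crypto library). The other names are illustrative too.

```python
from typing import Callable

BLOCK = 16  # AES block size in bytes


def read_block_pair(path: str, offset: int):
    """Read the ciphertext block at `offset` and the block just before it.
    `offset` must be at least 16; the first block would need the IV instead."""
    with open(path, "rb") as f:
        f.seek(offset - BLOCK)
        data = f.read(2 * BLOCK)
    return data[:BLOCK], data[BLOCK:]


def cbc_block_matches(decrypt_block: Callable[[bytes, bytes], bytes],
                      key: bytes, prev_block: bytes,
                      cipher_block: bytes, expected: bytes) -> bool:
    """CBC decryption of one block: P = D_key(C) XOR C_previous.
    Returns True when the result equals the known plaintext block."""
    raw = decrypt_block(key, cipher_block)
    plain = bytes(a ^ b for a, b in zip(raw, prev_block))
    return plain == expected
```

Since the two 32-byte ciphertext excerpts are read once and reused for every candidate, the cost per password is one key derivation plus one block decryption, regardless of the file size.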

For other file formats, the recognition of a correct key at step 2 could be more difficult. E.g. tar.gz files can happen to have a byte size modulo 16 equal to 15, so that the known padding is a single byte 0x01, and a test based on that alone has a false positive rate of about 0.4% (1 in 256). However, like most common file formats, they have some fixed or recognizable bytes in the header, thus a reliable test remains possible.
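When the plaintext length is not known in advance, the generic check is whether the last decrypted block ends in any valid PKCS#7 padding. The sketch below (function names are illustrative) implements that check and computes exactly how often a uniformly random block passes it, which is the false positive rate of a padding-only filter.

```python
from fractions import Fraction

BLOCK = 16  # AES block size in bytes


def pkcs7_pad_length(block: bytes) -> int:
    """Return the padding length (1..16) if the block ends in valid PKCS#7
    padding, or 0 if it does not."""
    if len(block) != BLOCK:
        return 0
    n = block[-1]
    if 1 <= n <= BLOCK and block[-n:] == bytes([n]) * n:
        return n
    return 0


def random_block_valid_probability() -> Fraction:
    """Chance that a uniformly random block has valid padding: the last k
    bytes must all equal k, for exactly one k in 1..16."""
    return sum(Fraction(1, 256 ** k) for k in range(1, BLOCK + 1))
```

The result is slightly more than 1/256, so a padding-only filter still lets through about one wrong password in 256; survivors then need a second test on header bytes.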

David Ljung Madison Stellar avatar
ky flag
You should look up which version of tar you used, to check the exact format. Also, if you happen to know the files that were encrypted, and their likely modes and uid/gid, then you have many more bits you can check for a candidate password. For example, for FreeBSD tar: https://man.freebsd.org/cgi/man.cgi?tar(5) You can probably guess the uid/gid, and the linkflag is probably null with an empty linkname, at least for the first entry (but you could do some tar tests to confirm this).
kelalaka avatar
in flag
Did you forget the last block? It is much easier to test, regardless of the file format. Also consider that the OP may have forgotten that they zipped the file after the tar...
fgrieu avatar
ng flag
@kelalaka: indeed, the padding allows a test. I added that.
Score:-1
sa flag

If there were a shortcut for doing this that is not generally known, it would imply a serious new weakness either in AES or in the implementation of CBC in openssl.

If you knew the IV you could decrypt block by block. To do this you would have had to have supplied the IV yourself. See openssl man page here

CBC by definition requires an IV. However, sometimes it is not passed directly, but derived together with the key from the password (and a randomly generated salt during encryption) using a key derivation function (namely PBKDF2). If you want to specify the key and IV directly, you must use the -K and -iv options.

Is this what you did? If yes, did you record the IV or can you guess it?

fgrieu avatar
ng flag
When encrypting in CBC mode, losing the IV (in the sense that term has for CBC) only loses the first block. In the case of a tar file, that's a file name.