Well, I figure that if you really trim things down [1], you could get the signature down to just under 100 bytes. Here are how I adjusted your requirements:
I want 128 bit security or better (although its not a hard requirement).
Since you said it wasn't a hard requirement, I reset that to 80 bits security (if you do this with 128 bit security, you get a signature of about 200 bytes). By targeting 80 bits security, we can use 80 bit hashes (for example, SHA-256 truncated to 80 bits - we could try to use a faster hash function with a lesser security level, however that speeds things up by only a modest factor, so I don't expect that buys us much).
I will be signing blocks of data that are 128 bytes or smaller.
Not an issue
The signature scheme doesn't need to be widely-used, standardized by any authoritative body, or compatible with any other systems.
Nobody is insane enough to standardize this approach
Efficiency of computing the keys or signature is not a major factor, just as long as it doesn't take more than say a minute or two to do either for a 128 byte block.
Well, computing the public key is the long pole here, so I'll take a 2 minute limit on that. However, you didn't say what the hardware was, so I'll assume decently aggressive hardware there - 8 fast cores with AVX512 hardware (actually, we could go more aggressive with dedicated hardware, however that wouldn't allow us to shrink the signature that much)
Lets assume that I only want to issue around 10~20 signatures and then throw away the private key forever.
I'll assume that 20 signatures is the max (a max of 16 would allow us to reduce the signature size by another 10 bytes).
Ok, here's what we do: we start with an LMS/XMSS design (either modifying LMS to allow a larger W factor, or removing the bitmasking of XMSS and stir in the location another way); then:
We have a target of 20 signatures; this implies a Merkle height of 5 (which we would be able to trim back to 4 if we could assume a maximum of 16 signatures)
A measurement on my Xeon processor shows that a single core can, with its AVX2 instructions, compute 6.6M hashes per second; if we were to use a faster processor with AVX512 instructions, we could probably get that up to 20M hashes per second. We have 8 cores, and we have 120 seconds to compute keys, so we have time to compute not quite 20 billion hashes.
Now, we need to generate 20 one-time-signature public keys [2] (and everything else is comparatively trivial) - so, we have time for almost 1 billion hashes per one-time-signature
If we use a Winternitz value of $W \approx 200,000,000$, that would allow us to encode about 27 bits. Now, the hash that the one-time signature signs is 80 bits; that means that, with constant sum encoding, we should easily generate 4 base-200,000,000 digits (method; feed the value being signed into SHAKE; generate 3 base-200,000,000 digits; see if there is a fourth digit such that the sum is the target; if not, then go back and generate more digits)
- Actually, because computing individual winternitz chains is not parallelizable, the timing taken (with the given hardware estimates) would actually be about 2 minutes 40 seconds. I feel that's still within the "[no] more than a minute or two" requirement (and if not, get slightly faster hardware...)
So, the signature would consist of:
The one-time signature of the hash; 4 hashes of 10 bytes each = 40 bytes
The authentication path; 5 hashes of 10 bytes each = 50 bytes
Hash randomizer (which you need to prevent possible collision attacks; this is randomness you stir into the message when you hash it); 40 bits = 5 bytes
The index (actually, you could drop this, and just have the verifier check all possible indices; I didn't think that a 1 byte savings was worth it); 1 byte
Total: 96 bytes (with a possibility of 86 bytes if you only need to generate 16 signatures)
[1]: Have you seen the movie "The Martian" (or better yet, read the book - it goes into more details)? If so, what I'm doing feels a lot like what Mark Watney did to the MAV...
[2]: For those Merkle leaf nodes that we'll never use, we don't actually have to compute those one-time public keys - instead, we can just put in dummy values. We won't be able to sign with them - we never intend to