How might the strength of this pen-and-paper-and-mind cipher compare to, say, AES-256, in terms of difficulty to break?
It depends a lot on the setup and what we consider a "break". Here is one case where the human-generated OTP is blatantly insecure.
The watchtower of a military installation sends a report to command every hour. It's critical that eavesdroppers can't tell whether the watchtower has observed something out of the ordinary. Unusually long messages would be a telltale sign of that, so all messages are made the same length (say 1000 characters) by padding with space characters. When there's nothing to tell, which is by far the most common case, the messages consist of space characters, except perhaps for the first few characters.
All that traffic is OTP-encrypted: each week, command manually prepares a week's worth of One Time Pads (about 170, one per hourly report), each 1000 characters long and made in two identical copies. One copy is kept at command, the other is securely conveyed to the watchtower. The pads are indexed so that their order is well-defined. Each side securely stores the pads until they are used. Agreeing on a single AES key of about 24 characters every week would be much more convenient (which is the reason ciphers were invented in the first place).
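To make the setup concrete, here is a minimal sketch of the padding and the pad arithmetic. The 27-symbol alphabet, the modular-addition convention, and all function names are assumptions made for illustration; the scenario above does not fix them.

```python
import secrets

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "  # assumed alphabet: 26 letters plus space
N = len(ALPHABET)
MESSAGE_LENGTH = 1000  # fixed report length from the scenario


def pad_message(text):
    """Right-pad a report with spaces so every report has the same length."""
    if len(text) > MESSAGE_LENGTH:
        raise ValueError("report too long for one pad")
    return text.ljust(MESSAGE_LENGTH)


def combine(text, pad, sign):
    """Add (sign=+1) or subtract (sign=-1) pad symbols modulo the alphabet size."""
    if len(text) != len(pad):
        raise ValueError("text and pad must have the same length")
    return "".join(
        ALPHABET[(ALPHABET.index(t) + sign * ALPHABET.index(k)) % N]
        for t, k in zip(text, pad)
    )


def encrypt(message, pad):
    return combine(pad_message(message), pad, +1)


def decrypt(ciphertext, pad):
    # Trailing spaces are padding, so they are stripped after decryption.
    return combine(ciphertext, pad, -1).rstrip()


# A machine-generated pad stands in here for one of the weekly pads.
pad = "".join(secrets.choice(ALPHABET) for _ in range(MESSAGE_LENGTH))
ciphertext = encrypt("NOTHING TO REPORT", pad)
recovered = decrypt(ciphertext, pad)
```

Every ciphertext is exactly 1000 symbols long, so message length reveals nothing; whatever leaks must come from the pad itself.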
An eavesdropper can take the intercepted 1000-character messages and submit each one to some statistical test: a two-tailed chi-squared test of the frequency of individual characters would do. If the OTPs have been generated by a human without some form of mechanical help, the test will detect some bias (see this other answer for references), to a high degree of confidence measured as a p-value. That confidence will typically be much higher for messages consisting essentially of spaces than for genuine messages conveying observations, because a constant plaintext character carries the pad's bias into the ciphertext unchanged, whereas varied plaintext smears it. In our setup, this is a break.
I'm not saying that in this setup the messages conveying actual information could be fully decoded (though perhaps it could be told, with some degree of confidence, whether they contained a certain keyword). Full decoding would require that some cardinal rule of the One Time Pad be breached: pads deterministically generated, or reused. Both have reportedly happened and allowed reading some meaningful messages.
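The test described above can be sketched as follows. This is only an illustration under stated assumptions: the 27-symbol alphabet, the modular-addition OTP, and especially the bias model (vowels drawn twice as often as other symbols) are hypothetical stand-ins, not measurements of real human-generated pads.

```python
import random
from collections import Counter

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "  # assumed alphabet: 26 letters plus space
N = len(ALPHABET)


def encrypt(message, pad):
    """OTP by modular addition of symbol indices (an assumed convention)."""
    return "".join(
        ALPHABET[(ALPHABET.index(m) + ALPHABET.index(k)) % N]
        for m, k in zip(message, pad)
    )


def chi_squared(text):
    """Pearson chi-squared statistic of symbol counts against a uniform model."""
    expected = len(text) / N
    counts = Counter(text)
    return sum((counts.get(c, 0) - expected) ** 2 / expected for c in ALPHABET)


def biased_pad(rng, length):
    # Hypothetical model of human bias: vowels are picked twice as often.
    weights = [2.0 if c in "AEIOU" else 1.0 for c in ALPHABET]
    return "".join(rng.choices(ALPHABET, weights=weights, k=length))


def uniform_pad(rng, length):
    return "".join(rng.choice(ALPHABET) for _ in range(length))


rng = random.Random(1)  # seeded only to make the illustration repeatable
quiet = "OK".ljust(1000)  # a "nothing to report" message, mostly spaces
busy = ("ENEMY COLUMN SIGHTED NORTH OF THE RIDGE MOVING EAST " * 20)[:1000]

stat_quiet_biased = chi_squared(encrypt(quiet, biased_pad(rng, 1000)))
stat_busy_biased = chi_squared(encrypt(busy, biased_pad(rng, 1000)))
stat_quiet_uniform = chi_squared(encrypt(quiet, uniform_pad(rng, 1000)))

print(stat_quiet_biased, stat_busy_biased, stat_quiet_uniform)
```

With 27 symbols the statistic has 26 degrees of freedom, so for a properly random pad it should hover around 26. A quiet message under a biased pad should land far above that, and a busy message under a biased pad should typically land somewhere in between.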
Also, not all methods of generating the pads are equally insecure. A modest degree of mechanization makes it possible to produce good pads. For example, put Scrabble-like tiles, one tile per character, in an opaque box. Shake the box before each draw, and immediately return the drawn tile to the box. After drawing some number of pads, check that the box still has exactly one tile of every character (discard the pads if not). At two draws per second (including writing down the two pads with a carbon copy), the 170,000 weekly draws require about three 8-hour shifts per week.
The main issue with the OTP is not that it's insecure if used correctly. It's that it is utterly inconvenient, and thus tends not to be used, or to be used incorrectly.
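The workload estimate can be checked with a few lines of arithmetic. The variable names are mine; the figures are the ones used above.

```python
reports_per_week = 24 * 7        # hourly reports: 168, rounded up to 170 above
pads_per_week = 170
characters_per_pad = 1000
draws_per_second = 2             # includes writing down both carbon copies

draws_per_week = pads_per_week * characters_per_pad   # 170,000 tile draws
hours = draws_per_week / draws_per_second / 3600      # seconds converted to hours
shifts = hours / 8                                    # 8-hour shifts

print(f"{draws_per_week} draws, {hours:.1f} hours, {shifts:.2f} shifts")
```

That comes to a little under 24 hours of drawing per week, which is where the figure of about three 8-hour shifts comes from.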