The challenging part is mostly about image processing, and not a lot about crypto.
You want to extract from the image sufficient entropy in a reliable fashion.
It has a lot to do with how you use the image and your adversary model.
If your adversary knows nothing at all about the image, some simple coarse features could be sufficient. If your adversary knows a lot about the image, e.g. has gotten a glance at it or knows approximately where it was taken, you need to be more careful about what information is extracted from the image, and you will need finer-grained image features, which are harder to extract in a stable fashion.
If the image is scanned on the same high-quality scanner each time it is used, and is kept safely between uses so it doesn't fade, wrinkle or accumulate dust, it is easier to get scans very close to each other, and simple automatic alignment and discretization (spatial and color) should yield almost the same bit sequence each time.
Then the question is what error model we have for the scan results. Do we expect Gaussian noise? Salt-and-pepper noise? Alignment noise? Rotation? Addition of large continuous patches of noise? Lighting noise?
Each type of noise can be dealt with differently.
A general outline for a solution: use image processing techniques to remove most of the noise and move to a representation that eliminates the rest, then restrict the space to a set of valid points and snap to the valid point nearest to what we have, bringing the noise down to zero.
We discretize aggressively enough, and pick valid points that are sparse enough, to reach zero noise reliably. At this point we should still have much more than the required key length, but in a space still closely related to the original image, so the bits will be biased and correlated.
Applying a cryptographic hash to that data sorts this out and gives us sufficient high-quality key material, derived reliably, so we get exactly the same key every time the image is scanned. This could be used as, e.g., an AES key.
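As a minimal sketch of that step (assuming stable_bits is the boolean array recovered identically from every scan; the name is mine, not from the post):

import hashlib
import numpy as np

def derive_aes_key(stable_bits: np.ndarray) -> bytes:
    # Pack the biased/correlated bits into bytes and let SHA-256 act as the extractor,
    # giving a 256-bit value usable directly as an AES-256 key.
    return hashlib.sha256(np.packbits(stable_bits).tobytes()).digest()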
If you want to create an RSA key you will need many more random bits. However, you can extract as many bits as possible while still reliably getting the same bits every time, use them to seed a cryptographic PRNG, and use that to generate an RSA private key.
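A sketch of that, assuming PyCryptodome (whose RSA.generate accepts a custom randfunc) and using SHAKE-256 over the image-derived seed as the deterministic PRNG:

import hashlib
from Crypto.PublicKey import RSA  # PyCryptodome

def rsa_key_from_seed(seed: bytes, bits: int = 2048):
    # Deterministically generate an RSA key from the seed (sketch, not optimized).
    xof = hashlib.shake_256(seed)
    taken = 0

    def randfunc(n: int) -> bytes:
        # Serve the next n bytes of the SHAKE-256(seed) output stream.
        nonlocal taken
        taken += n
        return xof.digest(taken)[-n:]

    return RSA.generate(bits, randfunc=randfunc)

Since the key is then fully determined by the seed, the seed must carry at least as much entropy as the security level you expect from the RSA key.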
Edit: I didn't try to implement a full solution, but I did open a notebook and play with the noise model suggested, gaussian noise and shifts I believe are corrected easily, so I checked what happens if I rotate the image (with fancy interpolations) by 2 degrees and rotate back by 1.8 degrees I got a maximal diff (on the image above) of 33%, this is supportaive of my claim that by identifying best counter rotation and shift, lowering resolution and quantizing aggressively ignoring edges we should be able to get 1-2 bits per channel per ~25 pixel regions. For the above image it comes out at least 36K bits, and after hashing I bet this will have 128 bits of actual entropy
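That experiment is roughly the following (a sketch; "photo.png" is a placeholder filename for the image above):

import numpy as np
from skimage import io, transform
from skimage.util import img_as_float

img = img_as_float(io.imread("photo.png"))
# Rotate by 2 degrees (with interpolation), then counter-rotate by only 1.8 degrees,
# leaving a residual 0.2-degree misalignment plus interpolation noise.
noisy = transform.rotate(transform.rotate(img, angle=2.0), angle=-1.8)
crop = (slice(20, -20), slice(20, -20))  # ignore the borders, where rotation fill artifacts live
print(np.abs(noisy[crop] - img[crop]).max())  # maximal per-pixel difference; ~33% on the image above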
Edit 2: I downloaded the provided images of the greyscale scans and played with them. I semi-automatically aligned and rotated the first two images:
from skimage import io, transform

# Load two scans of the same photo, undo each scan's rotation,
# shift/crop them into alignment, then downsample 10x and trim the borders.
img = io.imread("scans/scan078.tif")
img2 = io.imread("scans/scan079.tif")
imgr = transform.rotate(img, angle=-0.78)
imgr2 = transform.rotate(img2, angle=-0.805)
tr1 = transform.rescale(imgr[:-10, :-6], 0.1)[20:-20, 20:-20]
tr2 = transform.rescale(imgr2[10:, 6:], 0.1)[20:-20, 20:-20]
This reads each image, rotates it, aligns and crops, downsamples 10x, and crops again to get rid of edges which may have artifacts.
This gives a maximal difference of less than 6% per pixel value, which is pretty good. However, that 6% difference can easily straddle whatever cut-off we choose, so even quantizing aggressively doesn't give zero errors.
# Quantize aggressively: threshold to one bit per pixel.
bin1 = tr1 > 0.5
bin2 = tr2 > 0.5
This gave a difference of 103 bits out of 27248 bits, or 0.37%. These errors appear to be reasonably spread out.
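The difference count quoted here is just:

ndiff = (bin1 != bin2).sum()                 # 103 differing bits
print(ndiff, bin1.size, ndiff / bin1.size)   # 103, 27248, ~0.37%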
This aggressive resizing and quantizing loses a lot of information, but we probably still have enough.
This is what the image looks like:
The errors are fairly well spread out (and we can always apply a fixed permutation or use larger symbols if needed). So now we can apply an error-correction step (e.g. Reed-Solomon), using just the decoding step (I didn't actually do this), and we should get the same output from either image with high likelihood while still keeping ~20K bits.
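One concrete way to realize that decode-only step (my choice of construction, not something tested in the post) is the standard code-offset secure sketch, shown here with a simple repetition code instead of Reed-Solomon and reusing bin1/bin2 from above: enrollment stores public helper data that is the XOR of the scan's bits with a fresh random codeword; a later scan XORs the helper data back and majority-decodes each block to recover the same secret bits.

import numpy as np

R = 9  # repetition factor; well-spread 0.37% errors almost never flip 5+ bits in a block of 9

def enroll(bits, rng):
    # Pick random secret "message" bits; helper = scan bits XOR repeated message.
    bits = bits[: bits.size - bits.size % R]
    msg = rng.integers(0, 2, bits.size // R, dtype=np.uint8)
    helper = bits.astype(np.uint8) ^ np.repeat(msg, R)
    return msg, helper  # msg stays secret, helper can be stored publicly

def recover(bits, helper):
    # XOR off the helper, then majority-decode each block of R bits.
    noisy_codeword = bits[: helper.size].astype(np.uint8) ^ helper
    return (noisy_codeword.reshape(-1, R).sum(axis=1) > R // 2).astype(np.uint8)

rng = np.random.default_rng()
msg, helper = enroll(bin1.ravel(), rng)
assert np.array_equal(recover(bin2.ravel(), helper), msg)  # same secret from either scan

The repetition code is wasteful (it keeps roughly 3K of the 27K bits, versus the ~20K estimate above for a Reed-Solomon code over larger symbols), and the helper data leaks the within-block parities of the scan, which is the usual entropy cost of a secure sketch; the final key should therefore be a hash of the recovered message, as described earlier.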
If we downscale 5x instead of 10x we get 816 differing bits, but 4x as many bits overall, at a 0.6% difference rate. One can play with this and find the optimum.
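A quick parameter sweep to find that optimum, reusing imgr, imgr2 and the crop offsets from above, might look like:

import numpy as np
from skimage import transform

for scale in (0.05, 0.1, 0.2):          # 20x, 10x and 5x downsampling
    m = max(1, int(200 * scale))        # edge margin, scaled to match the 20-pixel margin in the 10x case
    a = transform.rescale(imgr[:-10, :-6], scale)[m:-m, m:-m] > 0.5
    b = transform.rescale(imgr2[10:, 6:], scale)[m:-m, m:-m] > 0.5
    print(scale, a.size, np.count_nonzero(a != b) / a.size)   # bits vs. error-rate trade-off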
We can also probably do better at the quantization step and preserve more information reliably. The aggressive quantization I used will only work for reasonably balanced photos; an over-exposed picture would come out as a single value. We could add preprocessing to handle this scenario.
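For instance (my suggestion, not something tested here), an adaptive cut-off such as Otsu's threshold would keep the quantization meaningful even for badly exposed scans:

from skimage.filters import threshold_otsu

# Pick the cut-off from each image's own histogram instead of a fixed 0.5,
# so an over- or under-exposed scan still splits into two roughly balanced populations.
bin1 = tr1 > threshold_otsu(tr1)
bin2 = tr2 > threshold_otsu(tr2)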