From the literature, syndrome-trellis codes (STC) seem to be the current state of the art for the coding part of steganography. From the description of the method, it appears to me that it could be parallelized on a GPU. Is developing a GPU implementation of STC even worth it? Are there any existing GPU implementations of STC?

The existing implementations of STC I know of are

This is (a) too open-ended and broad; (b) arguably not on topic, and possibly opinion-based, since you say "does [...] make sense?" On the basis of what performance measure? Provide more details instead of links.
I am currently working on a vectorized version using NumPy. Although some details are still missing, it works, so I have no doubt that STC can be implemented on a GPU. On the other hand, as far as I know, there is no public GPU implementation.
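
To illustrate why the method vectorizes well: in the Viterbi forward pass over the syndrome trellis, each of the 2^h states is updated independently within one column, so the update is an elementwise operation over a cost array. The following is a minimal sketch of that single column step, not the implementation mentioned above; the function name `trellis_column_step`, its argument layout, and the packing of a column into an integer are assumptions made for illustration. Block boundaries (pruning by the message bit and shifting the state) and the backward pass are omitted.

```python
import numpy as np

def trellis_column_step(cost, col, rho0, rho1):
    """Advance all 2**h trellis states across one column (hypothetical helper).

    cost : float array of length 2**h, cheapest path cost into each state so far
    col  : int, current column of the parity-check submatrix packed as h bits
    rho0 : cost of setting the stego bit to 0 at this position
    rho1 : cost of setting the stego bit to 1 at this position
    """
    states = np.arange(cost.size)
    stay = cost + rho0                # stego bit 0: partial syndrome unchanged
    flip = cost[states ^ col] + rho1  # stego bit 1: predecessor is state XOR col
    took_one = flip < stay            # kept so a backward pass can recover bits
    return np.where(took_one, flip, stay), took_one

# Usage: h = 2, only the all-zero state is reachable at the start.
cost = np.full(4, np.inf)
cost[0] = 0.0
new_cost, took_one = trellis_column_step(cost, col=0b11, rho0=0.0, rho1=1.0)
# new_cost is [0., inf, inf, 1.]: state 0b11 is reached by choosing bit 1.
```

Every line inside the function acts on the whole state array at once, which is the part that would map to one GPU thread per state. The columns themselves still have to be processed in order, so the available parallelism per step is bounded by 2^h.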
