I did not knew a technique that poncho proposes. It's considering integers modulo $n^2$ as integers of two superdigits in base $n$, saving much work (much like we can compute a product of two-decimal-digits integers modulo 100 with only three uses of a multiplication table).
With input and output in this two-superdigits form (as is the case inside modular exponentiation), the technique saves asymptotically $3/8$ (37%) of the work when using quadratic algorithms. A more relevant comparison is with Karatsuba multiplication at the top level, and then it saves $1/6$ (17%), with the additional benefit of lower overhead, since neither shifts nor signed intermediate value is involved. That's definitely worthwhile during modular exponentiation!
Other optimization techniques for multiplication (or exponentiation) modulo $n^2$ where $n$ is a large integer with no trivial factor (or $n$ prime), are those for multiplication (or exponentiation) modulo $n'$ with no trivial factor (or $n'$ prime), with $n'\approx n^2$. The later optimizations are well studied in RSA. However, the larger moduli in Paillier at equal security shift things towards making subquadratic algorithms like Karatsuba multiplication more attractive.
In Paillier encryption, the modulus $n^2$ has twice as many bits as $n$ in RSA, and the exponent is several hundred times as large as usual for the public-key side of things in RSA encryption. Thus Pailler encryption (and reencryption) can be thousands times slower than RSA encryption. But that's not as bad as it sounds because RSA encryption is typically hundreds time faster than RSA decryption.
For Pailler decryption, we can use the same time-saving technique as in RSA decryption: the Chinese Remainder Theorem. With $n$ the product of two primes $p$ and $q$, that will take the ciphertext modulo $p^2$ and deduce the plaintext modulo $p$, same for the plaintext modulo $q$, and then reconstruct the plaintext. That makes Pailler decryption very roughly 5 (typically at most 8) times slower than RSA decryption for the same $n$, at comparable optimization effort. The aforementioned "two-superdigits" technique also applies for decryption with CRT, on top of the numbers above.