There is a simple way to do this.
Specifically, you have already mentioned you have a bitmask extraction procedure.
So, given $p(x)$, its bitmask extraction $p_2(x)$, and $p_2'(x)$ (the result of your homomorphic operations applied to $p_2$), let $p_3(x)$ be the result of applying the same bitmask to $p_2'$.
Then, it is straightforward to verify that
$$p(x) - p_2(x) + p_3(x)$$
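gives you the result you want: subtracting $p_2$ clears the selected slot of $p$, and adding $p_3$ writes the new value there while leaving the other slots untouched.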
This reduces everything to two "bitmask extraction" procedures, the homomorphic evaluation (which seems unavoidable), and a few additions (which should be cheap).
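As a sanity check, here is a minimal plaintext sketch of the identity above. It is not an FHE implementation: ciphertexts are modelled as plain slot vectors with slot-wise arithmetic, and the modulus `T`, the helper names, and the example function `square_plus_one` are all assumptions made for illustration.

```python
# Plaintext model only: each "ciphertext" is a list of slot values mod T.
T = 257  # hypothetical plaintext modulus

def add(a, b):
    return [(x + y) % T for x, y in zip(a, b)]

def sub(a, b):
    return [(x - y) % T for x, y in zip(a, b)]

def mul(a, b):
    return [(x * y) % T for x, y in zip(a, b)]

def extract(p, mask):
    # Bitmask extraction with a public 0/1 mask: one slot-wise multiplication.
    return mul(p, mask)

def replace_slot(p, mask, f):
    p2 = extract(p, mask)          # first bitmask extraction
    p2_prime = f(p2)               # stand-in for the homomorphic evaluation
    p3 = extract(p2_prime, mask)   # second bitmask extraction
    return add(sub(p, p2), p3)     # p - p_2 + p_3

def square_plus_one(v):
    # Hypothetical function; note that it "pollutes" the unselected slots.
    return add(mul(v, v), [1] * len(v))

p = [5, 7, 11, 13]
mask = [0, 0, 1, 0]
result = replace_slot(p, mask, square_plus_one)
print(result)
```

Only the slot selected by `mask` changes; the second extraction is what discards the junk that `square_plus_one` writes into the other slots.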
There are then natural questions:
- Can one bitmask extraction suffice?
- How can one efficiently apply bitmask extraction?
If the slot you want to compute on is public, it should suffice to multiply by a suitable constant polynomial (with 0/1 coefficients), so each bitmask extraction costs a single multiplication by a public constant.
Private bitmasks seem less efficient. I can think of something that uses $O(n)$ multiplications (but at least has depth 1): for each index, multiply by an (encrypted) 0/1 boolean to "select" the right indices, and then add everything up at the end.
I do not know how the above compares to state-of-the-art though.
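One possible reading of the private-mask idea, again as a plaintext sketch rather than real FHE code: each index gets its own encrypted selector bit, modelled here as a constant slot vector, and the slot is first isolated with a public unit mask. The names `private_extract` and `enc_bits`, the modulus `T`, and the way the selector bits are encoded are assumptions, not something stated in the question.

```python
# Plaintext model only: each "ciphertext" is a list of slot values mod T.
T = 257  # hypothetical plaintext modulus

def add(a, b):
    return [(x + y) % T for x, y in zip(a, b)]

def mul(a, b):
    return [(x * y) % T for x, y in zip(a, b)]

def private_extract(p, enc_bits):
    """Select slots of p using one (modelled) encrypted bit per index.

    Returns the masked vector and the number of ciphertext-ciphertext
    multiplications used, which is n with multiplicative depth 1.
    """
    n = len(p)
    acc = [0] * n
    ct_mults = 0
    for i, bit_ct in enumerate(enc_bits):
        unit = [1 if j == i else 0 for j in range(n)]
        isolated = mul(p, unit)            # public mask: isolate slot i
        selected = mul(isolated, bit_ct)   # encrypted bit times ciphertext
        ct_mults += 1
        acc = add(acc, selected)           # additions are cheap
    return acc, ct_mults

p = [5, 7, 11, 13]
secret_bits = [0, 1, 1, 0]
# Assumption: each selector bit is encrypted as a constant vector.
enc_bits = [[b] * len(p) for b in secret_bits]
masked, ct_mults = private_extract(p, enc_bits)
print(masked, ct_mults)
```

All `n` products are taken directly against `p`, so the depth stays at 1 even though the number of multiplications grows linearly.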
It's also worth mentioning that if your operations on $p_2(x)$ do not depend on the other coordinates $\phi(a_i)$ (but may simply "overwrite" them), one can drop the first of the two bitmask extractions.
Whether this applies depends, of course, on the particular function you are evaluating.
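One way this saving could be realised, assuming the mask is applied by multiplication (and is therefore linear), is to evaluate the function on $p$ directly and mask the difference, since then $p_3 - p_2$ equals the mask applied to $f(p) - p$. The sketch below is a plaintext model under that assumption; `T`, `replace_slot_one_mask`, and `square_plus_one` are hypothetical names.

```python
# Plaintext model only: each "ciphertext" is a list of slot values mod T.
T = 257  # hypothetical plaintext modulus

def add(a, b):
    return [(x + y) % T for x, y in zip(a, b)]

def sub(a, b):
    return [(x - y) % T for x, y in zip(a, b)]

def mul(a, b):
    return [(x * y) % T for x, y in zip(a, b)]

def replace_slot_one_mask(p, mask, f):
    # Valid only when f acts on each slot independently of the others.
    fp = f(p)                          # evaluate on p itself, no prior masking
    delta = mul(sub(fp, p), mask)      # the single bitmask extraction
    return add(p, delta)               # p + mask * (f(p) - p)

def square_plus_one(v):
    # Hypothetical slot-wise function.
    return add(mul(v, v), [1] * len(v))

p = [5, 7, 11, 13]
mask = [0, 0, 1, 0]
result_one_mask = replace_slot_one_mask(p, mask, square_plus_one)
print(result_one_mask)
```

If the function mixes slots (for example through rotations), its value in the selected slot would depend on the unmasked neighbours, and the two-extraction version is needed instead.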