In short, the simulator has extra power that the real prover doesn't have.
Suppose Alice wants to prove to me that she is a good sharpshooter. I paint a target on the side of a barn, and make her stand 100m away and shoot it. She hits the bullseye of the target, and I am convinced that she is an excellent sharpshooter.
The "transcript" of this protocol is the permanent record that I take away from the interaction. In this case, it's the side of a barn with a target painted on it, and a bullet hole in the bullseye of the target.
This "protocol" is zero-knowledge because I could have generated the transcript myself. I could have shot a hole in the side of the barn from close range, and then painted a target centered on the hole! When I'm doing this ("simulating" a transcript), I have more power than Alice did during the protocol. I can generate the pieces of the transcript in a different order. I can shoot the barn from closer range than she did.
In cryptographic protocols, the simulator always has more power than the real prover. Sometimes the simulator can generate the parts of the transcript in a different order. Sometimes the simulator can "rewind time": the verifier asks a question, and then we rewind time and start the transcript over, knowing what the verifier is going to ask. Sometimes the simulator literally has more computational power than the real prover. Sometimes the simulator has some extra information that the real prover doesn't have (like a trapdoor to some common reference information used in the proof).
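To make the "different order" trick concrete, here is a minimal sketch using the Schnorr identification protocol. That protocol is my choice of illustration and is not mentioned above. The parameters are toy-sized and insecure, and the function names are hypothetical. The real prover must commit first and needs the secret x to answer the challenge. The simulator picks the challenge and response first and then solves for the commitment (shooting the barn, then painting the target), and never touches x. This only simulates an honest verifier who picks challenges at random.

```python
import secrets

# Toy parameters, far too small for real use: g = 2 has prime order q = 11 mod p = 23.
P, Q, G = 23, 11, 2

def verify(y, transcript):
    """Accept a transcript (a, c, z) for public key y if g^z == a * y^c (mod p)."""
    a, c, z = transcript
    return pow(G, z, P) == (a * pow(y, c, P)) % P

def real_prover_transcript(x, y):
    """Honest run: commitment first, then challenge, then a response that uses x."""
    r = secrets.randbelow(Q)
    a = pow(G, r, P)          # commitment, fixed before the challenge is known
    c = secrets.randbelow(Q)  # stands in for the honest verifier's random challenge
    z = (r + c * x) % Q       # response, computable only with the secret x
    return (a, c, z)

def simulated_transcript(y):
    """Simulator: choose c and z first, then solve for a. The secret x is never used."""
    c = secrets.randbelow(Q)
    z = secrets.randbelow(Q)
    # a = g^z * y^(-c); since y has order q, y^(-c) equals y^(q - c).
    a = (pow(G, z, P) * pow(y, (Q - c) % Q, P)) % P
    return (a, c, z)

x = 7             # Alice's secret (hypothetical value for the demo)
y = pow(G, x, P)  # public key

assert verify(y, real_prover_transcript(x, y))
assert verify(y, simulated_transcript(y))
```

Both kinds of transcript pass the same check, so a transcript on its own tells a third party nothing about whether the secret was ever involved.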