Related - but not the same: Azure Cloud Service Upgrade Domain server restart interval
I have a cloud service (extended support) in Azure with two instances of a role, in an availability set, with different Update Domain values (0 and 1), as can be seen in this screenshot.
When a deployment runs, or when the VMs are being updated by Azure (e.g. Windows updates), I expect the VM in Update Domain 0 to finish updating and return to the "Started" state before the VM in Update Domain 1 starts its update. Unfortunately, this is not the case.
Each VM startup takes a significant amount of time (around an hour) because it has to install custom software, copy data, start the relevant services, etc. What I'm seeing is that the instance in Update Domain 0 is updated first: it goes into "Starting" status while the other instance stays in "Started". Then, after about 30-40 minutes, while the instance in Update Domain 0 is still in "Starting" state, the second instance starts its update process. Both VMs end up in "Starting" state, and the service is down for up to 20 minutes or so until the first instance completes the update.
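To make the timing concrete, here is a rough Python sketch of the overlap I'm describing. The numbers are my approximate observations (about 60 minutes for a VM to start, second instance kicked off after about 40 minutes), and the function name is purely illustrative; it does not come from any Azure API.

```python
def overlap_minutes(startup_minutes: int, next_start_offset: int) -> int:
    """Minutes during which both instances are in "Starting" state.

    Assumes instance 0 begins updating at t=0 and needs startup_minutes
    to reach "Started", and instance 1 begins updating at
    t=next_start_offset whether or not instance 0 has finished.
    Both instances are assumed to take the same time to start.
    """
    return max(0, startup_minutes - next_start_offset)


# Approximate observed values, not exact measurements.
print(overlap_minutes(60, 40))  # 20
```

If the fabric actually waited for instance 0 to reach "Started", the second argument would be at least as large as the first and the overlap would be zero.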
I have another similar cloud service with 3 instances, and I'm seeing the exact same behaviour there. There's a period of about 5 minutes when all 3 instances are in "Starting" status, even though they are in 3 different Update Domains.
Am I missing something, or is this a bug of some sort in the Azure fabric? Maybe there's a hard limit on how long the fabric will wait for one instance before proceeding with the next?
Here's a screenshot of both instances being updated at the same time, even though they are in different Update Domains: