Score:2

Failed instance in Google Compute Engine

in flag

I have a GCE instance which has been running for several years. During the night, the instance was restarted with the following logs:

2022-02-13 04:46:36.370 CET compute.instances.hostError Instance terminated by Compute Engine.
2022-02-13 04:47:08.279 CET compute.instances.automaticRestart Instance automatically restarted by Compute Engine.

However, the instance did not come back up properly.

I can connect to the serial console where I see this:

serialport: Connected to ***.europe-west1-b.*** port 1 (
[ TIME ] Timed out waiting for device ***
[DEPEND] Dependency failed for File… ***.
[DEPEND] Dependency failed for /data.
[DEPEND] Dependency failed for Local File Systems.
[  OK  ] Stopped Dispatch Password …ts to Console Directory Watch.
[  OK  ] Stopped Forward Password R…uests to Wall Directory Watch.
[  OK  ] Reached target Timers.
         Starting Raise network interfaces...
[  OK  ] Closed Syslog Socket.
[  OK  ] Reached target Login Prompts.
[  OK  ] Reached target Paths.
[  OK  ] Reached target Sockets.
[  OK  ] Started Emergency Shell.
[  OK  ] Reached target Emergency Mode.
         Starting Create Volatile Files and Directories...
[  OK  ] Finished Create Volatile Files and Directories.
         Starting Network Time Synchronization...
         Starting Update UTMP about System Boot/Shutdown...
[  OK  ] Finished Update UTMP about System Boot/Shutdown.
         Starting Update UTMP about System Runlevel Changes...
[  OK  ] Finished Update UTMP about System Runlevel Changes.
[  OK  ] Started Network Time Synchronization.
[  OK  ] Reached target System Time Set.
[  OK  ] Reached target System Time Synchronized.
         Stopping Network Time Synchronization...
[  OK  ] Stopped Network Time Synchronization.
         Starting Network Time Synchronization...
[  OK  ] Started Network Time Synchronization.
[  OK  ] Finished Raise network interfaces.
[  OK  ] Reached target Network.
[  OK  ] Reached target Network is Online.
You are in emergency mode. After logging in, type "journalctl -xb" to view
system logs, "systemctl reboot" to r
Cannot open access to console, the root account is locked.
See sulogin(8) man page for more details.
Press Enter to continue.

It seems that one of the disks cannot be attached, but what can I do about it now? The disk still appears to be available as normal within Compute Engine.
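
For what it's worth, the disk can still be seen from outside the VM with something like the following (instance and disk names are placeholders):

gcloud compute instances describe MY_INSTANCE --zone europe-west1-b --format="yaml(disks)"   # disks attached to the instance
gcloud compute disks list                                                                    # confirm the data disk still exists in the project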

John Hanley
cn flag
My guess is that there is a temporary Google Cloud VPC network issue. Try rebooting the instance. If you continue to have a problem, edit your question with details about the instance and its GCP configuration.
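For example, a reboot can be forced from the CLI with something like this (instance name and zone are placeholders):

gcloud compute instances reset INSTANCE_NAME --zone europe-west1-b    # hard reset on the same host
gcloud compute instances stop INSTANCE_NAME --zone europe-west1-b     # or a full stop/start cycle,
gcloud compute instances start INSTANCE_NAME --zone europe-west1-b    # which usually lands the VM on a different host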
in flag
Thanks for your response. The instance still does not start correctly. What kind of details would be helpful? The instance is `e2-small` running in `europe-west1-b` with two disks: one regular boot disk and one SSD disk, which is the one failing to attach.
John Hanley
cn flag
I recommend opening a Google Cloud Support ticket.
PjoterS
ve flag
Did you have any issue with billing? What image is this VM running? Did you change the machine type lately? Are you using a Persistent SSD or a Local SSD? Can you create another VM without any issue? Did you try to execute `journalctl -xb` and `systemctl reboot`?
in flag
The instance cannot start properly - it gets stuck on "Press Enter to continue", and then nothing happens, so I cannot try journalctl. On restart, it hangs with the same disk timeout as above. I don't have any billing issues; everything else is still running correctly. I did not even touch the machine recently, it just died during the night. The disks are persistent disks.
in flag
I reported it to GCE support, but so far they have not been very helpful, and I have now been waiting 18 hours since their last answer.
Score:2
ve flag

I am afraid that you cannot do anything with this affected VM.

In the Host Events documentation or FAQ you can find this information:

A host error (compute.instances.hostError) means that there was a hardware or software issue on the physical machine hosting your VM that caused your VM to crash. A host error which involves total hardware failure or other hardware issues might prevent live migration of your VM.

A VM instance, even though it is in the "Cloud", still runs on a physical machine hosting your workload. Unfortunately, that host had a hardware or software failure, and there is nothing you can do about it.

GCP introduced something called live migration, which helps prevent this kind of situation.

Compute Engine offers live migration to keep your virtual machine instances running even when a host system event, such as a software or hardware update, occurs. However, I guess it's too late to configure this one; the relevant availability policy can be checked and adjusted with the commands shown after the excerpt below.

...

Live migration keeps your instances running during:

  • Regular infrastructure maintenance and upgrades.
  • Network and power grid maintenance in the data centers.
  • Failed hardware such as memory, CPU, network interface cards, disks, power, and so on. This is done on a best-effort basis; if hardware fails completely or otherwise prevents live migration, the VM crashes and restarts automatically and a hostError is logged.

...

Live migration does not change any attributes or properties of the VM itself. The live migration process just transfers a running VM from one host machine to another host machine within the same zone.
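
If you want to check or adjust these availability policies on a VM, something like this should work with gcloud (instance name and zone are placeholders):

gcloud compute instances describe INSTANCE_NAME --zone europe-west1-b --format="yaml(scheduling)"   # current onHostMaintenance / automaticRestart settings
gcloud compute instances set-scheduling INSTANCE_NAME --zone europe-west1-b --maintenance-policy MIGRATE --restart-on-failure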

Possible Workaround

As you mention that the disks are persistent and still visible in GCP, you could try to reattach them to another VM; a rough sketch is shown below. A how-to guide can be found in the Creating and attaching a disk documentation.
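
For example, with gcloud it could look roughly like this (instance, disk and device names are placeholders, and the device path inside the rescue VM may differ - check with lsblk):

gcloud compute instances detach-disk BROKEN_INSTANCE --disk DATA_DISK --zone europe-west1-b
gcloud compute instances attach-disk RESCUE_INSTANCE --disk DATA_DISK --zone europe-west1-b --device-name DATA_DISK
# then, inside the rescue VM:
sudo lsblk                                  # find the new device, e.g. /dev/sdb
sudo mkdir -p /mnt/recovery
sudo mount -o ro /dev/sdb /mnt/recovery     # read-only is safest if you only need to copy the data off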

Score:1
in flag

I finally found the strange reason for this error - see the original /etc/fstab:

/dev/disk/by-id/google-***-data /data ext4 discard,defaults 0 2

But there is no such device at this path. I solved it by mounting /dev/sdb instead, but I guess this is not the best solution. I wonder how it happens that the device suddenly disappears completely and in the end kills the machine.
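
A more resilient approach might be to reference the filesystem by UUID (printed by `sudo blkid /dev/sdb`) and add nofail so that a missing disk no longer drops the boot into emergency mode; the UUID below is a placeholder:

UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx /data ext4 discard,defaults,nofail,x-systemd.device-timeout=30s 0 2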
