
OpenStack Yoga - compute services losing RabbitMQ connection during VM delete


We are in the process of deploying a small OpenStack Yoga cluster on CentOS 9 and HP Gen9 servers (a single non-HA controller and about 10 compute nodes; currently only one compute node is attached).

In general our current 1 controller + 1 compute setup behaves well (creating new volumes, ports and VMs and connecting them all together works fine; the OpenStack Dashboard is accessible, no issues there).

We do have a problem during VM deletion, though: the nova/cinder/neutron services on the compute node seem to lose their connection to RabbitMQ on the controller. This causes issues with properly deleting all VM-related resources, including the VM itself, which ends up in the Error state.
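In case it is relevant, a quick way to check the messaging heartbeat/timeout options on the compute node (option names are from the oslo.messaging documentation; this is only a sketch, and options that are unset fall back to the defaults of a 60 s heartbeat timeout and a 60 s RPC response timeout):

[root@host-11 ~]# grep -E 'heartbeat_timeout_threshold|heartbeat_rate|rpc_response_timeout' /etc/nova/nova.conf /etc/neutron/neutron.conf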

Below are some log samples from the instance deletion: [instance: f664dbae-9053-457a-b233-4d7348b5abc6, ~2022-10-11 10:40]

Do you have any suggestions on how to deal with this issue?

thanks, Wojtek PL

LOGS - RabbitMQ logs on the controller (host-3) during cnSBC AIO VM deletion
(135.210.211.57 = host-11, compute; 135.210.211.73 = host-3, controller):

[root@host-3 /var/log/rabbitmq]# grep error rabbit@host-3.log

2022-10-11 10:40:45.270606+02:00 [error] <0.25259.4> closing AMQP connection <0.25259.4> (135.210.211.57:50872 -> 135.210.211.73:5672 - neutron-openvswitch-agent:215678:8000e9b1-dc06-46d2-b8bf-9267d42af525):
2022-10-11 10:40:45.270606+02:00 [error] <0.25259.4> {inet_error,etimedout}
2022-10-11 10:40:45.334723+02:00 [error] <0.25280.4> closing AMQP connection <0.25280.4> (135.210.211.57:50886 -> 135.210.211.73:5672 - neutron-openvswitch-agent:215678:dd2031af-6622-4c2e-b051-afae4884b92a):
2022-10-11 10:40:45.334723+02:00 [error] <0.25280.4> {inet_error,etimedout}
2022-10-11 10:40:45.398605+02:00 [error] <0.25322.4> closing AMQP connection <0.25322.4> (135.210.211.57:50894 -> 135.210.211.73:5672 - neutron-openvswitch-agent:215678:f79fbbf2-9898-470d-ad4e-f5cdba818277):
2022-10-11 10:40:45.398605+02:00 [error] <0.25322.4> {inet_error,etimedout}
2022-10-11 10:40:45.526524+02:00 [error] <0.31938.6> closing AMQP connection <0.31938.6> (135.210.211.57:56004 -> 135.210.211.73:5672 - neutron-openvswitch-agent:215678:129f82c8-1ede-4432-9487-cf8a52f33be7):
2022-10-11 10:40:45.526524+02:00 [error] <0.31938.6> {inet_error,etimedout}
2022-10-11 10:40:45.846559+02:00 [error] <0.32092.6> closing AMQP connection <0.32092.6> (135.210.211.57:32912 -> 135.210.211.73:5672 - neutron-sriov-nic-agent:215662:debfc1e1-a3b7-487e-9db7-0cbb7771265d):
2022-10-11 10:40:45.846559+02:00 [error] <0.32092.6> {inet_error,etimedout}
2022-10-11 10:40:46.550629+02:00 [error] <0.31764.6> closing AMQP connection <0.31764.6> (135.210.211.57:44946 -> 135.210.211.73:5672 - nova-compute:1948:315b488e-4e39-4afa-963b-dcd8749a20f4):
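If it helps, I can also capture the broker-side view while a delete is running (standard rabbitmqctl queries on the controller; output omitted here):

[root@host-3 ~]# rabbitmqctl list_connections name peer_host state
[root@host-3 ~]# rabbitmqctl list_queues name messages consumers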

LOGS - Nova logs on the compute node (host-11) during cnSBC AIO VM deletion:

[root@host-11 /var/log/nova]# tail nova-compute.log

2022-10-11 10:40:40.735 1948 INFO nova.virt.libvirt.driver [req-ba7e35f9-47a4-4cbe-866b-443b7c4a72e7 02f14854c6814feba0b29f6d56f48bbd 6a14bb3901cd4f74a6adc2983be0a52e - default default] [instance: f664dbae-9053-457a-b233-4d7348b5abc6] Deletion of /var/lib/nova/instances/f664dbae-9053-457a-b233-4d7348b5abc6_del complete
2022-10-11 10:40:52.738 1948 INFO nova.compute.manager [-] [instance: f664dbae-9053-457a-b233-4d7348b5abc6] VM Stopped (Lifecycle Event)
2022-10-11 10:41:02.465 1948 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: Server unexpectedly closed connection
2022-10-11 10:41:40.498 1948 ERROR oslo_service.periodic_task [req-30cbba9a-b129-45a9-a0e2-ab1b8780a51d - - - - -] Error during ComputeManager._sync_scheduler_instance_info: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 76875df5592a4048878f4111908f65b5
2022-10-11 10:41:40.498 1948 ERROR oslo_service.periodic_task Traceback (most recent call last):
2022-10-11 10:41:40.498 1948 ERROR oslo_service.periodic_task   File "/usr/lib/python3.9/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 441, in get
2022-10-11 10:41:40.498 1948 ERROR oslo_service.periodic_task     return self._queues[msg_id].get(block=True, timeout=timeout)
2022-10-11 10:41:40.498 1948 ERROR oslo_service.periodic_task   File "/usr/lib/python3.9/site-packages/eventlet/queue.py", line 322, in get
2022-10-11 10:41:40.498 1948 ERROR oslo_service.periodic_task     return waiter.wait()
2022-10-11 10:41:40.498 1948 ERROR oslo_service.periodic_task   File "/usr/lib/python3.9/site-packages/eventlet/queue.py", line 141, in wait
2022-10-11 10:41:40.498 1948 ERROR oslo_service.periodic_task     return get_hub().switch()
2022-10-11 10:41:40.498 1948 ERROR oslo_service.periodic_task   File "/usr/lib/python3.9/site-packages/eventlet/hubs/hub.py", line 313, in switch
2022-10-11 10:41:40.498 1948 ERROR oslo_service.periodic_task     return self.greenlet.switch()
2022-10-11 10:41:40.498 1948 ERROR oslo_service.periodic_task _queue.Empty
[...]
2022-10-11 10:41:40.498 1948 ERROR oslo_service.periodic_task
2022-10-11 10:41:41.437 1948 WARNING nova.servicegroup.drivers.db [-] Lost connection to nova-conductor for reporting service status.: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 97d6d1842ade42a28b4aa6f01eb331e6
2022-10-11 10:41:41.437 1948 WARNING oslo.service.loopingcall [-] Function 'nova.servicegroup.drivers.db.DbDriver._report_state' run outlasted interval by 50.00 sec
2022-10-11 10:41:47.467 1948 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: Server unexpectedly closed connection
[...]
2022-10-11 10:41:52.741 1948 ERROR nova.virt.driver [req-ffced6ba-33ca-4573-9e1c-436097b7cd17 - - - - -] Exception dispatching event <LifecycleEvent: 1665477637.737095, f664dbae-9053-457a-b233-4d7348b5abc6 => Stopped>: Timed out waiting for a reply to message ID 0484d7b9c3e946f0a6b77ad2df3021c2: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 0484d7b9c3e946f0a6b77ad2df3021c2
2022-10-11 10:42:27.327 1948 INFO oslo_messaging._drivers.amqpdriver [-] No calling threads waiting for msg_id : 76875df5592a4048878f4111908f65b5
2022-10-11 10:42:27.328 1948 INFO oslo_messaging._drivers.amqpdriver [-] No calling threads waiting for msg_id : e83baa1bf7a6486b94739c2fd8407182
2022-10-11 10:42:27.329 1948 INFO oslo_messaging._drivers.amqpdriver [-] No calling threads waiting for msg_id : 97d6d1842ade42a28b4aa6f01eb331e6
2022-10-11 10:42:27.330 1948 INFO oslo_messaging._drivers.amqpdriver [-] No calling threads waiting for msg_id : 0484d7b9c3e946f0a6b77ad2df3021c2
2022-10-11 10:42:27.333 1948 INFO nova.servicegroup.drivers.db [-] Recovered from being unable to report status.
2022-10-11 10:42:27.333 1948 WARNING oslo.service.loopingcall [-] Function 'nova.servicegroup.drivers.db.DbDriver._report_state' run outlasted interval by 35.90 sec
2022-10-11 10:42:27.351 1948 INFO nova.compute.manager [-] [instance: f664dbae-9053-457a-b233-4d7348b5abc6] Took 46.61 seconds to deallocate network for instance.
2022-10-11 10:42:27.352 1948 ERROR nova.compute.manager [req-ba7e35f9-47a4-4cbe-866b-443b7c4a72e7 02f14854c6814feba0b29f6d56f48bbd 6a14bb3901cd4f74a6adc2983be0a52e - default default] [instance: f664dbae-9053-457a-b233-4d7348b5abc6] Setting instance vm_state to ERROR: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID e83baa1bf7a6486b94739c2fd8407182
I don't think the RabbitMQ error messages you pasted are the actual root cause here; I see them in our clusters as well, and they just seem to be reconnects after a while. Your nova-compute.log states that it couldn't reach nova-conductor: `Lost connection to nova-conductor for reporting service status`. What does the nova-conductor.log say?
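For example (assuming a standard RDO-style layout on the controller, so the log path and unit name below are assumptions):

[root@host-3 ~]# grep -iE 'error|timed out' /var/log/nova/nova-conductor.log | tail -n 50
[root@host-3 ~]# systemctl status openstack-nova-conductor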