Ubuntu

Ubuntu 20.04 LTS NetworkManager.service fails to start

Bobchuck

10/28/22, 1:27 AM

Several months ago I began having issues with NetworkManager.service failing to start, which left me with no internet connection at all. Ubuntu would show error popups for the failed service, but restarting the computer would get it running again, and it didn't happen too often. It then became more frequent, and a single restart was no longer enough, so it took several attempts to get the service to start properly. I found someone who said that the command `sudo systemctl restart NetworkManager.service` would get it started again, and for a while this did the trick (though I had to run it almost every time I restarted the computer).

Just today, though, this command stopped working and produced an error, and now I cannot connect to the internet at all from Ubuntu, even after several restarts and shutdowns:

~$ sudo systemctl restart NetworkManager.service
Job for NetworkManager.service failed because a fatal signal was delivered causing the control process to dump core.
See "systemctl status NetworkManager.service" and "journalctl -xe" for details.

Checking the systemctl status of it, I get this:

~$ systemctl status NetworkManager.service
● NetworkManager.service - Network Manager
     Loaded: loaded (/lib/systemd/system/NetworkManager.service; enabled; vendor preset: enabled)
     Active: failed (Result: core-dump) since Sun 2021-06-27 14:40:30 EDT; 2min 9s ago
       Docs: man:NetworkManager(8)
    Process: 3222 ExecStart=/usr/sbin/NetworkManager --no-daemon (code=dumped, signal=BUS)
   Main PID: 3222 (code=dumped, signal=BUS)

Jun 27 14:40:30 user systemd[1]: NetworkManager.service: Scheduled restart job, restart counter is at 5.
Jun 27 14:40:30 user systemd[1]: Stopped Network Manager.
Jun 27 14:40:30 user systemd[1]: NetworkManager.service: Start request repeated too quickly.
Jun 27 14:40:30 user systemd[1]: NetworkManager.service: Failed with result 'core-dump'.
Jun 27 14:40:30 user systemd[1]: Failed to start Network Manager.

As for the `journalctl -xe` output, I have put everything the log gave me at this pastebin link: https://pastebin.com/gTJMktN5 There are a lot of errors similar to the above saying that it failed with a core dump, but here's just one of the blocks that might be relevant:

-- A start job for unit NetworkManager.service has begun execution.
-- 
-- The job identifier is 1897.
Jun 27 14:40:28 user kernel: ata4.00: exception Emask 0x0 SAct 0x200000 SErr 0x0 action 0x0
Jun 27 14:40:28 user kernel: ata4.00: irq_stat 0x40000008
Jun 27 14:40:28 user kernel: ata4.00: failed command: READ FPDMA QUEUED
Jun 27 14:40:28 user kernel: ata4.00: cmd 60/08:a8:70:9a:41/00:00:5a:00:00/40 tag 21 ncq dma 4096 in
                                      res 41/40:00:74:9a:41/00:00:5a:00:00/00 Emask 0x409 (media error) <F>
Jun 27 14:40:28 user kernel: ata4.00: status: { DRDY ERR }
Jun 27 14:40:28 user kernel: ata4.00: error: { UNC }
Jun 27 14:40:28 user kernel: ata4.00: configured for UDMA/133
Jun 27 14:40:28 user kernel: sd 3:0:0:0: [sdb] tag#21 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
Jun 27 14:40:28 user kernel: sd 3:0:0:0: [sdb] tag#21 Sense Key : Medium Error [current] 
Jun 27 14:40:28 user kernel: sd 3:0:0:0: [sdb] tag#21 Add. Sense: Unrecovered read error - auto reallocate failed
Jun 27 14:40:28 user kernel: sd 3:0:0:0: [sdb] tag#21 CDB: Read(10) 28 00 5a 41 9a 70 00 00 08 00
Jun 27 14:40:28 user kernel: blk_update_request: I/O error, dev sdb, sector 1514248820 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Jun 27 14:40:28 user kernel: ata4: EH complete
Jun 27 14:40:28 user kernel: ata4.00: exception Emask 0x0 SAct 0x4000000 SErr 0x0 action 0x0
Jun 27 14:40:28 user kernel: ata4.00: irq_stat 0x40000008
Jun 27 14:40:28 user kernel: ata4.00: failed command: READ FPDMA QUEUED
Jun 27 14:40:28 user kernel: ata4.00: cmd 60/08:d0:70:9a:41/00:00:5a:00:00/40 tag 26 ncq dma 4096 in
                                      res 41/40:00:74:9a:41/00:00:5a:00:00/00 Emask 0x409 (media error) <F>
Jun 27 14:40:28 user kernel: ata4.00: status: { DRDY ERR }
Jun 27 14:40:28 user kernel: ata4.00: error: { UNC }
Jun 27 14:40:28 user kernel: ata4.00: configured for UDMA/133
Jun 27 14:40:28 user kernel: sd 3:0:0:0: [sdb] tag#26 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
Jun 27 14:40:28 user kernel: sd 3:0:0:0: [sdb] tag#26 Sense Key : Medium Error [current] 
Jun 27 14:40:28 user kernel: sd 3:0:0:0: [sdb] tag#26 Add. Sense: Unrecovered read error - auto reallocate failed
Jun 27 14:40:28 user kernel: sd 3:0:0:0: [sdb] tag#26 CDB: Read(10) 28 00 5a 41 9a 70 00 00 08 00
Jun 27 14:40:28 user kernel: blk_update_request: I/O error, dev sdb, sector 1514248820 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Jun 27 14:40:28 user kernel: ata4: EH complete
Jun 27 14:40:28 user systemd[1]: NetworkManager.service: Main process exited, code=dumped, status=7/BUS
-- Subject: Unit process exited
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
-- 
-- An ExecStart= process belonging to unit NetworkManager.service has exited.

I've seen similar posts with replies saying to update the kernel version, among other things, but I'm already running the latest kernel available for 20.04 LTS, and I wouldn't expect to have to deviate much from it.

I'm running Ubuntu 20.04.2 LTS x86_64 with the kernel:

~$ uname -a
Linux user 5.8.0-59-generic #66~20.04.1-Ubuntu SMP Thu Jun 17 11:14:10 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

I also started experiencing frequent error popups for services that were never failing before today while I was collecting these logs. They were for the following services:

/usr/libexec/colord
/usr/libexec/tracker-extract
/usr/libexec/tracker-miner-fs
/usr/lib/packagekit/packagekitd

I don't know if they're related, but considering they started at the same time the restart command I was using stopped working, it seems likely there is a bigger issue. On top of these, restarting and shutting down the computer produces pages of errors scrolling too fast for me to read them during the shutdown sequence.

Any help towards debugging or finding a workaround would be appreciated.

Edits:

Here is the output of grep -i FPDMA /var/log/syslog*: https://pastebin.com/tazDug7H

Here is the output of dmesg. There were a few I/O errors in this one. For the record, the installation drive is /dev/sdb: https://pastebin.com/ctefUjUA

The output of fsck on the install drive:

~$ sudo fsck -f /dev/sdb2
fsck from util-linux 2.34
e2fsck 1.45.5 (07-Jan-2020)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/sdb2: 635347/61022208 files (1.4% non-contiguous), 29081215/244059648 blocks

screenshot of SMART test

network-manager

networking

20.04

heynnema

10/28/22, 3:03 AM

NCQ

You're having disk NCQ errors...

Jun 27 14:40:28 user kernel: ata4.00: failed command: READ FPDMA QUEUED
Jun 27 14:40:28 user kernel: ata4.00: cmd 60/08:a8:70:9a:41/00:00:5a:00:00/40 tag 21 ncq dma 4096 in
                                      res 41/40:00:74:9a:41/00:00:5a:00:00/00 Emask 0x409 (media error) <F>
Jun 27 14:40:28 user kernel: ata4.00: status: { DRDY ERR }
Jun 27 14:40:28 user kernel: ata4.00: error: { UNC }

Native Command Queuing (NCQ) is an extension of the Serial ATA protocol allowing hard disk drives to internally optimize the order in which received read and write commands are executed.

Open the GRUB defaults file with `sudo -H gedit /etc/default/grub` and change the following line to include the extra parameter `libata.force=noncq`. Then run `sudo update-grub` to write the changes to disk, and reboot. Monitor for hangs, etc., and watch `grep -i FPDMA /var/log/syslog*` or `dmesg` for continued error messages.

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash libata.force=noncq"
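After the reboot it can help to confirm that the parameter actually reached the running kernel and to count any new FPDMA messages. The following is only a minimal sketch: `has_noncq` and `count_fpdma` are hypothetical helper names, and the match assumes the parameter was written exactly as `libata.force=noncq` (as in the line above), not in a combined form.

```shell
#!/bin/sh
# has_noncq and count_fpdma are hypothetical helper names used for illustration.

# has_noncq CMDLINE: succeed if CMDLINE contains the exact parameter
# libata.force=noncq as its own space-separated word.
has_noncq() {
    case " $1 " in
        *" libata.force=noncq "*) return 0 ;;
        *) return 1 ;;
    esac
}

# count_fpdma: read log lines on stdin and print how many mention FPDMA.
# "|| true" keeps the exit status clean when the count is zero.
count_fpdma() {
    grep -ci 'fpdma' || true
}

# Report on the running kernel when /proc/cmdline is readable.
if [ -r /proc/cmdline ]; then
    if has_noncq "$(cat /proc/cmdline)"; then
        echo "libata.force=noncq is active"
    else
        echo "libata.force=noncq is NOT on the running kernel command line"
    fi
fi
```

To limit the count to the current boot, something like `journalctl -k -b | count_fpdma` or `dmesg | count_fpdma` should work; a non-zero count after the reboot suggests the errors are not caused by NCQ alone.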

fsck

boot to an Ubuntu Live DVD/USB in “Try Ubuntu” mode
open a terminal window by pressing Ctrl+Alt+T
type `sudo fdisk -l`
identify the /dev/sdXX device name for your "Linux Filesystem"
type `sudo fsck -f /dev/sdXX`, replacing sdXX with the device name you found earlier
repeat the `fsck` command if there were errors
type `reboot`
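The reason for booting the live session is that the filesystem should not be mounted while `fsck` repairs it. As a rough sketch of the steps above, assuming `/proc/mounts` lists the partition under its plain `/dev/sdXX` name (`is_mounted` and `check_fs` are hypothetical helper names, not part of any tool):

```shell
#!/bin/sh
# is_mounted DEVICE [MOUNTS_FILE]
# Succeed if DEVICE appears as a mount source in MOUNTS_FILE
# (default: /proc/mounts). Hypothetical helper for illustration only.
is_mounted() {
    awk -v dev="$1" '$1 == dev { found = 1 } END { exit !found }' \
        "${2:-/proc/mounts}"
}

# check_fs DEVICE: run a forced fsck only when DEVICE is not mounted.
check_fs() {
    if is_mounted "$1"; then
        echo "$1 is mounted; boot the live session and try again" >&2
        return 1
    fi
    sudo fsck -f "$1"
}

# Example from the live session (replace sdXX with the real partition):
#   sudo fdisk -l
#   check_fs /dev/sdXX
```

If `check_fs` reports errors that were fixed, run it again until it comes back clean, then reboot.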

SSD

Regarding your SanDisk SSD PLUS 1TB, check for a firmware update. Go to the SanDisk web site and download their Dashboard software. Windows required.

See https://kb.sandisk.com/app/answers/detail/a_id/15108/~/dashboard-support-information

Update #1:

Even though SMART says the SSD is ok, it's not. You have 6146 uncorrectable errors and 21388 uncorrectable ECC errors! Since you already changed the cable and updated the firmware, then the problem is either with the SATA port, or your SSD is bad.
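The same counters can also be read from a terminal instead of the SMART Data window. This is only a sketch, assuming the `smartmontools` package is installed and that the drive reports its attributes in the usual ten-column `smartctl -A` table; attribute names differ between vendors, so the filter simply matches anything containing "uncorrect". `uncorrectable_attrs` is a hypothetical helper name.

```shell
#!/bin/sh
# uncorrectable_attrs: read `smartctl -A` output on stdin and print the
# attribute name and raw value for every attribute whose name contains
# "uncorrect" (case-insensitive). Assumes the standard column layout:
# ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
uncorrectable_attrs() {
    awk 'tolower($2) ~ /uncorrect/ { print $2, $10 }'
}

# Example (the install drive in the question is /dev/sdb):
#   sudo smartctl -A /dev/sdb | uncorrectable_attrs
```

Running it before and after a short or long self-test shows whether the raw values keep climbing, which is a stronger sign of a failing drive than the overall "OK" assessment.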

Bobchuck

10/28/22, 3:42 AM

Thank you for the response. I'll try the NCQ part out now, but when you said "do NOT bad block a SSD", are you saying not to run those following commands on an SSD? My Ubuntu installation is in fact on an SSD.

heynnema

10/28/22, 3:51 AM

@Bobchuck What make/model SSD? Samsung?

Bobchuck

10/28/22, 4:13 AM

It's a SanDisk SSD PLUS 1TB. Also, editing that grub file and updating grub did not seem to change anything -- NetworkManager still crashed on start. I'll add the logs from `syslog` and `dmesg` to my question since there might be more to see there now. Interestingly, though, I was able to get the service to start by running `sudo systemctl start NetworkManager.service` (`restart` did nothing) both before and after editing the grub file, but I was probably just lucky.

heynnema

10/28/22, 4:15 AM

@Bobchuck After editing the GRUB file, and `sudo update-grub`, did you reboot the system?

Bobchuck

10/28/22, 4:25 AM

Yes, I edited the grub file, ran that command, and then rebooted the machine.

heynnema

10/28/22, 4:29 AM

@Bobchuck Use the `grep -i FPDMA /var/log/syslog*` to see if you get any hits after the last reboot. Also see the update in my answer.

heynnema

10/28/22, 4:31 AM

@Bobchuck Added `fsck` update to my answer.

Bobchuck

10/28/22, 4:39 AM

I'll run the fsck commands in the morning tomorrow as it's getting late here. As for the syslogs, I updated my question with the logs for that at the bottom of the post.

heynnema

10/28/22, 4:42 AM

@Bobchuck What time did you reboot? The last FPDMA error was logged at Jun 27 19:52:49.

Bobchuck

10/28/22, 4:53 AM

hm It would have been around midnight or past midnight, so about Jun 28 00:00:00

heynnema

10/28/22, 4:53 AM

@Bobchuck The SSD is still having a problem. Do the `fsck` and check the firmware. You may have a bad SSD or cable. Is the SSD an internal SATA drive? How many drives do you have? All internal SATA? Or external USB?

Bobchuck

10/28/22, 4:28 PM

I ran the fsck and put the output in my original post. I also checked the SMART status of the drive and it was OK, so it seems as though the drive is healthy. Regardless, I have 5 drives: 3 SSDs and 2 HDDs, all internal SATA. Only the SSD that the installation is on is mounted on startup.

heynnema

10/28/22, 4:46 PM

@Bobchuck Good job! Please show me screenshots of the SMART Data window. Also check the SSD firmware, using the link in my answer. On the SSD SATA cable, do you have a spare cable? If not, can you at least re-seat the cable at both ends? Do you have a power supply with enough power to run all of your SSDs/HDDs?

Bobchuck

10/29/22, 12:45 AM

I added a screenshot of the SMART data window to my post. Neither changing the SATA cable nor updating the SSD firmware did anything. Yes I have enough power for the drives. I've had this current setup for almost 2 years as-is, and I'm commenting from my Windows drive within the same machine right now.

heynnema

10/29/22, 2:51 AM

@Bobchuck Thanks for the update. Even though SMART says the SSD is ok, it's not. You have **6146 uncorrectable errors** and **21388 uncorrectable ECC errors**! Since you already changed the cable and updated the firmware, then the problem is either with the SATA port, or your **SSD is bad**.

Bobchuck

10/29/22, 3:39 AM

oh thank you for clarifying that data for me. I had never used that test before. Unfortunate that the drive is only 2 years old. I'll swap the port it's on and run the test again to confirm. Just so I understand the situation completely, if the problem is indeed with the drive, then the current hypothesis is that the drive is failing to read the data properly causing these errors, and has possibly gotten worse resulting in more frequent errors?

heynnema

10/29/22, 4:46 AM

@Bobchuck Correct. Once in the **SMART Data & Tests** area, you can also run the short/long tests, and observe if those error counts continue to increase.

Bobchuck

10/29/22, 4:51 AM

I appreciate the help. It's unfortunate it had to end like this, but luckily the whole drive hasn't failed yet. I'll play around with some of those tests and then start looking around for a new drive.

heynnema

10/29/22, 4:59 AM

@Bobchuck You may wish to check on SanDisk's warranty on that drive. Maybe they'll replace it for you.
