Score:0

Ubuntu 20.04 LTS NetworkManager.service fails to start

za flag

Several months ago I began having issues with NetworkManager.service, leaving me with no internet connection at all. Ubuntu would show error popups saying the service had failed to start, but restarting the computer got it running again, and it didn't happen often. It then became more frequent, and a single restart no longer fixed it reliably; it sometimes took several attempts before the service started properly. I found a suggestion to run sudo systemctl restart NetworkManager.service, and for a while this did the trick (though I had to run it almost every time I restarted the computer).

Just today, though, this command stopped working and produced an error, and now I cannot connect to the internet at all from Ubuntu, even after several restarts and shutdowns:

~$ sudo systemctl restart NetworkManager.service
Job for NetworkManager.service failed because a fatal signal was delivered causing the control process to dump core.
See "systemctl status NetworkManager.service" and "journalctl -xe" for details.

Checking the systemctl status of it, I get this:

~$ systemctl status NetworkManager.service
● NetworkManager.service - Network Manager
     Loaded: loaded (/lib/systemd/system/NetworkManager.service; enabled; vendor preset: enabled)
     Active: failed (Result: core-dump) since Sun 2021-06-27 14:40:30 EDT; 2min 9s ago
       Docs: man:NetworkManager(8)
    Process: 3222 ExecStart=/usr/sbin/NetworkManager --no-daemon (code=dumped, signal=BUS)
   Main PID: 3222 (code=dumped, signal=BUS)

Jun 27 14:40:30 user systemd[1]: NetworkManager.service: Scheduled restart job, restart counter is at 5.
Jun 27 14:40:30 user systemd[1]: Stopped Network Manager.
Jun 27 14:40:30 user systemd[1]: NetworkManager.service: Start request repeated too quickly.
Jun 27 14:40:30 user systemd[1]: NetworkManager.service: Failed with result 'core-dump'.
Jun 27 14:40:30 user systemd[1]: Failed to start Network Manager.

As for the journalctl -xe output, I have put everything the log gave me at this pastebin link: https://pastebin.com/gTJMktN5 There are a lot of errors similar to the above saying that it failed with a core-dump, but here's just one of the blocks that might be relevant:

-- A start job for unit NetworkManager.service has begun execution.
-- 
-- The job identifier is 1897.
Jun 27 14:40:28 user kernel: ata4.00: exception Emask 0x0 SAct 0x200000 SErr 0x0 action 0x0
Jun 27 14:40:28 user kernel: ata4.00: irq_stat 0x40000008
Jun 27 14:40:28 user kernel: ata4.00: failed command: READ FPDMA QUEUED
Jun 27 14:40:28 user kernel: ata4.00: cmd 60/08:a8:70:9a:41/00:00:5a:00:00/40 tag 21 ncq dma 4096 in
                                      res 41/40:00:74:9a:41/00:00:5a:00:00/00 Emask 0x409 (media error) <F>
Jun 27 14:40:28 user kernel: ata4.00: status: { DRDY ERR }
Jun 27 14:40:28 user kernel: ata4.00: error: { UNC }
Jun 27 14:40:28 user kernel: ata4.00: configured for UDMA/133
Jun 27 14:40:28 user kernel: sd 3:0:0:0: [sdb] tag#21 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
Jun 27 14:40:28 user kernel: sd 3:0:0:0: [sdb] tag#21 Sense Key : Medium Error [current] 
Jun 27 14:40:28 user kernel: sd 3:0:0:0: [sdb] tag#21 Add. Sense: Unrecovered read error - auto reallocate failed
Jun 27 14:40:28 user kernel: sd 3:0:0:0: [sdb] tag#21 CDB: Read(10) 28 00 5a 41 9a 70 00 00 08 00
Jun 27 14:40:28 user kernel: blk_update_request: I/O error, dev sdb, sector 1514248820 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Jun 27 14:40:28 user kernel: ata4: EH complete
Jun 27 14:40:28 user kernel: ata4.00: exception Emask 0x0 SAct 0x4000000 SErr 0x0 action 0x0
Jun 27 14:40:28 user kernel: ata4.00: irq_stat 0x40000008
Jun 27 14:40:28 user kernel: ata4.00: failed command: READ FPDMA QUEUED
Jun 27 14:40:28 user kernel: ata4.00: cmd 60/08:d0:70:9a:41/00:00:5a:00:00/40 tag 26 ncq dma 4096 in
                                      res 41/40:00:74:9a:41/00:00:5a:00:00/00 Emask 0x409 (media error) <F>
Jun 27 14:40:28 user kernel: ata4.00: status: { DRDY ERR }
Jun 27 14:40:28 user kernel: ata4.00: error: { UNC }
Jun 27 14:40:28 user kernel: ata4.00: configured for UDMA/133
Jun 27 14:40:28 user kernel: sd 3:0:0:0: [sdb] tag#26 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
Jun 27 14:40:28 user kernel: sd 3:0:0:0: [sdb] tag#26 Sense Key : Medium Error [current] 
Jun 27 14:40:28 user kernel: sd 3:0:0:0: [sdb] tag#26 Add. Sense: Unrecovered read error - auto reallocate failed
Jun 27 14:40:28 user kernel: sd 3:0:0:0: [sdb] tag#26 CDB: Read(10) 28 00 5a 41 9a 70 00 00 08 00
Jun 27 14:40:28 user kernel: blk_update_request: I/O error, dev sdb, sector 1514248820 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Jun 27 14:40:28 user kernel: ata4: EH complete
Jun 27 14:40:28 user systemd[1]: NetworkManager.service: Main process exited, code=dumped, status=7/BUS
-- Subject: Unit process exited
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
-- 
-- An ExecStart= process belonging to unit NetworkManager.service has exited.

I've seen similar posts with replies suggesting a kernel update, among other things, but I'm already running the latest kernel available for 20.04 LTS, and I wouldn't expect to have to deviate much from it.

I'm running Ubuntu 20.04.2 LTS x86_64 with the kernel:

~$ uname -a
Linux user 5.8.0-59-generic #66~20.04.1-Ubuntu SMP Thu Jun 17 11:14:10 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

While collecting these logs today, I also started getting frequent error popups for services that had never failed before. They were for the following services:

/usr/libexec/colord
/usr/libexec/tracker-extract
/usr/libexec/tracker-miner-fs
/usr/lib/packagekit/packagekitd

I don't know if they're related, but considering they started at the same time the restart command I was using stopped working, it seems likely there is a bigger issue. On top of these, restarting and shutting down the computer produces pages of errors scrolling too fast for me to read them during the shutdown sequence.

Any help towards debugging or finding a workaround would be appreciated.

Edits:

Here is the output of grep -i FPDMA /var/log/syslog*: https://pastebin.com/tazDug7H

Here is the output of dmesg: https://pastebin.com/ctefUjUA There were a few I/O errors in this one. For the record, the installation drive is /dev/sdb.

The output of fsck on the install drive:

~$ sudo fsck -f /dev/sdb2
fsck from util-linux 2.34
e2fsck 1.45.5 (07-Jan-2020)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/sdb2: 635347/61022208 files (1.4% non-contiguous), 29081215/244059648 blocks

screenshot of SMART test

Score:0
ru flag

NCQ

You're having disk NCQ errors...

Jun 27 14:40:28 user kernel: ata4.00: failed command: READ FPDMA QUEUED
Jun 27 14:40:28 user kernel: ata4.00: cmd 60/08:a8:70:9a:41/00:00:5a:00:00/40 tag 21 ncq dma 4096 in
                                      res 41/40:00:74:9a:41/00:00:5a:00:00/00 Emask 0x409 (media error) <F>
Jun 27 14:40:28 user kernel: ata4.00: status: { DRDY ERR }
Jun 27 14:40:28 user kernel: ata4.00: error: { UNC }

Native Command Queuing (NCQ) is an extension of the Serial ATA protocol allowing hard disk drives to internally optimize the order in which received read and write commands are executed.
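
As a small worked example, the failing location can be read straight out of the log lines in the question. This is only a sketch in POSIX shell; the helper name hex_to_lba is made up for illustration, and the hex value is bytes 2 to 5 of the Read(10) CDB that the kernel logged:

```shell
#!/bin/sh
# Convert a big-endian hex LBA (bytes 2-5 of a Read(10) CDB) to decimal.
# hex_to_lba is a hypothetical helper name, not part of any tool.
hex_to_lba() {
    printf '%d\n' "0x$1"
}

start=$(hex_to_lba 5a419a70)   # from "CDB: Read(10) 28 00 5a 41 9a 70 00 00 08 00"
failed=1514248820              # from "I/O error, dev sdb, sector 1514248820"
echo "read starts at sector $start, fails $((failed - start)) sectors in"
```

The last two CDB bytes before the control byte (00 08) give a transfer length of 8 sectors, which matches the "ncq dma 4096 in" in the log, and both logged attempts fail on the same sector, 1514248820.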

Open the GRUB defaults file with sudo -H gedit /etc/default/grub and change the following line to include the extra libata.force=noncq parameter. Then run sudo update-grub to write the changes to disk, and reboot. Monitor for hangs and similar problems, and check grep -i FPDMA /var/log/syslog* or dmesg for continued error messages.

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash libata.force=noncq"
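
After the reboot, it may help to confirm that the parameter actually reached the kernel. The following is a minimal sketch, assuming a POSIX shell; has_noncq is a hypothetical helper name, and /proc/cmdline is where the running kernel exposes its boot parameters:

```shell
#!/bin/sh
# Return success if the given kernel command line contains libata.force=noncq.
# has_noncq is a hypothetical helper name used only for this sketch.
has_noncq() {
    case " $1 " in
        *" libata.force=noncq "*) return 0 ;;
        *) return 1 ;;
    esac
}

# /proc/cmdline holds the parameters the running kernel was booted with.
if has_noncq "$(cat /proc/cmdline 2>/dev/null)"; then
    echo "NCQ is disabled for this boot"
else
    echo "libata.force=noncq is not active; re-check /etc/default/grub and rerun update-grub"
fi
```

To look only at errors logged since the current boot, rather than across all syslog files, `journalctl -k -b | grep -ci fpdma` prints a count of matching kernel messages.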

fsck

  • boot to an Ubuntu Live DVD/USB in “Try Ubuntu” mode
  • open a terminal window by pressing Ctrl+Alt+T
  • type sudo fdisk -l
  • identify the /dev/sdXX device name for your "Linux Filesystem"
  • type sudo fsck -f /dev/sdXX, replacing sdXX with the device name you found earlier
  • repeat the fsck command if there were errors
  • type reboot
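
Whether to repeat the fsck command can be judged from its exit status, which is a bit mask documented in the fsck man page. A minimal sketch, assuming a POSIX shell; fsck_verdict is a hypothetical helper name and /dev/sdXX remains a placeholder:

```shell
#!/bin/sh
# Interpret an fsck exit status. The status is a bit mask:
# 1 = errors corrected, 2 = reboot advised, 4 = errors left uncorrected,
# 8 and above = operational, usage, or library problems.
# fsck_verdict is a hypothetical helper name used only for this sketch.
fsck_verdict() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "clean"
    elif [ "$status" -ge 8 ]; then
        echo "fsck itself had a problem (status $status)"
    elif [ $((status & 4)) -ne 0 ]; then
        echo "errors remain: run fsck again"
    elif [ $((status & 1)) -ne 0 ]; then
        echo "errors corrected: run fsck again to confirm"
    else
        echo "reboot advised"
    fi
}

# Usage from the live session (device name is a placeholder):
#   sudo fsck -f /dev/sdXX; fsck_verdict $?
```

A clean fsck result only says the filesystem structure is consistent; it does not read every data block, so it cannot rule out unreadable sectors on the drive.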

SSD

Regarding your SanDisk SSD PLUS 1TB, check for a firmware update. Go to the SanDisk web site and download their Dashboard software. Windows required.

See https://kb.sandisk.com/app/answers/detail/a_id/15108/~/dashboard-support-information

Update #1:

Even though SMART says the SSD is ok, it's not. You have 6146 uncorrectable errors and 21388 uncorrectable ECC errors! Since you already changed the cable and updated the firmware, then the problem is either with the SATA port, or your SSD is bad.
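
The same counters can also be read from a terminal, which makes it easier to see whether they keep climbing between tests. This is a sketch, assuming the smartmontools package is installed. smart_raw is a hypothetical helper name, and attribute ID 187 is only an example rather than a value read from the screenshot:

```shell
#!/bin/sh
# Print the raw value (10th column) of one attribute from the
# attribute table that `smartctl -A` writes, read here from stdin.
# smart_raw is a hypothetical helper name used only for this sketch.
smart_raw() {
    awk -v id="$1" '$1 == id { print $10 }'
}

# Usage (/dev/sdb is the install drive named in the question;
# 187 is an example attribute ID, adjust it to the rows of interest):
#   sudo smartctl -A /dev/sdb | smart_raw 187
#   sudo smartctl -t short /dev/sdb    # start a short self-test
#   sudo smartctl -l selftest /dev/sdb # view self-test results afterwards
```

Running the first command before and after a self-test, and comparing the two numbers, shows whether the error count is still growing.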

Bobchuck avatar
za flag
Thank you for the response. I'll try the NCQ part out now, but when you said "do NOT bad block a SSD", are you saying not to run those following commands on an SSD? My Ubuntu installation is in fact on an SSD.
heynnema avatar
ru flag
@Bobchuck What make/model SSD? Samsung?
Bobchuck avatar
za flag
It's a SanDisk SSD PLUS 1TB. Also, editing that grub file and updating grub did not seem to change anything -- NetworkManager still crashed on start. I'll add the logs from `syslog` and `dmesg` to my question since there might be more to see there now. Interestingly, though, I was able to get the service to start by running `sudo systemctl start NetworkManager.service` (`restart` did nothing) both before and after editing the grub file, but I was probably just lucky.
heynnema avatar
ru flag
@Bobchuck After editing the GRUB file, and `sudo update-grub`, did you reboot the system?
Bobchuck avatar
za flag
Yes, I edited the grub file, ran that command, and then rebooted the machine.
heynnema avatar
ru flag
@Bobchuck Use the `grep -i FPDMA /var/log/syslog*` to see if you get any hits after the last reboot. Also see the update in my answer.
heynnema avatar
ru flag
@Bobchuck Added `fsck` update to my answer.
Bobchuck avatar
za flag
I'll run the fsck commands in the morning tomorrow as it's getting late here. As for the syslogs, I updated my question with the logs for that at the bottom of the post.
heynnema avatar
ru flag
@Bobchuck What time did you reboot? The last FPDMA error was logged at Jun 27 19:52:49.
Bobchuck avatar
za flag
hm It would have been around midnight or past midnight, so about Jun 28 00:00:00
heynnema avatar
ru flag
@Bobchuck The SSD is still having a problem. Do the `fsck` and check the firmware. You may have a bad SSD or cable. Is the SSD an internal SATA drive? How many drives do you have? All internal SATA? Or external USB?
Bobchuck avatar
za flag
I ran the fsck and put the output in my original post. I also checked the SMART status of the drive and it was ok, so it seems as though the drive is healthy. Regardless, I have 5 drives: 3 SSDs and 2 HDDs, all internal SATA. Only the SSD that the installation is on is mounted on startup.
heynnema avatar
ru flag
@Bobchuck Good job! Please show me screenshots of the SMART Data window. Also check the SSD firmware, using the link in my answer. On the SSD SATA cable, do you have a spare cable? If not, can you at least re-seat the cable at both ends? Do you have a power supply with enough power to run all of your SSDs/HDDs?
Bobchuck avatar
za flag
I added a screenshot of the SMART data window to my post. Neither changing the SATA cable nor updating the SSD firmware did anything. Yes I have enough power for the drives. I've had this current setup for almost 2 years as-is, and I'm commenting from my Windows drive within the same machine right now.
heynnema avatar
ru flag
@Bobchuck Thanks for the update. Even though SMART says the SSD is ok, it's not. You have **6146 uncorrectable errors** and **21388 uncorrectable ECC errors**! Since you already changed the cable and updated the firmware, then the problem is either with the SATA port, or your **SSD is bad**.
Bobchuck avatar
za flag
oh thank you for clarifying that data for me. I had never used that test before. Unfortunate that the drive is only 2 years old. I'll swap the port it's on and run the test again to confirm. Just so I understand the situation completely, if the problem is indeed with the drive, then the current hypothesis is that the drive is failing to read the data properly causing these errors, and has possibly gotten worse resulting in more frequent errors?
heynnema avatar
ru flag
@Bobchuck Correct. Once in the **SMART Data & Tests** area, you can also run the short/long tests, and observe if those error counts continue to increase.
Bobchuck avatar
za flag
I appreciate the help. It's unfortunate it had to end like this, but luckily the whole drive hasn't failed yet. I'll play around with some of those tests and then start looking around for a new drive.
heynnema avatar
ru flag
@Bobchuck You may wish to check SanDisk's warranty on that drive. Maybe they'll replace it for you.