Score:0

Mongodb data files corrupted when copied in

gr flag

I’m sure there is a straightforward solution, I’m running a Mongodb service thusly:

[Unit]
Description=Mongo server for Open Quartermaster. Version ${version}, using MongoDB tagged to "4".
Documentation=https://github.com/Epic-Breakfast-Productions/OpenQuarterMaster/tree/main/software/Infrastructure
After=docker.service
Wants=network-online.target docker.socket
Requires=docker.socket

[Service]
Type=simple
Restart=always
TimeoutSec=5m

#ExecStartPre=/bin/bash -c "/usr/bin/docker container inspect oqm_mongo 2> /dev/null || "
ExecStartPre=/bin/bash -c "/usr/bin/docker stop -t 10 oqm_infra_mongo || echo 'Could not stop mongo container'"
ExecStartPre=/bin/bash -c "/usr/bin/docker rm oqm_infra_mongo || echo 'Could not remove mongo container'"
ExecStartPre=/bin/bash -c "/usr/bin/docker pull mongo:4"

ExecStart=/bin/bash -c "/usr/bin/docker run \
                                 --name oqm_infra_mongo \
                                 -p=27017:27017 \
                                 -v /data/oqm/db/mongo/:/data/db  \
                                 mongo:4 mongod --replSet rs0"
ExecStartPost=/bin/bash -c "running=\"false\"; \
                            while [ \"$running\" != \"true\" ]; do \
                                sleep 1s; \
                                /usr/bin/docker exec oqm_infra_mongo mongo --eval \"\"; \
                                if [ \"$?\" = \"0\" ]; then \
                                    echo \"Mongo container running and available!\"; \
                                    running=\"true\"; \
                                fi \
                            done \
                            "
ExecStartPost=/bin/bash -c "/usr/bin/docker exec oqm_infra_mongo mongo --eval \"rs.initiate({'_id':'rs0', 'members':[{'_id':0,'host':'localhost:27017'}]})\" || echo 'Probably already initialized.'"

ExecStop=/bin/bash -c "/usr/bin/docker stop -t 10 oqm_infra_mongo || echo 'Could not stop mongo container'"
ExecStopPost=/bin/bash -c "/usr/bin/docker rm oqm_infra_mongo || echo 'Could not remove mongo container'"

[Install]
WantedBy=multi-user.target

I’m having docker map the host’s /data/oqm/db/mongo/ directory, and looks like this works fine; looking from the host I see the directory populated when run. This setup seems to work fine between restarts and persists fine.

However, when I try backing up and restoring the data by copying the data in this directory out then back in, when trying to bring the service back up, Mongo barfs, claiming the data is corrupted. Any ideas?

To be clear my backup and restore both happen when the service is stopped, and consists of:

Backing up:

  1. Stopping the service (using systemd)
  2. Copying out the files in data directory
  3. Starting the service back up (using systemd)

Restoring:

  1. Stopping the service (using systemd)
  2. nuking the existing files in data dir
  3. copying back in the previously copied out files
  4. Starting the service (using systemd) (fails)

I am trying this method as it is generally described in the docs: https://www.mongodb.com/docs/manual/core/backups/#back-up-with-cp-or-rsync

Snippet of error logs:


May 27 09:06:59 oqm-dev bash[41304]: {"t":{"$date":"2023-05-27T13:06:59.275+00:00"},"s":"I",  "c":"STORAGE",  "id":22315,   "ctx":"initandlisten","msg":"Opening WiredTiger","attr":{"config":"create,cache_size=11224M,session_max=33000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000,close_scan_interval=10,close_handle_minimum=250),statistics_log=(wait=0),verbose=[recovery_progress,checkpoint_progress,compact_progress],"}}
May 27 09:06:59 oqm-dev bash[41304]: {"t":{"$date":"2023-05-27T13:06:59.686+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1685192819:686027][1:0x7fefd1197cc0], txn-recover: [WT_VERB_RECOVERY_PROGRESS] Recovering log 1 through 3"}}
May 27 09:06:59 oqm-dev bash[41304]: {"t":{"$date":"2023-05-27T13:06:59.719+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1685192819:719823][1:0x7fefd1197cc0], txn-recover: [WT_VERB_RECOVERY_PROGRESS] Recovering log 2 through 3"}}
May 27 09:06:59 oqm-dev bash[41304]: {"t":{"$date":"2023-05-27T13:06:59.749+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1685192819:749492][1:0x7fefd1197cc0], txn-recover: [WT_VERB_RECOVERY_PROGRESS] Recovering log 3 through 3"}}
May 27 09:06:59 oqm-dev bash[41304]: {"t":{"$date":"2023-05-27T13:06:59.801+00:00"},"s":"E",  "c":"STORAGE",  "id":22435,   "ctx":"initandlisten","msg":"WiredTiger error","attr":{"error":-31804,"message":"[1685192819:801911][1:0x7fefd1197cc0], txn-recover: __recovery_setup_file, 643: metadata corruption: files file:collection-0-6243490866083295563.wt and file:collection-0--9007965794334803376.wt have the same file ID 4: WT_PANIC: WiredTiger library panic"}}
May 27 09:06:59 oqm-dev bash[41304]: {"t":{"$date":"2023-05-27T13:06:59.801+00:00"},"s":"F",  "c":"-",        "id":23089,   "ctx":"initandlisten","msg":"Fatal assertion","attr":{"msgid":50853,"file":"src/mongo/db/storage/wiredtiger/wiredtiger_util.cpp","line":481}}
May 27 09:06:59 oqm-dev bash[41304]: {"t":{"$date":"2023-05-27T13:06:59.802+00:00"},"s":"F",  "c":"-",        "id":23090,   "ctx":"initandlisten","msg":"\n\n***aborting after fassert() failure\n\n"}
May 27 09:06:59 oqm-dev bash[41304]: {"t":{"$date":"2023-05-27T13:06:59.802+00:00"},"s":"F",  "c":"CONTROL",  "id":4757800, "ctx":"initandlisten","msg":"Writing fatal message","attr":{"message":"Got signal: 6 (Aborted).\n"}}

Update: After it was suggested, I doublechecked that permissions are being preserved, and they are definitely being preserved now. Unfortunately, still hitting the same issue

Ginnungagap avatar
gu flag
Are the permissions and owner/group properly restored?
Snappawapa avatar
gr flag
I don't do anything specifically to touch permissions, but noticing now from the container's perspective the file are all owned by `mongodb`, from the host they are all owned by `systemd-coredump`, which probably makes sense since the host has no 'mongodb' user
Score:0
gr flag

Figured it out.

Permissions might have definitely played a role, but what really did it was realizing that:

rm -rf "/some/dir/*" != rm -rf /some/dir*

Changing the command to rm -rf "$root/some/dir"/* did the trick

https://unix.stackexchange.com/questions/326584/rm-command-in-bash-script-does-not-work-with-variable

I sit in a Tesla and translated this thread with Ai:

mangohost

Post an answer

Most people don’t grasp that asking a lot of questions unlocks learning and improves interpersonal bonding. In Alison’s studies, for example, though people could accurately recall how many questions had been asked in their conversations, they didn’t intuit the link between questions and liking. Across four studies, in which participants were engaged in conversations themselves or read transcripts of others’ conversations, people tended not to realize that question asking would influence—or had influenced—the level of amity between the conversationalists.