I've been dealing with this issue for weeks now.
I have the following scenario:
couchdb2.3.1-A <===> couchdb2.3.1-B <===> couchdb3.1.1-A <===> couchdb3.1.1-B
where <===> represents two pull replications, one configured on each side, i.e., each node in a pair pulls from the other.
CouchDB is running in Docker containers.
If a write is made at couchdb2.3.1-A, it has to propagate through every server in the chain until it reaches couchdb3.1.1-B.
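For reference, each pull replication is a continuous replication document in the target node's _replicator database, roughly like the sketch below (hostnames, credentials and the database name are placeholders, not our actual values):

```python
# Minimal sketch of one of the pull replications, assuming placeholder
# hostnames, credentials and database name (not our real values).
import requests

TARGET = "http://admin:password@couchdb-3-1-1-a:5984"   # node that pulls
SOURCE = "http://admin:password@couchdb-2-3-1-b:5984"   # node it pulls from
DB = "mydb"

# The replication document lives on the target node, so the target
# continuously fetches changes from the remote source (a "pull").
repl_doc = {
    "_id": "pull-from-couchdb2.3.1-B",
    "source": f"{SOURCE}/{DB}",
    "target": f"{TARGET}/{DB}",
    "continuous": True,
}

resp = requests.post(f"{TARGET}/_replicator", json=repl_doc)
resp.raise_for_status()
print(resp.json())
```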
All of them have an exclusive HDD; CouchDB does not share its disk with any other service.
couchdb2.3.1-A and -B have no problems.
On couchdb3.1.1-A, disk latency gradually increased over time, so we stopped sending it write requests and now talk only to couchdb3.1.1-B. couchdb3.1.1-A still receives writes, but only via the replication protocol. Disk latency did not change.
Changes we've made since the problem started:
- Upgraded the kernel from 4.15.0-55-generic to 5.4.0-88-generic
- Upgraded Ubuntu from 18.04 to 20.04
- Deleted the _global_changes database from couchdb3.1.1-A
More info:
- CouchDB is using Docker local-persist volumes.
- Disks are WD Purple for the 2.3.1 CouchDBs and WD Black for the 3.1.1 CouchDBs.
- We have only one database of 88 GiB and two views: one of 22 GB and a small, very frequently updated one of 30 MB.
- docker stats shows that couchdb 3.1.1 uses a lot of memory compared to 2.3.1:
  3.5 GiB for couchdb3.1.1-A (not receiving direct write requests)
  8.0 GiB for couchdb3.1.1-B (receiving both read and write requests)
  226 MiB for 2.3.1-A
  552 MiB for 2.3.1-B
- Database compaction runs at night (roughly the sketch after this list). The problem only occurs during the day, when most writes are made.
- Most of the config is default.
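The nightly compaction job mentioned above just calls the standard compaction endpoints; a rough sketch of what it does (credentials, database and design document names are placeholders):

```python
# Rough sketch of the nightly compaction job, assuming placeholder
# credentials, database name and design document names.
import requests

BASE = "http://admin:password@localhost:5984"
DB = "mydb"
HEADERS = {"Content-Type": "application/json"}

# Compact the database file itself.
requests.post(f"{BASE}/{DB}/_compact", headers=HEADERS).raise_for_status()

# Compact the view indexes (one design document per view here).
for ddoc in ("big_view", "small_view"):
    requests.post(f"{BASE}/{DB}/_compact/{ddoc}", headers=HEADERS).raise_for_status()

# Remove stale view index files left by older design document versions.
requests.post(f"{BASE}/{DB}/_view_cleanup", headers=HEADERS).raise_for_status()
```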
Latency graph from Munin monitoring: [disk latency graph attached]
Any help is appreciated.