For quite a few months everything was working fine and this problem didn't exist. Then a crash loop in one of the containers produced a lot of logs and the server's disk filled up. That problem got solved, but now when I run my docker stack including elasticsearch, after a few hours the disk fills to 100%: it goes from 20GB usage to 75GB (100%) in a matter of minutes.
It cannot be the old logs because those have been removed from the system and elasticsearch is configured to perform ILM on indices, so it doesn't keep more than a few GB of data (rolls and deletes after a few days). Also important to note:
While `df -h` shows the disk is completely full, the elasticsearch volume, which is mounted at `/usr/share/elasticsearch/data` in the container, holds only a few GB (around 5GB) according to `du -h -d1`.
At the same time, `du -h -d1` on `/` shows only around 20GB of disk usage! So it's not clear where the extra ~50GB resides!
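To put a number on that gap, the two measurements can be compared in one place. The following is a minimal Python sketch, not a diagnosis: `df_used_bytes` reads the filesystem-level usage that `df` reports, and `du_bytes` sums the allocated blocks of the files reachable by walking the tree, staying on one filesystem (roughly what `du -x` does). The function names are made up for illustration, and the walk has to run as root to see everything under `/`.

```python
import os
import shutil


def df_used_bytes(path):
    """Used bytes on the filesystem containing path (the figure df reports)."""
    return shutil.disk_usage(path).used


def du_bytes(root):
    """Allocated bytes of files reachable under root, staying on root's filesystem."""
    root_stat = os.lstat(root)
    root_dev = root_stat.st_dev
    seen = {(root_stat.st_dev, root_stat.st_ino)}
    total = root_stat.st_blocks * 512  # st_blocks is in 512-byte units

    def count(path):
        """Add one entry to the total; return True if it is on root's filesystem."""
        nonlocal total
        try:
            st = os.lstat(path)
        except OSError:
            return False  # entry vanished or is unreadable
        if st.st_dev != root_dev:
            return False  # another filesystem is mounted here
        key = (st.st_dev, st.st_ino)
        if key not in seen:  # count hard-linked files only once
            seen.add(key)
            total += st.st_blocks * 512
        return True

    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk does not descend into other mounts.
        dirnames[:] = [d for d in dirnames if count(os.path.join(dirpath, d))]
        for name in filenames:
            count(os.path.join(dirpath, name))
    return total


def report(root="/"):
    """Return (used, visible, gap) in bytes for the filesystem at root."""
    used = df_used_bytes(root)
    visible = du_bytes(root)
    return used, visible, used - visible


if __name__ == "__main__":
    used, visible, gap = report("/")
    gib = 1024 ** 3
    print(f"df used: {used / gib:.1f} GiB")
    print(f"du sum:  {visible / gib:.1f} GiB")
    print(f"gap:     {gap / gib:.1f} GiB")
```

A large positive gap means space is held by something that a directory walk cannot reach, which is the situation described here (about 75GB used versus about 20GB visible).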
When I remove the elasticsearch service from the stack the disk usage goes back to 20GB instantly.
I tried:
- Removing the node from the swarm, with no containers running, and pruning everything including volumes. Disk usage falls. When I re-join and run the stack with elasticsearch, the problem comes back.
- Doing as suggested here and mounting `/` to `/mnt`. The `du` command showed no difference: still 20GB, while `df` showed a 100% full disk.
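Assuming the linked suggestion is the usual trick of mounting `/` a second time to look underneath other mounts, the directories worth checking are the ones that have something mounted over them, because files written there before the mount are invisible to `du`. The sketch below only lists those mount points by parsing text in the `/proc/self/mounts` format; the function name `mount_points` is hypothetical, and it does not perform any mount itself.

```python
import re


def mount_points(mounts_text):
    """Return the mount point paths from text in /proc/self/mounts format."""
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # blank or malformed line
        # The kernel writes special characters as octal escapes, e.g. \040 for a space.
        path = re.sub(r"\\([0-7]{3})", lambda m: chr(int(m.group(1), 8)), fields[1])
        points.append(path)
    return points


if __name__ == "__main__":
    with open("/proc/self/mounts") as f:
        for point in mount_points(f.read()):
            if point != "/":
                print(point)
```

Each printed path hides whatever the root filesystem holds in that directory; after mounting `/` at `/mnt`, the same paths under `/mnt` show the underlying contents.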
Server Resources:
- 75GB disk space
- 4 CPU cores
- 16GB RAM
The server runs CentOS 7 and is a manager in a Docker swarm. The elasticsearch instance is "pinned" to this server with deployment constraints. Other containers run on this server too.
The swarm has 4 nodes: 3 managers, which also act as workers, and 1 worker-only node.