ECS restarts due to health_check failure when multiple other requests are slow to return

Zev

6/11/23, 8:53 PM

We noticed that our ECS Fargate backend services restart due to a health check response timeout:

(service our-site-com-stack-BackendApiServiceStack...) (port 8000) is unhealthy in (target-group arn:aws:elasticloadbalancing:us-east-1:1234:targetgroup/dev-d-ABC-ABC123/ABC123) due to (reason Request timed out).

We are trying to figure out how to set up a health check for our application on ECS that won't needlessly restart our services whenever the database is busy (or other slow requests are pending).

We originally thought the situation might be similar to the one described here: https://cloudsoft.io/blog/consequences-of-bad-health-checks-in-aws-application-load-balancer. That is, if our database was busy or slow, the health check request could time out.

However, we modified the health check so that it does not hit our RDS Postgres database, and we even tried shutting the database off. We can reach the endpoint with the database off, but we can no longer reach it once we trigger as few as 7 requests that will time out (such as login requests with the database down) or a similar number of requests that are slow to return (with the database up).
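The reproduction described above can be sketched roughly as follows. This is a minimal sketch using only the Python standard library, not the exact procedure used; the function names, the default timeouts, and the warm-up delay are all hypothetical, and the two URLs have to be supplied by the caller.

```python
import concurrent.futures
import time
import urllib.error
import urllib.request


def fetch_status(url, timeout):
    """Return (HTTP status, elapsed seconds); status is None on timeout or connection error."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        status = exc.code  # the server answered, just not with a 2xx code
    except OSError:
        status = None  # timed out, refused, or otherwise unreachable
    return status, time.monotonic() - start


def probe_health_under_load(slow_url, health_url, n_slow=7,
                            slow_timeout=60.0, health_timeout=5.0, warmup=1.0):
    """Start n_slow concurrent slow requests, then time a single health check."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_slow) as pool:
        slow = [pool.submit(fetch_status, slow_url, slow_timeout)
                for _ in range(n_slow)]
        time.sleep(warmup)  # give the slow requests time to reach the server
        health = fetch_status(health_url, health_timeout)
        slow_results = [f.result() for f in slow]
    return health, slow_results
```

Calling `probe_health_under_load(slow_url, health_url)` with a slow endpoint (for example the login endpoint) and the health check URL returns the health check's status and latency while the slow requests are still pending. A status of `None` means the health check itself did not answer within `health_timeout`.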

In our AWS Application Stack, Route 53 is used to route traffic to our CloudFront distribution. CloudFront routes traffic for this endpoint to our Application Load Balancer for the Django application.

Our health check is part of our Django application and basically just returns a 200 response:

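from django.http import JsonResponse

def health_check(request):
    # Does not touch the database; just returns HTTP 200 with a small JSON body.
    response = JsonResponse({"message": "OK"})
    return response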

Here's how our health check is set up in CDK:

        self.https_listener = self.alb.add_listener(
            "HTTPSListener",
            port=443,
            certificates=[scope.certificate],
            open=True,
        )

        scope.https_listener.add_targets(
            "BackendTarget",
            port=80,
            targets=[self.backend_service],
            priority=2,
            path_patterns=["*"],
            health_check=elbv2.HealthCheck(
                healthy_http_codes="200-299",
                path="/api/core/health-check/",
            ),
        )
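The health check above sets only the path and the accepted status codes. For reference, `elbv2.HealthCheck` also accepts explicit timing and threshold settings. The following is a minimal sketch, assuming aws-cdk-lib v2 for Python; the values are illustrative and are not taken from the stack above.

```python
from aws_cdk import Duration, aws_elasticloadbalancingv2 as elbv2

# Illustrative values only (assumptions, not the settings used in the stack above).
health_check = elbv2.HealthCheck(
    healthy_http_codes="200-299",
    path="/api/core/health-check/",
    interval=Duration.seconds(30),   # time between two health check requests
    timeout=Duration.seconds(10),    # how long to wait for a response
    healthy_threshold_count=2,       # consecutive successes before "healthy"
    unhealthy_threshold_count=5,     # consecutive failures before "unhealthy"
)
```

The resulting object would be passed as the `health_check` argument of `add_targets`, in place of the inline `elbv2.HealthCheck(...)` call.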

The command that starts our production server is:

GEVENT_RESOLVER=ares gunicorn -t 1000 -k gevent -w 4 -b 0.0.0.0:8000 backend.wsgi
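For readers less familiar with the short flags, the same options can be written as a gunicorn configuration file. This is a sketch of an equivalent `gunicorn.conf.py`; the file name is an assumption, and the deployment above uses the command line form.

```python
# gunicorn.conf.py (hypothetical file; mirrors the flags in the command above)
bind = "0.0.0.0:8000"      # -b: address and port to listen on
workers = 4                # -w: number of worker processes
worker_class = "gevent"    # -k: gevent-based asynchronous workers
timeout = 1000             # -t: seconds a worker may stay silent before it is killed and restarted
```

With such a file, the server would be started with `GEVENT_RESOLVER=ares gunicorn -c gunicorn.conf.py backend.wsgi`.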

During an unrelated test, we were able to reproduce the same issue using Daphne:

daphne -b 0.0.0.0 -p 8000 backend.asgi:application

Tags: healthcheck, rds, amazon-ecs, aws-fargate
