I have an AWS Application Load Balancer (ALB) in front of an Auto Scaling group of EC2 instances. The instances run a Windows + IIS web server, and the web server connects to a database.
Every couple of months, the ALB health checks start reporting the application as unhealthy and the EC2 instances are taken down. There are always at least 2 instances running, and it happens to all of them at the same time. I am trying to understand why, but I cannot find any useful logs or any indication of where this is coming from.
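The termination reason and the per-target health-check failure reason are usually recorded by AWS even when the application logs show nothing. A sketch of where to look, using boto3 (the group name and target group ARN below are placeholders):

```python
import boto3

# The Auto Scaling activity history records why each instance was
# terminated ("ELB health check failed", spot interruption, scale-in, ...).
asg = boto3.client("autoscaling")
for activity in asg.describe_scaling_activities(
        AutoScalingGroupName="my-asg",  # placeholder name
        MaxRecords=20)["Activities"]:
    print(activity["StartTime"], activity["Cause"])

# The ALB reports a failure reason per target, e.g. Target.Timeout
# when the page did not answer within the health-check timeout.
elb = boto3.client("elbv2")
health = elb.describe_target_health(
    TargetGroupArn="arn:aws:elasticloadbalancing:...")  # placeholder ARN
for desc in health["TargetHealthDescriptions"]:
    print(desc["Target"]["Id"],
          desc["TargetHealth"].get("Reason"),
          desc["TargetHealth"].get("Description"))
```

The `Cause` field in the scaling activities in particular should say whether the ASG terminated the instances because the ELB health check failed, or for some other reason (spot reclaim, scale-in, etc.).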
See how the instance count suddenly drops to zero on 12/6:
Zoomed-in:
The EC2 instances are terminated as follows:
The health check is configured to ping a page that does not query the database, so a database bottleneck does not seem to be the likely cause.
When that happens, the response time skyrockets:
And also as measured by NewRelic:
Note a few things:
- all phases of the response are slower (Redis time, .NET time, etc.)
- it happens to all servers at the same time, so it is unlikely to be a problem within a single server
- it always happens outside of business hours, when load is low
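Since the slowdowns happen off-hours when nobody is watching, one cheap way to capture them is a standalone probe that polls the health-check page and logs status and latency (the URL below is a placeholder):

```python
import time
import urllib.request
from datetime import datetime, timezone

URL = "https://example.com/health"  # placeholder for the health-check page

# Poll the page every 10 s and log timestamp, HTTP status, and latency,
# so a slowdown outside business hours leaves a trace to correlate
# against the ALB health-check failures.
while True:
    start = time.monotonic()
    try:
        with urllib.request.urlopen(URL, timeout=20) as resp:
            status = resp.status
    except Exception as exc:
        status = f"ERR {exc!r}"
    elapsed = time.monotonic() - start
    print(f"{datetime.now(timezone.utc).isoformat()} {status} {elapsed:.2f}s",
          flush=True)
    time.sleep(10)
```

Running this from outside AWS (and, separately, from inside the VPC) would also help distinguish a network-path problem from a server-side one.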
Auto-Scaling configurations:
Minimum capacity=2
Maximum capacity=15
Instances distribution= 50% On-Demand, 50% Spot
Include On-Demand base capacity=Designate the first 1 instances as On-Demand
On-Demand allocation strategy=Prioritized
Spot allocation strategy=Lowest price - diversified across the 10 lowest priced pools
Capacity rebalance=Off
Instance scale-in protection=Not protected from scale in
Termination policies=Default
Default cooldown=300
Target Group Configurations:
Protocol=HTTPS
Path=/path/to/login/page
Port=Traffic port
Healthy threshold=2 consecutive health check successes
Unhealthy threshold=4 consecutive health check failures
Timeout=20 seconds
Interval=25 seconds
Success codes=200
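For reference, a quick sanity check of the detection windows implied by the Timeout/Interval/threshold values above:

```python
# Health-check settings from the target group configuration above
interval_s = 25            # seconds between health checks
unhealthy_threshold = 4    # consecutive failures to mark unhealthy
healthy_threshold = 2      # consecutive successes to mark healthy

# Worst case, a target must fail 4 checks in a row, 25 s apart,
# before the ALB marks it unhealthy.
time_to_unhealthy = unhealthy_threshold * interval_s
print(f"time to unhealthy: ~{time_to_unhealthy} s")   # ~100 s

# A recovered target needs 2 consecutive successes to be healthy again.
time_to_healthy = healthy_threshold * interval_s
print(f"time to healthy: ~{time_to_healthy} s")       # ~50 s
```

So whatever is happening must be making the page exceed the 20 s timeout (or return non-200) continuously for at least ~100 seconds on every instance at once, which points away from a transient blip on a single server.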