
AWS Application Load Balancer bringing ASP.NET application down


I have an AWS Application Load Balancer configured with EC2 and an auto-scaling group. The EC2 instances run a Windows+IIS web server. The Web Server connects to a database.

Occasionally (once every 2 months), the ALB health checks start to report the application as unhealthy and the EC2 instances are taken down. There are always at least 2 instances running, and this happens to all instances at the same time. I am trying to understand why this is happening, but I cannot find any useful logs or other indication of where it is coming from.


See how the instances are dropping to zero all of a sudden on 12/6:

in service instances

Zoomed-in:

in service instances, zoomed in

The EC2 instances are terminated as:

termination reason

The health check is configured to request a page that does not query the database, so a database bottleneck does not seem a likely cause.

When that happens, the response time skyrockets:

request response time

And also as measured by NewRelic:

newrelic response time

Note a few things:

  • all phases of the response are slower (Redis time, .NET time, etc.)
  • it happens to all servers at the same time, so it is unlikely to be a problem within a single server
  • it has always happened outside business hours, when load is low

Auto-Scaling configurations:

Minimum capacity=2
Maximum capacity=15
Instances distribution= 50% On-Demand, 50% Spot
Include On-Demand base capacity=Designate the first 1 instances as On-Demand
On-Demand allocation strategy=Prioritized
Spot allocation strategy=Lowest price - diversified across the 10 lowest priced pools
Capacity rebalance=Off
Instance scale-in protection=Not protected from scale in
Termination policies=Default
Default cooldown=300

Target Group Configurations:

Protocol=HTTPS
Path=/path/to/login/page
Port=Traffic port
Healthy threshold=2 consecutive health check successes
Unhealthy threshold=4 consecutive health check failures
Timeout=20 seconds
Interval=25 seconds
Success codes=200
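
For reference, the timings implied by the target group settings above can be worked out with a little arithmetic. This is only a rough sketch: it assumes checks run back to back at the configured interval and ignores the exact scheduling the ALB uses, and the function names are made up for illustration.

```python
# Rough health-check arithmetic for the target group settings listed above.
# Assumption: one check per interval, so N consecutive results take about
# N * interval seconds to accumulate. Real ALB timing may differ slightly.

def seconds_until_unhealthy(interval_s, unhealthy_threshold):
    """Approximate time of continuous failure before a target is marked unhealthy."""
    return interval_s * unhealthy_threshold

def seconds_until_healthy(interval_s, healthy_threshold):
    """Approximate time of continuous success before a target is marked healthy."""
    return interval_s * healthy_threshold

def check_fails(response_time_s, timeout_s):
    """A single check counts as failed if no response arrives within the timeout."""
    return response_time_s >= timeout_s

INTERVAL_S = 25          # Interval=25 seconds
TIMEOUT_S = 20           # Timeout=20 seconds
UNHEALTHY_THRESHOLD = 4  # 4 consecutive failures
HEALTHY_THRESHOLD = 2    # 2 consecutive successes

print(seconds_until_unhealthy(INTERVAL_S, UNHEALTHY_THRESHOLD))  # 100
print(seconds_until_healthy(INTERVAL_S, HEALTHY_THRESHOLD))      # 50
```

Under these assumptions, a page that takes longer than 20 seconds to respond for roughly 100 seconds in a row is enough to mark an instance unhealthy, even if the server never actually goes down.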
Tim
Could it be something like Windows Update rebooting the servers after patching? To mitigate that, you might be able to increase the unhealthy threshold to give the instances more time to recover. I wonder if you can stagger Windows Update times so that one instance stays healthy. To diagnose further, it would be easiest to somehow "quarantine" servers that fail health checks for manual inspection. Pushing server logs to CloudWatch Logs might help, so long as the logs are pushed promptly.
ng flag
Thanks. How do I do that? It doesn't happen often and when it does the instances are immediately terminated as soon as they become unhealthy.
Tim
I don't know how to do it; I would have to do some research, which you can look into yourself. The first thing to do, though, is to change your image to push logs to CloudWatch Logs as quickly as possible, so that at least you can see what the server is doing before the health checks fail. I would push both Windows and application logs.
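
One possible way to keep failing instances around for inspection, offered only as a sketch: an Auto Scaling group lets you suspend its `ReplaceUnhealthy` process, so instances marked unhealthy are not terminated and replaced until the process is resumed. The helper names and the group name `my-asg` below are hypothetical, and `asg_client` is assumed to be a boto3 Auto Scaling client. Whether this would have prevented the terminations in this thread is not confirmed.

```python
# Sketch: temporarily stop an Auto Scaling group from replacing unhealthy
# instances so they can be inspected. Helper names are illustrative only.
# `asg_client` is assumed to be boto3.client("autoscaling").

def pause_unhealthy_replacement(asg_client, group_name):
    """Suspend the ReplaceUnhealthy process for the given group."""
    asg_client.suspend_processes(
        AutoScalingGroupName=group_name,
        ScalingProcesses=["ReplaceUnhealthy"],
    )

def resume_unhealthy_replacement(asg_client, group_name):
    """Resume the ReplaceUnhealthy process once the investigation is done."""
    asg_client.resume_processes(
        AutoScalingGroupName=group_name,
        ScalingProcesses=["ReplaceUnhealthy"],
    )

# Usage (requires boto3 and AWS credentials; "my-asg" is a placeholder):
#   import boto3
#   client = boto3.client("autoscaling")
#   pause_unhealthy_replacement(client, "my-asg")
#   ... inspect the instances ...
#   resume_unhealthy_replacement(client, "my-asg")
```

While the process is suspended, the load balancer still stops routing traffic to unhealthy targets, so this trades automatic recovery for the chance to look at a failing server; it should be resumed as soon as the investigation is over.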
cn flag
Given that the reason is "user initiated shutdown", this sounds like Windows Update or some other scheduled task. Are you working in an account that is part of an AWS organization that might have things running in it? My last employer had some Lambdas that would shut down instances based on tags...
ng flag
There is nothing else running that could affect that, AFAIK. Windows Update could be the cause if all instances updated at the same time, but since some of the newly created instances were failing as well (until 30 minutes later, when they all of a sudden started working), it seems very unlikely.