What is the best way to configure an ALB to deal with a regional outage?

Quarkly

6/22/24, 2:33 PM

We have a basic ALB with four availability zones, all in us-east-1[abcd]. Last week, we were affected by this outage at Amazon:

[03:42 PM PDT] Between 11:49 AM PDT and 3:37 PM PDT, we experienced increased error rates and latencies for multiple AWS Services in the US-EAST-1 Region. Our engineering teams were immediately engaged and began investigating. We quickly narrowed down the root cause to be an issue with a subsystem responsible for capacity management for AWS Lambda, which caused errors directly for customers (including through API Gateway) and indirectly through the use of other AWS services. Additionally, customers may have experienced authentication or sign-in errors when using the AWS Management Console, or authenticating through Cognito or IAM STS.

My question is: how fault-tolerant is an ALB if all of your availability zones are in the same region? For anyone who has knowledge of this outage, would selecting a zone in Boston or Atlanta have given us better failover than choosing all the zones from us-east-1*?


load-balancing

amazon-web-services

Score:2

Server

mfinni

6/22/24, 3:44 PM

An ALB can only LB between zones within a region. If your whole region is impacted, you need to have HA or failover to another region.

There's no single region you can pick that will never have an outage.


Score:2

Server

Tim

6/22/24, 7:41 PM

An ALB can't cope with a regional outage. An ALB has nodes in each AZ of a single region.

To cope with a regional outage you need to use multiple regions. You direct traffic between regions with Route 53, using whichever type of split you prefer and relying on Route 53 health checks to take an unhealthy region out of rotation. You could:

Direct 100% of your traffic to your primary region while it's working
Split the traffic 50 / 50 between two regions. This might reduce user latency but also could make database consistency more challenging.
Split traffic 95 / 5 just to prove the second region is always working
You could alternate between regions for each deployment, blue / green style
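
As a rough illustration of the 95 / 5 split above, here is a minimal sketch that builds the Route 53 change batch for weighted alias records, one per region. The record name, ALB DNS names, and hosted zone ID placeholders are all hypothetical, and the helper `weighted_alias_changes` is my own invention rather than anything from an AWS library; only the dictionary layout follows the Route 53 ChangeResourceRecordSets request format.

```python
from typing import Any


def weighted_alias_changes(
    record_name: str,
    regions: dict[str, dict[str, Any]],
) -> dict[str, Any]:
    """Build a Route 53 ChangeBatch with one weighted alias A record per region."""
    changes = []
    for region, cfg in regions.items():
        weight = cfg["weight"]
        # Route 53 accepts integer weights from 0 to 255.
        if not 0 <= weight <= 255:
            raise ValueError(f"weight must be 0-255, got {weight}")
        changes.append({
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "A",
                # SetIdentifier distinguishes records sharing a name and type.
                "SetIdentifier": region,
                "Weight": weight,
                "AliasTarget": {
                    "HostedZoneId": cfg["alb_zone_id"],
                    "DNSName": cfg["alb_dns_name"],
                    # Stop answering with this ALB when it reports unhealthy.
                    "EvaluateTargetHealth": True,
                },
            },
        })
    return {"Comment": "Weighted multi-region split", "Changes": changes}


# Hypothetical values: substitute your own ALB DNS names and the
# load balancer hosted zone IDs for each region.
batch = weighted_alias_changes(
    "app.example.com.",
    {
        "us-east-1": {
            "weight": 95,
            "alb_zone_id": "ZPLACEHOLDEREAST",
            "alb_dns_name": "primary-alb.us-east-1.elb.amazonaws.com.",
        },
        "us-west-2": {
            "weight": 5,
            "alb_zone_id": "ZPLACEHOLDERWEST",
            "alb_dns_name": "secondary-alb.us-west-2.elb.amazonaws.com.",
        },
    },
)
```

With boto3 installed and credentials configured, the result could be passed as the `ChangeBatch` argument of the Route 53 client's `change_resource_record_sets` call, along with your own `HostedZoneId`. Setting the weights to 100 / 0 or 50 / 50 gives the other splits listed above.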

The other region can be standby, pilot light, or hot.

Standby: very few resources, perhaps just a load balancer with auto-scaling scaled down to zero (I've not tried this but I think it's possible)
Pilot light: a very small amount of resources
Hot: scaled to take full production load

