Been googling like crazy and can't find an answer. We have three AZs/subnets since we're in Ohio (us-east-2), but this diagram is close enough to explain the issue.
We've set up Squid proxies to filter outbound traffic from one of our services.
- For each AZ, app servers are in a private subnet.
- Then there's a proxy in each public subnet for that AZ.
- The route table for the private subnet points 0.0.0.0/0 at the ENI of the proxy in the corresponding public subnet (rough sketch of the wiring below).
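For reference, this is roughly how each private-subnet/proxy pair is wired up. A minimal boto3 sketch with placeholder IDs, not our actual provisioning code; note that the source/destination check has to be disabled on the proxy ENI or it will drop traffic that isn't addressed to it:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-2")

# Placeholder IDs -- one proxy ENI / private route table pair per AZ.
PROXY_ENI_ID = "eni-0123456789abcdef0"
PRIVATE_RT_ID = "rtb-0123456789abcdef0"

# The proxy can only forward traffic it didn't originate if the
# source/destination check on its ENI is disabled.
ec2.modify_network_interface_attribute(
    NetworkInterfaceId=PROXY_ENI_ID,
    SourceDestCheck={"Value": False},
)

# Default route for the private subnet: everything egresses via the proxy ENI.
ec2.create_route(
    RouteTableId=PRIVATE_RT_ID,
    DestinationCidrBlock="0.0.0.0/0",
    NetworkInterfaceId=PROXY_ENI_ID,
)
```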
Over time, outbound traffic from each subnet died. It took us a while to figure out what was going wrong, so as each subnet died we removed that subnet's instances from the service's ALB and motored on with a hobbled service while we researched. Yesterday the third subnet died, and we decided to "route around" the proxies by pointing the private subnets directly at the NAT gateway for each AZ. When we opened the route tables, we noticed the route to each proxy's ENI was listed as a blackhole.
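In case anyone wants to check their own tables the same way, the blackhole state is visible through the API as well as the console. A minimal boto3 sketch (no IDs assumed, it just scans every route table in the region):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-2")

# Report any route whose target has gone away ("blackhole" state).
for rt in ec2.describe_route_tables()["RouteTables"]:
    for route in rt["Routes"]:
        if route.get("State") == "blackhole":
            print(
                rt["RouteTableId"],
                route.get("DestinationCidrBlock"),
                "->",
                route.get("NetworkInterfaceId"),
            )
```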
We've inspected
- Proxy instance logs
- ENI allocation times, and
- CloudTrail logs
...looking for any indication of why the ENIs had become invalid, breaking our default routes. Nothing useful at all.
- The instances have been up for over three weeks
- The ENI allocation timestamp matches the instance creation time
- The boot logs don't show any reboots
- CloudTrail doesn't show any modifications to the ENIs / instances (the query we ran is roughly the sketch below).
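For anyone who wants to reproduce the CloudTrail check: this is roughly what we ran, as a boto3 sketch with a placeholder ENI ID. lookup_events only matches on a handful of attribute keys, so we filtered on ResourceName; we expected to see a delete/detach/modify call against each ENI, and there was nothing beyond the original creation:

```python
import boto3

ct = boto3.client("cloudtrail", region_name="us-east-2")

# Placeholder -- one of the proxy ENIs from the route table.
PROXY_ENI_ID = "eni-0123456789abcdef0"

# Pull every management event that references the ENI by resource name.
paginator = ct.get_paginator("lookup_events")
for page in paginator.paginate(
    LookupAttributes=[
        {"AttributeKey": "ResourceName", "AttributeValue": PROXY_ENI_ID}
    ]
):
    for event in page["Events"]:
        print(event["EventTime"], event["EventName"], event.get("Username"))
```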
We're stumped. How can our route table "suddenly" contain a route to an ENI that doesn't exist?
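For what it's worth, this is the sanity check we've been using to confirm the ENI ID stored in a blackhole route really is gone rather than just detached (placeholder ID again; a deleted ENI makes describe_network_interfaces raise InvalidNetworkInterfaceID.NotFound):

```python
import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2", region_name="us-east-2")

# Placeholder -- the ENI ID copied out of the blackhole route.
ROUTE_ENI_ID = "eni-0123456789abcdef0"

try:
    resp = ec2.describe_network_interfaces(NetworkInterfaceIds=[ROUTE_ENI_ID])
    eni = resp["NetworkInterfaces"][0]
    print("ENI exists:", eni["NetworkInterfaceId"], eni["Status"])
except ClientError as err:
    # InvalidNetworkInterfaceID.NotFound => the ENI in the route no longer exists.
    print("Lookup failed:", err.response["Error"]["Code"])
```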