Understanding EBSByteBalance% in AWS RDS gp3 volumes

I am troubleshooting an AWS RDS Postgres instance that has been restarted by AWS several times in the last few days, very likely due to resource constraints. It's a testing DB that usually doesn't do much, but we recently put some higher load on it. I found that the DB's EBS volume (200 GB gp3) depleted its throughput credits, and the times of the DB restarts coincide pretty well with the EBSByteBalance% metric reaching zero. Then, when the DB gets restarted, the volume apparently gets a fresh set of burst credits, as can be seen in the screenshot below:

[Screenshot: EBSByteBalance% over 3 days]
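
(For anyone wanting to pull this metric outside the console: a minimal boto3 sketch, assuming default credentials/region and a placeholder instance identifier my-test-db. EBSByteBalance% is published in the AWS/RDS namespace.)

    import boto3
    from datetime import datetime, timedelta, timezone

    cw = boto3.client("cloudwatch")
    end = datetime.now(timezone.utc)

    # Pull the last 3 days of EBSByteBalance% (hourly minimum) for the instance
    resp = cw.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="EBSByteBalance%",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "my-test-db"}],  # placeholder
        StartTime=end - timedelta(days=3),
        EndTime=end,
        Period=3600,
        Statistics=["Minimum"],
    )
    for dp in sorted(resp["Datapoints"], key=lambda d: d["Timestamp"]):
        print(dp["Timestamp"].isoformat(), dp["Minimum"])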

The credits now drop slightly more slowly since we have eased the load on the DB, but they are still dropping. When I look at the current read and write throughput metrics, they seem to sum to just about 5 to 7 MiB/s with occasional spikes:

[Screenshots: ReadThroughput and WriteThroughput over 3 hours]

Based on the information in Amazon RDS DB instance storage, the baseline throughput for a gp3 volume below 400 GiB should be 125 MiB/s. Can anyone explain why the EBSByteBalance% metric keeps decreasing in this scenario? Thanks!
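
(A sketch for sanity-checking those throughput numbers with CloudWatch metric math. Both metrics are reported in bytes/second; the instance identifier is again a placeholder:)

    import boto3
    from datetime import datetime, timedelta, timezone

    cw = boto3.client("cloudwatch")
    end = datetime.now(timezone.utc)

    def q(mid, name):
        # One metric query for the RDS instance; ReturnData=False so only the sum is returned
        return {
            "Id": mid,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/RDS",
                    "MetricName": name,
                    "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": "my-test-db"}],
                },
                "Period": 300,
                "Stat": "Average",
            },
            "ReturnData": False,
        }

    resp = cw.get_metric_data(
        MetricDataQueries=[
            q("r", "ReadThroughput"),
            q("w", "WriteThroughput"),
            # Both metrics are bytes/second; convert the sum to MiB/s
            {"Id": "total", "Expression": "(r + w) / 1048576", "Label": "combined MiB/s"},
        ],
        StartTime=end - timedelta(hours=3),
        EndTime=end,
    )
    series = resp["MetricDataResults"][0]
    for ts, v in zip(series["Timestamps"], series["Values"]):
        print(ts.isoformat(), round(v, 2))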

Tim:
Without looking into this, I wonder if the volume can do that performance, but with your instance size you're running out of credits? Restarting the database is really unusual; how do you know it's being restarted?
OP:
@Tim The instance also runs out of CPU credits, but since it is a T4G instance set to run in unlimited mode, additional CPU credits get purchased once the burst balance reaches zero, so there is no throttling on the DB CPU. And I know the DB gets restarted because RDS shows this under logs & events: https://imgur.com/f0zK7am The Postgres logfile has entries about the DB shutting down as well, but nothing I could find that would hint at the cause...
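
(For reference, the burstable-class credit metrics live in the same AWS/RDS namespace; a minimal sketch to confirm that unlimited mode is drawing surplus credits, with the instance identifier as a placeholder:)

    import boto3
    from datetime import datetime, timedelta, timezone

    cw = boto3.client("cloudwatch")
    end = datetime.now(timezone.utc)

    # CPUSurplusCreditBalance > 0 means the instance is bursting on borrowed
    # (unlimited-mode) credits rather than its earned CPUCreditBalance.
    for name in ("CPUCreditBalance", "CPUSurplusCreditBalance"):
        resp = cw.get_metric_statistics(
            Namespace="AWS/RDS",
            MetricName=name,
            Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "my-test-db"}],
            StartTime=end - timedelta(hours=6),
            EndTime=end,
            Period=3600,
            Statistics=["Average"],
        )
        latest = max(resp["Datapoints"], key=lambda d: d["Timestamp"], default=None)
        print(name, latest["Average"] if latest else "no data")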
Tim:
I meant to say running out of EBS credits, rather than CPU credits. The DB restarting sounds like a bug. I'd check CloudTrail to see if the RebootDBInstance API call was made and, if so, its source. For this type of thing, if you don't get a good answer here, I think it's worth spending the money on AWS Support; they're excellent and have access to information and logs the customer can't see. Developer support may be sufficient, and you can take it for just a month.
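
(A minimal boto3 sketch of that CloudTrail check, assuming default credentials/region; note that lookup_events only searches the last 90 days of management events in the current region:)

    import boto3

    ct = boto3.client("cloudtrail")

    # Look for explicit reboot calls against any RDS instance in this region
    resp = ct.lookup_events(
        LookupAttributes=[
            {"AttributeKey": "EventName", "AttributeValue": "RebootDBInstance"},
        ],
        MaxResults=50,
    )
    for event in resp.get("Events", []):
        # Username is absent for some service-initiated events
        print(event["EventTime"].isoformat(), event.get("Username", "<service>"))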
OP:
@Tim Interesting idea on the API call. Unfortunately I could not find anything in my CloudTrail logs... But I just confirmed my suspicion: credits on my EBS volume ran out and the DB restarted again. I found reports that resource constraints can cause the RDS instance to reboot. I still would like to know why the EBSByteBalance% metric continues to drop despite read and write throughput being way below the specified baseline performance...
Tim:
What's your RDS instance type? I wonder if you're below the gp3 limits but above the instance limits. If we can't solve this, like I said earlier, take it to AWS Support, who will probably reply with the correct answer within 24-48 hours, as they have access to a lot more information than the user.
OP:
@Tim Yep, that was precisely it. I wasn't aware the instance type was also imposing I/O and throughput limits in addition to the limits of the volume. AWS support confirmed this. See my answer below for anyone else finding themselves in a similar position. Thanks for your help and suggestions!
Answer (score 2)

Okay, I followed @Tim's advice and contacted AWS support. They clarified the following:

Kindly be informed that the metrics 'EBSIOBalance%' and 'EBSByteBalance%' are instance class metrics. Please note that gp3 volumes do not use burst performance; hence the metrics refer to instance class burst performance, not volume burst performance. EBSIOBalance% monitors the instance I/O burst bucket, and EBSByteBalance% monitors the instance byte burst bucket. These metrics give information about the percentage of I/O or byte credits remaining in the respective burst buckets. The metrics are expressed as a percentage, where 100% means that the instance has accumulated the maximum number of credits.

So what happened is that the T4G DB instance class also has I/O and throughput limits, which in our case were just around 10 MB/s. I was not aware of this and had a very hard time finding these performance numbers online, but for anyone wondering in the future, they can be found here: https://instances.vantage.sh/rds/ AWS also confirmed that under resource constraints the RDS instance may reboot, and they see this as the obvious explanation for the behaviour we witnessed.
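
(If you'd rather pull these limits from the API than from a third-party site: EC2's DescribeInstanceTypes exposes the per-instance EBS baselines, and RDS instance classes map to the equivalent EC2 type without the db. prefix. A sketch, using t4g.micro as a stand-in since the exact instance size isn't stated above:)

    import boto3

    ec2 = boto3.client("ec2")

    # db.t4g.micro -> t4g.micro; size chosen purely for illustration
    resp = ec2.describe_instance_types(InstanceTypes=["t4g.micro"])
    ebs = resp["InstanceTypes"][0]["EbsInfo"]["EbsOptimizedInfo"]

    print("baseline throughput (MB/s):", ebs["BaselineThroughputInMBps"])
    print("maximum throughput (MB/s):", ebs["MaximumThroughputInMBps"])
    print("baseline IOPS:", ebs["BaselineIops"])
    print("maximum IOPS:", ebs["MaximumIops"])

Whichever of the volume baseline and the instance baseline is lower is the one you will actually hit.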

So the mystery is solved in our case. Hope this helps someone in the future.

Tim:
This page documents the behavior, but it's not the simplest page to understand: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-optimized.html