I'll preface this by saying I'm fairly new to working with AWS.
Yesterday I deployed a new version of one of our services to our test environment.
Most people are on holiday at the moment, so this particular container doesn't get much usage.
The new version I deployed only changed the logger configuration: the logger now uses a new encoder that outputs Logstash JSON format logs. There were no business logic changes.
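For context, the encoder change is roughly along these lines (an illustrative sketch only; I'm using Python and the python-json-logger package here as a stand-in, the actual service and encoder differ):

```python
import logging
from pythonjsonlogger import jsonlogger

logger = logging.getLogger("service")
handler = logging.StreamHandler()

# Swap the plain-text formatter for a JSON one. A real Logstash-format
# encoder would also rename fields (e.g. asctime -> @timestamp); this is
# just to show the shape of the change.
handler.setFormatter(
    jsonlogger.JsonFormatter("%(asctime)s %(levelname)s %(name)s %(message)s")
)
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Extra fields end up as JSON keys, which is what Logstash ingests.
logger.info("request served", extra={"path": "/api/things"})
```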
I deployed the service and the deployment timed out because the health check failed twice (the service tries to run two instances).
I added 30 seconds to the wait time to see if it would deploy. It hung for a long time with CPU at 100%, but eventually stabilised.
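For reference, the "wait time" here is the ECS health check grace period; bumping it looks roughly like this (a boto3 sketch with placeholder cluster/service names, not our actual deployment code):

```python
import boto3

ecs = boto3.client("ecs")

# Give newly started tasks longer before health check failures count
# against them. "my-cluster"/"my-service" are placeholders.
ecs.update_service(
    cluster="my-cluster",
    service="my-service",
    healthCheckGracePeriodSeconds=90,  # previous value + 30s in this sketch
)
```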
However, once it stabilised, the CPU was bouncing erratically between 0% and 10%. I let it run for a while, then decided to revert to the previous image and investigate what the issue might be.
When I reverted to the older image, the same thing happened: CPU spiked to 100% for a while, then stabilised at around 10% load.
I left it overnight, but it's still the same this morning.
What could the issue be? The old image worked fine before I deployed the new one.
We're using ECS, running Tasks.
I should note that the service is working as expected: I can query the APIs and get results quickly.
This is what the health graphs looked like this morning:
The flat part of the graph shows the original task running; the first spike is the deployment of the new task, and the second spike is the re-deployment of the original task.
What could be the issue here?
Here is the same graph over a longer time period: