Score:0

Downtime on an EC2 instance

cn flag

Recently I had an issue on an EC2 instance. The site running there was unavailable for 2 hours:

CPU utilization for the past week:

The rest are around the time it happened:

The systemd journal around that period.

What I can see there? At around 20:31 it looks like things get slow:

Job execution of per-minute job scheduled for 20:30 delayed into subsequent minute 20:31. Skipping job run.

Jobs (cronyd) fail to start.

Jan 12 21:31:29 ip-172-xx-x-xx.xx-yy-z.compute.internal chronyd[24287]: Forward time jump detected!
Jan 12 21:33:21 ip-172-xx-x-xx.xx-yy-z.compute.internal chronyd[24287]: Can't synchronise: no selectable sources

The dhclient lines, usually they come together, but at that period it was like this:

Jan 12 20:46:21 ip-172-xx-x-xx.xx-yy-z.compute.internal dhclient[3066]: DHCPREQUEST on eth0 to 172.xx.x.xx port 67 (xid=0x7cb0e02d)
Jan 12 20:46:22 ip-172-xx-x-xx.xx-yy-z.compute.internal dhclient[3066]: DHCPACK from 172.xx.x.xx (xid=0x7cb0e02d)
Jan 12 21:06:23 ip-172-xx-x-xx.xx-yy-z.compute.internal dhclient[3066]: bound to 172.yy.y.yy -- renewal in 354 seconds.

Also:

Jan 12 21:47:22 ip-172-xx-x-xx.xx-yy-z.compute.internal dhclient[3066]: bound to 172.yy.y.yy -- renewal in -554 seconds.

And it looks like at 21:47 things get back to normal.

The docker containers running there restarted. I remember that their logs started closer to 10pm, probably at 21:47.

The sysstat log (/var/log/sa/sar12):

07:00:01 PM     all      3.77      0.00      0.53      0.00      0.53      0.00      0.11      0.00      0.00     95.05
07:00:01 PM       0      4.22      0.00      0.54      0.01      0.45      0.00      0.11      0.00      0.00     94.68
07:00:01 PM       1      3.33      0.00      0.53      0.00      0.61      0.00      0.10      0.00      0.00     95.43
07:10:01 PM     all      3.47      0.00      0.52      0.00      0.54      0.00      0.13      0.00      0.00     95.34
07:10:01 PM       0      4.01      0.00      0.53      0.00      0.48      0.00      0.10      0.00      0.00     94.88
07:10:01 PM       1      2.93      0.00      0.52      0.01      0.60      0.00      0.15      0.00      0.00     95.80
07:20:01 PM     all      1.89      0.00      0.47      0.00      0.46      0.00      0.10      0.00      0.00     97.08
07:20:01 PM       0      1.54      0.00      0.46      0.00      0.39      0.00      0.10      0.00      0.00     97.50
07:20:01 PM       1      2.24      0.00      0.48      0.00      0.53      0.00      0.10      0.00      0.00     96.65
07:30:01 PM     all      1.37      0.00      0.47      0.00      0.42      0.00      0.09      0.00      0.00     97.65
07:30:01 PM       0      1.55      0.00      0.46      0.00      0.36      0.00      0.08      0.00      0.00     97.54
07:30:01 PM       1      1.18      0.00      0.48      0.00      0.47      0.00      0.10      0.00      0.00     97.77
07:40:01 PM     all      1.32      0.00      0.47      0.00      0.41      0.00      0.10      0.00      0.00     97.71
07:40:01 PM       0      1.46      0.00      0.46      0.00      0.33      0.00      0.09      0.00      0.00     97.66
07:40:01 PM       1      1.18      0.00      0.47      0.00      0.48      0.00      0.10      0.00      0.00     97.77
07:50:01 PM     all      1.36      0.00      0.48      0.00      0.41      0.00      0.10      0.00      0.00     97.65
07:50:01 PM       0      1.14      0.00      0.45      0.00      0.33      0.00      0.11      0.00      0.00     97.96
07:50:01 PM       1      1.58      0.00      0.50      0.00      0.50      0.00      0.09      0.00      0.00     97.33
08:00:01 PM     all      2.17      0.00      0.52      0.01      0.52      0.00      0.12      0.00      0.00     96.66
08:00:01 PM       0      2.26      0.00      0.49      0.01      0.45      0.00      0.13      0.00      0.00     96.67
08:00:01 PM       1      2.08      0.00      0.55      0.01      0.60      0.00      0.12      0.00      0.00     96.65
08:10:01 PM     all      3.47      1.35      2.41      0.08      0.58      0.00      0.15      0.00      0.00     91.96
08:10:01 PM       0      3.28      1.11      2.38      0.07      0.50      0.00      0.15      0.00      0.00     92.51
08:10:01 PM       1      3.66      1.58      2.45      0.09      0.66      0.00      0.15      0.00      0.00     91.40

08:10:01 PM     CPU      %usr     %nice      %sys   %iowait    %steal      %irq     %soft    %guest    %gnice     %idle
08:20:01 PM     all      1.73      0.00      0.54      0.07      0.48      0.00      0.10      0.00      0.00     97.07
08:20:01 PM       0      1.94      0.00      0.58      0.07      0.40      0.00      0.10      0.00      0.00     96.90
08:20:01 PM       1      1.52      0.00      0.51      0.08      0.55      0.00      0.11      0.00      0.00     97.23
09:50:02 PM     all      2.11      0.11     50.63     43.63      0.09      0.00      0.02      0.00      0.00      3.41
09:50:02 PM       0      3.34      0.09     15.85     77.19      0.07      0.00      0.02      0.00      0.00      3.45
09:50:02 PM       1      0.93      0.12     83.90     11.54      0.11      0.00      0.02      0.00      0.00      3.37
10:00:01 PM     all      2.11      0.00      0.43      2.61      0.35      0.00      0.07      0.00      0.00     94.42
10:00:01 PM       0      1.87      0.00      0.45      2.73      0.25      0.00      0.07      0.00      0.00     94.63
10:00:01 PM       1      2.36      0.00      0.42      2.50      0.45      0.00      0.07      0.00      0.00     94.20
10:10:01 PM     all      0.80      0.00      0.33      0.00      0.29      0.00      0.06      0.00      0.00     98.52
10:10:01 PM       0      0.82      0.00      0.31      0.00      0.20      0.00      0.07      0.00      0.00     98.59
10:10:01 PM       1      0.77      0.00      0.35      0.00      0.37      0.00      0.06      0.00      0.00     98.45
10:20:01 PM     all      0.85      0.00      0.35      0.00      0.29      0.00      0.07      0.00      0.00     98.44
10:20:01 PM       0      0.85      0.00      0.34      0.00      0.21      0.00      0.07      0.00      0.00     98.53
10:20:01 PM       1      0.86      0.00      0.36      0.00      0.37      0.00      0.06      0.00      0.00     98.35
10:30:01 PM     all      1.41      0.00      0.38      0.00      0.33      0.00      0.08      0.00      0.00     97.79
10:30:01 PM       0      1.13      0.00      0.36      0.00      0.25      0.00      0.07      0.00      0.00     98.18
10:30:01 PM       1      1.69      0.00      0.40      0.00      0.42      0.00      0.09      0.00      0.00     97.40
10:40:01 PM     all      0.98      0.00      0.35      0.00      0.29      0.00      0.06      0.00      0.00     98.32
10:40:01 PM       0      0.70      0.00      0.33      0.00      0.22      0.00      0.06      0.00      0.00     98.69
10:40:01 PM       1      1.25      0.00      0.36      0.00      0.35      0.00      0.07      0.00      0.00     97.96
10:50:01 PM     all      0.65      0.00      0.34      0.00      0.28      0.00      0.06      0.00      0.00     98.68
10:50:01 PM       0      0.80      0.00      0.34      0.00      0.20      0.00      0.05      0.00      0.00     98.61
10:50:01 PM       1      0.50      0.00      0.34      0.00      0.35      0.00      0.06      0.00      0.00     98.75

There's a gap between 8:20 and 9:50, and only at 9:50 we see the load (idle 3%).

What might be relevant here, on Jan 04 I enabled time synchronization (timedatectl set-ntp true) because there was offset of 15 minutes:

System clock wrong by -910.996745 seconds

It's a t3a.medium instance. And I believe the credit specification was unlimited by then. At least that's what I saw on the next day. Anyways, the credit balance didn't hit the floor.

Can you explain it? What can I check?

To be fair, I can't be sure it can't be caused by the site or one of its components, but I haven't run into such issues.

UPD The issue might have been caused by memory leak in one of the containers. At least after making it run nokogiri tasks in separate processes, memory stopped growing and there were no similar incidents so far.

Tim avatar
gp flag
Tim
It's probably not the answer in this case, but I've seen Windows instances go down for Windows Updates in the past. Windows Update on EC2 can sometimes be fairly slow.
Score:1
jp flag

Looks like your instance got paused/suspended due to some issues with the physical host running your EC2 instance. Please remember that EC2 Instance level SLA is 99.5%. You probably want to enable monitoring for status check for your instances and automatic recovery.

cn flag
From the `sysstat` log it indeed looks like it was paused/suspended. But that was owing to the fact the `sa1` job failed to start during that time. If you look at the `systemd` log, there's no indication that it was suspended. I've added [another file](https://gist.github.com/x-yuri/e13937c2d32e1a48ce76f756fce395f5#file-timestamps-txt) to the gist where you can see the timestamps for the period where it could supposedly be suspended. As for status checks, the `StatusCheckFailed*` metrics are 0 since long before the time of the incident...
cn flag
...And on one hand, the fault might be on my part, but there are not many visitors for now and there is supposedly nothing to break this way. Well, at least I haven't run into such an issue. On the other hand, I can't be sure it can't be caused by the site or one of its components. Did you have any such incidents, especially those where the AWS UI was telling you that nothing broke?
jp flag
The instance lost network connection about 20:30 (flat NetworkOut on the graph) then the problem on the host was probably automatically detected and the instance was suspended and migrated to another physical host between 21:12 and 21:31 (when the time jump was detected).
cn flag
It should be noted that the traffic [never reached](https://i.imgur.com/Dm797xi.png) zero, but (at least) I'm not sure what exactly it measures, so that sounds plausible. It also explains the forward time jump. That's the best explanation so far :) Although that doesn't explain why it went unnoticed by the status checks.
mangohost

Post an answer

Most people don’t grasp that asking a lot of questions unlocks learning and improves interpersonal bonding. In Alison’s studies, for example, though people could accurately recall how many questions had been asked in their conversations, they didn’t intuit the link between questions and liking. Across four studies, in which participants were engaged in conversations themselves or read transcripts of others’ conversations, people tended not to realize that question asking would influence—or had influenced—the level of amity between the conversationalists.