
How is detection time calculated in the Google Site Reliability Engineering Workbook?

In the second SLO alerting example of the Site Reliability Engineering workbook, the following statement is made:

To keep the rate of alerts manageable, you decide to be notified only if an event consumes 5% of the 30-day error budget—a 36-hour window

The text seems to imply that the 36-hour window is derived from 5% of the 30-day error budget. I see that 36 hours is 5% of 30 days, but why are these two things linked? An event could consume any fraction of an error budget over any window size; it depends entirely on what the error budget is.

In addition, it then states the following formula for detection time:

((1 − SLO) / error ratio) × alerting window size
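To make the arithmetic concrete, here is a minimal sketch of how I read the formula above. The function name `detection_time`, the 99.9% SLO, and the sample error ratios are my own assumptions for illustration and are not taken from the question's excerpt; only the 30-day period, the 5% figure, and the formula itself come from the quoted text.

```python
from datetime import timedelta


def detection_time(slo: float, error_ratio: float, window: timedelta) -> timedelta:
    """Evaluate ((1 - SLO) / error ratio) * alerting window size.

    slo         -- target success ratio, e.g. 0.999 (assumed value)
    error_ratio -- fraction of requests failing during the incident, in (0, 1]
    window      -- alerting window size
    """
    if not 0.0 < error_ratio <= 1.0:
        raise ValueError("error_ratio must be in (0, 1]")
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be in (0, 1)")
    return window * ((1.0 - slo) / error_ratio)


# 5% of a 30-day period, which is the 36-hour window from the quoted text.
slo_period = timedelta(days=30)
alert_window = slo_period * 0.05

# Assumed SLO of 99.9%; a total outage means an error ratio of 1.0.
total_outage = detection_time(0.999, 1.0, alert_window)

# The same SLO with only 10% of requests failing takes ten times longer.
partial_outage = detection_time(0.999, 0.1, alert_window)

print(alert_window, total_outage, partial_outage)
```

Under these assumptions a total outage is detected after roughly 130 seconds with a 36-hour window, and the result scales linearly with whatever window is passed in, which is the proportionality I am asking about.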

Why is the detection time proportional to the alerting window size? If a sudden spike in errors triggers an alert, then as long as the alerting window covers the period over which the errors happened, the detection time should be the same for any alerting window size.

I suspect that whatever I am missing is the same for both statements, which is why I am asking about them together.
