What tools do you use to measure your MTTR as the Ops Team?

Charming Borg

8/29/23, 8:28 AM

And do you measure it at all?

My problem is that when an outage is alerted, it feels like a waste of time to create a JIRA ticket first, so I start solving it right away. Besides, some outages are first resolved with a workaround and then revisited later for a proper fix.

outage

measurement

Score:1

Server

Rob

8/29/23, 3:01 PM

"My problem is that when an outage is alerted, it feels like a waste of time to create a JIRA ticket first"

That is easily solved: most alerting systems can trigger several actions at the same time, and one of those actions can be the automatic creation of a Jira ticket.
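As a rough illustration of what such an automatic action could send, here is a minimal sketch that maps an alert to the request body for Jira's "create issue" REST endpoint (`POST /rest/api/2/issue`). The alert fields (`name`, `host`, `started_at`), the project key `OPS`, the issue type `Incident`, and the labels are all assumptions made for the example; your alerting tool and Jira project will use their own names. The sketch only builds the payload and does not send it.

```python
import json
from datetime import datetime, timezone


def build_jira_issue(alert: dict, project_key: str = "OPS") -> dict:
    """Map a (hypothetical) alert dict to a Jira v2 'create issue' body."""
    # Record the alert start time in the ticket so the repair time can
    # later be measured from detection, not from ticket creation.
    started = alert["started_at"].astimezone(timezone.utc).isoformat()
    return {
        "fields": {
            "project": {"key": project_key},        # assumed project key
            "issuetype": {"name": "Incident"},      # assumed issue type
            "summary": f"[ALERT] {alert['name']} on {alert['host']}",
            "description": f"Outage detected at {started}.",
            "labels": ["outage", "auto-created"],   # assumed labels
        }
    }


# Example alert, shaped the way this sketch expects it.
alert = {
    "name": "HTTP 5xx rate high",
    "host": "web-01",
    "started_at": datetime(2023, 8, 29, 8, 28, tzinfo=timezone.utc),
}
payload = build_jira_issue(alert)
print(json.dumps(payload, indent=2))
```

Putting the detection timestamp into the ticket at creation time means nobody has to open Jira before starting on the outage, and the start of the incident is still on record.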

Part of closing that Jira ticket can then be the administrative task of recording (in whatever way/system is suitable for you) what you agree upon as the repair time.

(Already implied but let me state that explicitly: the ticket resolution time tracked by your ticketing system is not the same as the time-to-repair.)
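To make that difference concrete, here is a small sketch that computes both numbers from the same incident records. The record layout (`detected`, `restored`, `ticket_closed`) and the sample timestamps are invented for illustration; "restored" stands for whatever moment you agree upon as the end of the repair.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: when the outage was detected, when service
# was restored, and when the ticket was finally closed.
incidents = [
    {
        "detected": datetime(2023, 8, 1, 10, 0),
        "restored": datetime(2023, 8, 1, 10, 30),
        "ticket_closed": datetime(2023, 8, 2, 9, 0),
    },
    {
        "detected": datetime(2023, 8, 10, 14, 0),
        "restored": datetime(2023, 8, 10, 15, 30),
        "ticket_closed": datetime(2023, 8, 10, 16, 0),
    },
]


def mean_minutes(records, start_key, end_key):
    """Mean duration in minutes between two timestamps of each record."""
    durations = [
        (r[end_key] - r[start_key]).total_seconds() / 60 for r in records
    ]
    return mean(durations)


mttr_minutes = mean_minutes(incidents, "detected", "restored")
ticket_minutes = mean_minutes(incidents, "detected", "ticket_closed")

print(f"MTTR: {mttr_minutes:.0f} min")                       # MTTR: 60 min
print(f"Mean ticket resolution: {ticket_minutes:.0f} min")   # 750 min
```

With these sample records the mean time to repair is one hour, while the mean ticket resolution time is over twelve hours, because the first ticket stayed open until the next morning.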

When ticket resolution times are important and a performance metric in themselves, you may want to close that automatically generated ticket immediately after the outage has been resolved.
When you start a root-cause analysis (RCA) investigation, use a related but new problem investigation ticket #XYZ (which has different performance criteria and is reported on differently from tickets regarding outages).

Depending on the RCA outcomes you may start work on a permanent fix / mitigation measures that you track in a different way again, depending on what needs to be done.

Charming Borg

8/29/23, 3:47 PM

Theoretically I could create JIRA tickets, but I'm getting quite a lot of false positives, so I would then need to go to JIRA all the time and mark them as WontFix. Also, do you really use JIRA for your MTTR report? I agree I could do it theoretically, but I'm not sure how well it would work in practice.

Rob

8/29/23, 4:08 PM

False positives in your alerting are a separate issue, but we do generate incidents for every alert (in a ticketing system other than Jira). Time to repair means different things to different people and businesses. Resolving the outage is often done by a restart, a fail-over to a backup system, etc. For some that is the repair, but for others the actual repair includes performing the RCA, tracking down the bug, fixing it in code, and the whole QA cycle until a release is finally deployed into production.

Rob

8/29/23, 4:11 PM

We report automatically on ticket resolution times, and also on availability and the duration of outages, but not on MTTR. Many things are easy fixes, but others simply take a long time (and have little priority).
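For readers wondering what an availability and outage-duration report boils down to, here is a minimal sketch, not the reporting setup described above. It assumes outages are given as (start, end) pairs that do not overlap and fall entirely inside the reporting period; the sample dates are made up.

```python
from datetime import datetime


def availability(outages, period_start, period_end):
    """Fraction of the period during which the service was up.

    Assumes outages do not overlap and lie fully within the period.
    """
    period = (period_end - period_start).total_seconds()
    downtime = sum((end - start).total_seconds() for start, end in outages)
    return 1 - downtime / period


# Hypothetical outages within a 30-day reporting period.
outages = [
    (datetime(2023, 8, 1, 10, 0), datetime(2023, 8, 1, 10, 30)),
    (datetime(2023, 8, 10, 14, 0), datetime(2023, 8, 10, 15, 30)),
]
period_start = datetime(2023, 8, 1)
period_end = datetime(2023, 8, 31)

downtime_minutes = sum(
    (end - start).total_seconds() / 60 for start, end in outages
)
ratio = availability(outages, period_start, period_end)

print(f"Total outage duration: {downtime_minutes:.0f} min")
print(f"Availability: {ratio:.4%}")
```

Both figures need only the start and end of each outage, which is why they can be reported automatically even when nobody has agreed on what counts as the "repair".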
