Score:0

Why is this bash script triggering so many false positives for monitoring memory usage?


I am monitoring hundreds of servers, both dedicated and virtual, using the following script:

#!/bin/bash

PATH=/usr/lib64/ccache:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin:/root/bin

threshold=90

serverip=$($(which ifconfig) | grep -Eo 'inet (addr:)?([0-9]*\.){3}[0-9]*' | grep -Eo '([0-9]*\.){3}[0-9]*' | grep -v '127.0.0.1' | head -1)
memused=$(free | awk '/Mem/{printf("RAM Usage: %.2f%\n"), $3/$2*100}' |  awk '{print $3}' | cut -d"." -f1)

if [ "$memused" -gt "$threshold" ]
then
    CTIME=$(date +%Y-%m-%d-%H%M%S)
    ps aux > /root/.example/logs/lowmem-"${CTIME}"-ps.log
    top -n 1 -o %MEM -c > /root/.example/logs/lowmem-"${CTIME}"-top.log
    free -m > /root/.example/logs/lowmem-"${CTIME}"-free.log
    mysqladmin proc -v status > /root/.example/logs/lowmem-"${CTIME}"-mysqlproc.log
    bash /example/general/slack.sh "#server-alerts" ":warning: $(hostname) -  ${serverip} - Memory Usage has reached 90% - Check logs /root/.example/logs/lowmem-${CTIME} \n \`\`\`$(head -1 /root/.example/logs/lowmem-"${CTIME}"-free.log) \n $(head -2 /root/.example/logs/lowmem-"${CTIME}"-free.log | tail -1) \n $(tail -1 /root/.example/logs/lowmem-"${CTIME}"-free.log)\`\`\`"
    crontab -l | grep -v '/example/mon_mem.sh' | crontab -
    sleep 900
    crontab -l | { cat; echo "* * * * * bash /example/mon_mem.sh"; } | crontab -
fi

While it works in most cases, we randomly get false positives. They come from seemingly random servers and are not consistent per server: one server might falsely trigger once and then never trigger again.

Example of a false positive:

total used free shared buff/cache available 
Mem: 2048 345 1580 27 122 1674 
Swap: 2048 0 2048

An alert came in from this server but you can see only 345 MB was in use.

anx
A low amount of "free" memory is *good*. If you want to warn on memory pressure, check the "available" number instead. Also, use the same output used to *trigger* the warning to include in the warning *text* for a more useful explanation (you are calling free twice, with likely differing results).
Score:1
anx

3 problems:

  1. You are calling free twice: once for triggering the warning, once for sending the report. The numbers will have changed in between. Store the output in a variable and read from that same data both times.

  2. "Used" memory should approach the total amount of memory, and "free" should approach zero, always. If you have unused memory, that means you have wasted resources that should, while not allocated, at least serve as caches.

    I recommend you change the memused line, which currently compares the second number against the first ($3/$2, i.e. used/total), to instead compare the first number against the last: "total" ($2) against "available" ($7).

  3. Your method of message delivery seems to lose formatting. You might want to check that your delivery method (slack.sh) renders your input in monospace, or replace tabs and spaces with appropriate spacers.

    This is what the table should look like:

                  total        used        free      shared  buff/cache   available
    Mem:           2048         345        1580          27         122        1674
    Swap:          2048           0        2048

    The six numbers in the "Mem" row start with the "total" memory, and if anything, the last number ("available") is the one you should care about.
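A minimal sketch of points 1 and 2 combined, not a drop-in replacement for your script. It assumes a `free` recent enough to print an "available" column (older procps versions print "buffers" and "cached" instead, so `$7` would mean something else there). The helper name `mem_used_pct` and the variable `snapshot` are made up for illustration; the logging and the slack.sh call are left out.

```shell
#!/bin/bash
# Sketch: take ONE snapshot of `free` and use it for both the check and the report.

# Hypothetical helper: reads `free` output on stdin and prints the integer
# percentage of memory that is NOT available, i.e. (total - available) / total.
mem_used_pct() {
    awk '/^Mem:/ && $2 > 0 { printf "%d\n", ($2 - $7) / $2 * 100 }'
}

threshold=90

# Skip quietly if `free` is not installed on this host.
if command -v free >/dev/null 2>&1; then
    snapshot=$(free -m)                       # captured once
    memused=$(printf '%s\n' "$snapshot" | mem_used_pct)

    # ${memused:-0} avoids a test error if the Mem: row could not be parsed.
    if [ "${memused:-0}" -gt "$threshold" ]; then
        # Quoting "$snapshot" keeps newlines and column alignment intact,
        # and the numbers shown are the same ones that triggered the alert.
        printf 'Memory usage is at %s%%\n%s\n' "$memused" "$snapshot"
    fi
fi
```

With the numbers from your example (2048 total, 1674 available) the helper prints 18, so no alert would fire for that snapshot. Whatever you pass to slack.sh should be built from `"$snapshot"` rather than from a second call to free.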
