Defining mean utilization of two GPUs changes other value or errors when used with negative

I'm monitoring some multi-GPU machines and want to make a combined CPU/GPU utilization graph with GPU as positive and CPU as negative.

I can create such a graph just fine for a single GPU against 100 - (cpu.idle / #cores), but I run into issues when I try to use the mean GPU utilization instead, calculated using sum and cdef.
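For reference, the two cdef expressions used in the configs below are written in RRDtool's reverse Polish notation. The following is only a minimal sketch of how they evaluate, using a tiny hand-written evaluator (`eval_rpn` is a hypothetical helper, not part of Munin or RRDtool) and made-up sample values, assuming a 48-core machine whose cpu.idle is summed over all cores:

```python
def eval_rpn(expr, variables):
    """Evaluate a small subset of RRDtool-style RPN: numbers, names, + - * /."""
    ops = {
        "+": lambda a, b: a + b,
        "-": lambda a, b: a - b,
        "*": lambda a, b: a * b,
        "/": lambda a, b: a / b,
    }
    stack = []
    for token in expr.split(","):
        if token in ops:
            # Binary operator: pop the right operand first, then the left.
            b = stack.pop()
            a = stack.pop()
            stack.append(ops[token](a, b))
        elif token in variables:
            stack.append(variables[token])
        else:
            stack.append(float(token))
    if len(stack) != 1:
        raise ValueError("malformed RPN expression: " + expr)
    return stack[0]


# Hypothetical sample values, not taken from the real machine:
# summed idle of 3600 over 48 cores, and a GPU sum of 120 over 2 GPUs.
cpu_util = eval_rpn("100,cpu,48,/,-", {"cpu": 3600.0})  # 100 - (3600 / 48)
gpu_mean = eval_rpn("gpu,2,/", {"gpu": 120.0})          # 120 / 2
print(cpu_util, gpu_mean)
```

With these sample inputs, the CPU expression is 100 minus the per-core idle percentage, and the GPU expression is the sum divided by the number of GPUs.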

Below are four situations that illustrate the issues on a machine with two GPUs. Each has a short description followed by its config and output:

  1. Baseline. I can plot the CPU and individual GPU values without problem.
# Shows the individual values without problem
test0.graph_title Test 0: baseline values
test0.graph_args --base 1000 -l -100 -u 100 -r
test0.graph_vlabel CPU / GPU
test0.graph_category system
test0.graph_order \
        cpu=multigpu.example.com:cpu.idle \
        gpu1=multigpu.example.com:nvidia_gpu_utilization.utilization0 \
        gpu2=multigpu.example.com:nvidia_gpu_utilization.utilization1
test0.cpu.cdef 100,cpu,48,/,-

test 0: baseline values

  2. I can also create my intended GPU-positive-CPU-negative plot without problem for an individual GPU's utilization combined with the cdef'd CPU value.
# Correctly shows GPU0 values as positive, CPU values as negative
test1.graph_title Test 1: direct
test1.graph_args --base 1000 -l -100 -u 100 -r
test1.graph_vlabel CPU / GPU
test1.graph_category system
test1.graph_order \
        cpu=multigpu.example.com:cpu.idle \
        gpu1=multigpu.example.com:nvidia_gpu_utilization.utilization0
test1.cpu.cdef 100,cpu,48,/,-
test1.cpu.graph no
test1.gpu1.negative cpu

test 1: successful gpu1 with cpu as negative

  3. If I simply plot the CPU and the mean of the two GPUs on the same graph, the CPU values are no longer correct; they seem to be the sum of the GPU mean and the CPU values. I have no idea what is happening here.
# CPU values show up incorrect here
test2.graph_title Test 2: mean
test2.graph_args --base 1000 -l -100 -u 100 -r
test2.graph_vlabel CPU / GPU
test2.graph_category system
test2.graph_order \
        cpu=multigpu.example.com:cpu.idle \
        gpu
test2.cpu.cdef 100,cpu,48,/,-
test2.gpu.label gpu mean
test2.gpu.sum \
        multigpu.example.com:nvidia_gpu_utilization.utilization0 \
        multigpu.example.com:nvidia_gpu_utilization.utilization1
test2.gpu.cdef gpu,2,/

test 2: defining a mean gpu changes cpu values

  4. If I try to combine them into a positive/negative graph, rendering fails with "Not a valid vname: ccdefcpu" (or "ccpu" for some of the graphs) in munin-graph.log, where 'cpu' is my variable name.
test3.graph_title Test 3: up/down
test3.graph_args --base 1000 -l -100 -u 100 -r
test3.graph_vlabel CPU / GPU
test3.graph_category system
test3.graph_order \
        cpu=multigpu.example.com:cpu.idle \
        gpu
test3.cpu.cdef 100,cpu,48,/,-
test3.gpu.label gpu mean
test3.gpu.sum \
        multigpu.example.com:nvidia_gpu_utilization.utilization0 \
        multigpu.example.com:nvidia_gpu_utilization.utilization1
test3.gpu.cdef gpu,2,/
test3.cpu.graph no
test3.gpu.negative cpu

munin-graph.log:

2021/06/25 16:21:28 [RRD ERROR] Unable to graph test3-day.png : Not a valid vname: ccdefcpu in line GPRINT:ccdefcpu:LAST:%6.2lf%s/\g
2021/06/25 16:21:28 [RRD ERROR] rrdtool 'graph' 'test3-day.png' \
        '--title' \
        'Test 3: up/down - by day' \
        '--start' \
        '-2000m' \
        '--base' \
        '1000' \
        '-l' \
        '-100' \
        '-u' \
        '100' \
        '-r' \
        '--vertical-label' \
        'CPU / GPU' \
        '--slope-mode' \
        '--height' \
        '175' \
        '--width' \
        '400' \
        '--imgformat' \
        'PNG' \
        '--lazy' \
        '--font' \
        'DEFAULT:0:DejaVuSans,DejaVu Sans,DejaVu LGC Sans,Bitstream Vera Sans' \
        '--font' \
        'LEGEND:7:DejaVuSansMono,DejaVu Sans Mono,DejaVu LGC Sans Mono,Bitstream Vera Sans Mono,monospace' \
        '--color' \
        'BACK#F0F0F0' \
        '--color' \
        'FRAME#F0F0F0' \
        '--color' \
        'CANVAS#FFFFFF' \
        '--color' \
        'FONT#666666' \
        '--color' \
        'AXIS#CFD6F8' \
        '--color' \
        'ARROW#CFD6F8' \
        '--border' \
        '0' \
        '-W' \
        'Munin 2.0.66' \
        'DEF:acpu=/var/lib/munin/multigpu.example.com-cpu-idle-d.rrd:42:MAX' \
        'DEF:icpu=/var/lib/munin/multigpu.example.com-cpu-idle-d.rrd:42:MIN' \
        'DEF:gcpu=/var/lib/munin/multigpu.example.com-cpu-idle-d.rrd:42:AVERAGE' \
        'DEF:az2_1=/var/lib/munin/multigpu.example.com-nvidia_gpu_utilization-utilization1-g.rrd:42:MAX' \
        'DEF:iz2_1=/var/lib/munin/multigpu.example.com-nvidia_gpu_utilization-utilization1-g.rrd:42:MIN' \
        'DEF:gz2_1=/var/lib/munin/multigpu.example.com-nvidia_gpu_utilization-utilization1-g.rrd:42:AVERAGE' \
        'DEF:az2_0=/var/lib/munin/multigpu.example.com-nvidia_gpu_utilization-utilization0-g.rrd:42:MAX' \
        'DEF:iz2_0=/var/lib/munin/multigpu.example.com-nvidia_gpu_utilization-utilization0-g.rrd:42:MIN' \
        'DEF:gz2_0=/var/lib/munin/multigpu.example.com-nvidia_gpu_utilization-utilization0-g.rrd:42:AVERAGE' \
        'CDEF:acdefz2_0=az2_0,UN,0,az2_0,IF' \
        'CDEF:icdefz2_0=iz2_0,UN,0,iz2_0,IF' \
        'CDEF:gcdefz2_0=gz2_0,UN,0,gz2_0,IF' \
        'CDEF:ccdefz2_0=gcdefz2_0' \
        'CDEF:acdefz2_1=az2_1,UN,0,az2_1,IF,acdefz2_0,ADDNAN,2,/' \
        'CDEF:icdefz2_1=iz2_1,UN,0,iz2_1,IF,icdefz2_0,ADDNAN,2,/' \
        'CDEF:gcdefz2_1=gz2_1,UN,0,gz2_1,IF,gcdefz2_0,ADDNAN,2,/' \
        'CDEF:ccdefz2_1=gcdefz2_1' \
        'COMMENT:        ' \
        'COMMENT:Cur (-/+)' \
        'COMMENT:Min (-/+)' \
        'COMMENT:Avg (-/+)' \
        'COMMENT:Max (-/+) \j' \
        'LINE1:gcdefz2_1#00CC00:gpu mean ' \
        'GPRINT:ccdefcpu:LAST:%6.2lf%s/\g' \
        'GPRINT:ccdefz2_1:LAST:%6.2lf%s' \
        'GPRINT:icdefcpu:MIN:%6.2lf%s/\g' \
        'GPRINT:icdefz2_1:MIN:%6.2lf%s' \
        'GPRINT:gcdefcpu:AVERAGE:%6.2lf%s/\g' \
        'GPRINT:gcdefz2_1:AVERAGE:%6.2lf%s' \
        'GPRINT:acdefcpu:MAX:%6.2lf%s/\g' \
        'GPRINT:acdefz2_1:MAX:%6.2lf%s\j' \
        'CDEF:acdefcpu=100,acpu,48,/,-' \
        'CDEF:icdefcpu=100,icpu,48,/,-' \
        'CDEF:gcdefcpu=100,gcpu,48,/,-' \
        'CDEF:ccdefcpu=gcdefcpu' \
        'CDEF:re_zero=gcdefcpu,UN,0,0,IF' \
        'CDEF:ngcdefcpu=gcdefcpu,-1,*' \
        'LINE1:ngcdefcpu#00CC00' \
        'LINE1:re_zero#000000' \
        'VRULE:1624630818#999999' \
        'COMMENT:Last update\: Fri Jun 25 16\:20\:18 2021\r' \
        '--end' \
        '1624630500'
2021/06/25 16:21:28 [RRD ERROR] Unable to graph test3-week.png : Not a valid vname: ccpu in line GPRINT:ccpu:LAST:%6.2lf%s/\g
[... repeated details omitted for brevity ...]
2021/06/25 16:21:28 [RRD ERROR] Unable to graph test3-month.png : Not a valid vname: ccdefcpu in line GPRINT:ccdefcpu:LAST:%6.2lf%s/\g
[...]
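One thing that stands out in the logged command is ordering: GPRINT:ccdefcpu appears before the CDEF:ccdefcpu line that defines it, and rrdtool processes its arguments in order, so this may be what triggers the "Not a valid vname" error. The following is only a rough sketch for spotting such cases in a logged argument list. The helper name `find_undefined_uses` is hypothetical, and it only looks at GPRINT, LINE, AREA and STACK elements, not at names used inside CDEF expressions:

```python
import re


def find_undefined_uses(args):
    """Return (element, vname) pairs that reference a vname before it is defined."""
    defined = set()
    problems = []
    for arg in args:
        kind, _, rest = arg.partition(":")
        if kind in ("DEF", "CDEF", "VDEF"):
            # "CDEF:name=expression" defines "name" from this point on.
            defined.add(rest.split("=", 1)[0])
        elif kind == "GPRINT" or re.fullmatch(r"LINE[\d.]*|AREA|STACK", kind):
            # The vname ends at the first ':' or '#' (color separator).
            vname = re.split(r"[:#]", rest, maxsplit=1)[0]
            if vname not in defined:
                problems.append((kind, vname))
    return problems


# Shortened excerpt of the arguments from the log above, in the same order.
sample_args = [
    "CDEF:gcdefz2_1=gz2_1,UN,0,gz2_1,IF,gcdefz2_0,ADDNAN,2,/",
    "CDEF:ccdefz2_1=gcdefz2_1",
    "LINE1:gcdefz2_1#00CC00:gpu mean ",
    r"GPRINT:ccdefcpu:LAST:%6.2lf%s/\g",
    "GPRINT:ccdefz2_1:LAST:%6.2lf%s",
    "CDEF:gcdefcpu=100,gcpu,48,/,-",
    "CDEF:ccdefcpu=gcdefcpu",
]
problems = find_undefined_uses(sample_args)
print(problems)
```

On this excerpt the only flagged reference is the GPRINT of ccdefcpu, which matches the vname named in the error message for the day graph.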