I'm monitoring some multi-GPU machines and want to make a combined CPU/GPU utilization graph with GPU as positive and CPU as negative.
I can create such a graph just fine for a single GPU against 100 - (cpu.idle / #cores), but run into issues when trying to use the mean GPU utilization values, as calculated using sum and cdef.
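To make the arithmetic explicit, here is a minimal Python sketch (not part of my Munin setup) that evaluates the two cdef expressions used in the configs, 100,cpu,48,/,- and gpu,2,/, as reverse Polish notation. The eval_rpn helper and the sample numbers are made up for illustration, and it assumes cpu.idle is summed over all 48 cores, as the formula implies.

```python
def eval_rpn(expr, values):
    """Evaluate a comma-separated RPN expression (arithmetic operators only)."""
    ops = {
        "+": lambda a, b: a + b,
        "-": lambda a, b: a - b,
        "*": lambda a, b: a * b,
        "/": lambda a, b: a / b,
    }
    stack = []
    for token in expr.split(","):
        if token in ops:
            # Binary operator: pop the right operand first, then the left.
            b = stack.pop()
            a = stack.pop()
            stack.append(ops[token](a, b))
        elif token in values:
            # Named data source, e.g. "cpu" or "gpu".
            stack.append(float(values[token]))
        else:
            # Numeric literal.
            stack.append(float(token))
    return stack.pop()


# Hypothetical sample: 48 cores, each 50% idle, so cpu.idle sums to 2400.
cpu_util = eval_rpn("100,cpu,48,/,-", {"cpu": 2400.0})

# Hypothetical sample: two GPUs at 30% and 70%; "sum" yields 100, cdef halves it.
gpu_mean = eval_rpn("gpu,2,/", {"gpu": 30.0 + 70.0})

print(cpu_util, gpu_mean)  # 50.0 50.0
```

So the CPU field should end up as a 0 to 100 utilization percentage, and the GPU field as the plain mean of the two utilization values.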
Below are four situations to illustrate the issues for a machine with two GPUs. Config and output are shown below a short description:
- Baseline. I can plot the CPU and individual GPU values without problem.
# Shows the individual values without problem
test0.graph_title Test 0: baseline values
test0.graph_args --base 1000 -l -100 -u 100 -r
test0.graph_vlabel CPU / GPU
test0.graph_category system
test0.graph_order \
cpu=multigpu.example.com:cpu.idle \
gpu1=multigpu.example.com:nvidia_gpu_utilization.utilization0 \
gpu2=multigpu.example.com:nvidia_gpu_utilization.utilization1
test0.cpu.cdef 100,cpu,48,/,-
- I can also create my intended GPU-positive-CPU-negative plot without problem for an individual GPU's utilization combined with the cdef'd CPU value.
# Correctly shows GPU0 values as positive, CPU values as negative
test1.graph_title Test 1: direct
test1.graph_args --base 1000 -l -100 -u 100 -r
test1.graph_vlabel CPU / GPU
test1.graph_category system
test1.graph_order \
cpu=multigpu.example.com:cpu.idle \
gpu1=multigpu.example.com:nvidia_gpu_utilization.utilization0
test1.cpu.cdef 100,cpu,48,/,-
test1.cpu.graph no
test1.gpu1.negative cpu
- If I simply plot the CPU and the mean of the two GPUs on the same graph, the CPU values are no longer correct; they seem to be the sum of the GPU mean and the CPU values. I have no idea what is happening here.
# CPU values show up incorrect here
test2.graph_title Test 2: mean
test2.graph_args --base 1000 -l -100 -u 100 -r
test2.graph_vlabel CPU / GPU
test2.graph_category system
test2.graph_order \
cpu=multigpu.example.com:cpu.idle \
gpu
test2.cpu.cdef 100,cpu,48,/,-
test2.gpu.label gpu mean
test2.gpu.sum \
multigpu.example.com:nvidia_gpu_utilization.utilization0 \
multigpu.example.com:nvidia_gpu_utilization.utilization1
test2.gpu.cdef gpu,2,/
- If I try to combine them into a positive/negative graph, the rendering errors out with "Not a valid vname ccpu" in munin-graph.log (where 'cpu' is my variable name).
test3.graph_title Test 3: up/down
test3.graph_args --base 1000 -l -100 -u 100 -r
test3.graph_vlabel CPU / GPU
test3.graph_category system
test3.graph_order \
cpu=multigpu.example.com:cpu.idle \
gpu
test3.cpu.cdef 100,cpu,48,/,-
test3.gpu.label gpu mean
test3.gpu.sum \
multigpu.example.com:nvidia_gpu_utilization.utilization0 \
multigpu.example.com:nvidia_gpu_utilization.utilization1
test3.gpu.cdef gpu,2,/
test3.cpu.graph no
test3.gpu.negative cpu
munin-graph.log:
2021/06/25 16:21:28 [RRD ERROR] Unable to graph test3-day.png : Not a valid vname: ccdefcpu in line GPRINT:ccdefcpu:LAST:%6.2lf%s/\g
2021/06/25 16:21:28 [RRD ERROR] rrdtool 'graph' 'test3-day.png' \
'--title' \
'Test 3: up/down - by day' \
'--start' \
'-2000m' \
'--base' \
'1000' \
'-l' \
'-100' \
'-u' \
'100' \
'-r' \
'--vertical-label' \
'CPU / GPU' \
'--slope-mode' \
'--height' \
'175' \
'--width' \
'400' \
'--imgformat' \
'PNG' \
'--lazy' \
'--font' \
'DEFAULT:0:DejaVuSans,DejaVu Sans,DejaVu LGC Sans,Bitstream Vera Sans' \
'--font' \
'LEGEND:7:DejaVuSansMono,DejaVu Sans Mono,DejaVu LGC Sans Mono,Bitstream Vera Sans Mono,monospace' \
'--color' \
'BACK#F0F0F0' \
'--color' \
'FRAME#F0F0F0' \
'--color' \
'CANVAS#FFFFFF' \
'--color' \
'FONT#666666' \
'--color' \
'AXIS#CFD6F8' \
'--color' \
'ARROW#CFD6F8' \
'--border' \
'0' \
'-W' \
'Munin 2.0.66' \
'DEF:acpu=/var/lib/munin/multigpu.example.com-cpu-idle-d.rrd:42:MAX' \
'DEF:icpu=/var/lib/munin/multigpu.example.com-cpu-idle-d.rrd:42:MIN' \
'DEF:gcpu=/var/lib/munin/multigpu.example.com-cpu-idle-d.rrd:42:AVERAGE' \
'DEF:az2_1=/var/lib/munin/multigpu.example.com-nvidia_gpu_utilization-utilization1-g.rrd:42:MAX' \
'DEF:iz2_1=/var/lib/munin/multigpu.example.com-nvidia_gpu_utilization-utilization1-g.rrd:42:MIN' \
'DEF:gz2_1=/var/lib/munin/multigpu.example.com-nvidia_gpu_utilization-utilization1-g.rrd:42:AVERAGE' \
'DEF:az2_0=/var/lib/munin/multigpu.example.com-nvidia_gpu_utilization-utilization0-g.rrd:42:MAX' \
'DEF:iz2_0=/var/lib/munin/multigpu.example.com-nvidia_gpu_utilization-utilization0-g.rrd:42:MIN' \
'DEF:gz2_0=/var/lib/munin/multigpu.example.com-nvidia_gpu_utilization-utilization0-g.rrd:42:AVERAGE' \
'CDEF:acdefz2_0=az2_0,UN,0,az2_0,IF' \
'CDEF:icdefz2_0=iz2_0,UN,0,iz2_0,IF' \
'CDEF:gcdefz2_0=gz2_0,UN,0,gz2_0,IF' \
'CDEF:ccdefz2_0=gcdefz2_0' \
'CDEF:acdefz2_1=az2_1,UN,0,az2_1,IF,acdefz2_0,ADDNAN,2,/' \
'CDEF:icdefz2_1=iz2_1,UN,0,iz2_1,IF,icdefz2_0,ADDNAN,2,/' \
'CDEF:gcdefz2_1=gz2_1,UN,0,gz2_1,IF,gcdefz2_0,ADDNAN,2,/' \
'CDEF:ccdefz2_1=gcdefz2_1' \
'COMMENT: ' \
'COMMENT:Cur (-/+)' \
'COMMENT:Min (-/+)' \
'COMMENT:Avg (-/+)' \
'COMMENT:Max (-/+) \j' \
'LINE1:gcdefz2_1#00CC00:gpu mean ' \
'GPRINT:ccdefcpu:LAST:%6.2lf%s/\g' \
'GPRINT:ccdefz2_1:LAST:%6.2lf%s' \
'GPRINT:icdefcpu:MIN:%6.2lf%s/\g' \
'GPRINT:icdefz2_1:MIN:%6.2lf%s' \
'GPRINT:gcdefcpu:AVERAGE:%6.2lf%s/\g' \
'GPRINT:gcdefz2_1:AVERAGE:%6.2lf%s' \
'GPRINT:acdefcpu:MAX:%6.2lf%s/\g' \
'GPRINT:acdefz2_1:MAX:%6.2lf%s\j' \
'CDEF:acdefcpu=100,acpu,48,/,-' \
'CDEF:icdefcpu=100,icpu,48,/,-' \
'CDEF:gcdefcpu=100,gcpu,48,/,-' \
'CDEF:ccdefcpu=gcdefcpu' \
'CDEF:re_zero=gcdefcpu,UN,0,0,IF' \
'CDEF:ngcdefcpu=gcdefcpu,-1,*' \
'LINE1:ngcdefcpu#00CC00' \
'LINE1:re_zero#000000' \
'VRULE:1624630818#999999' \
'COMMENT:Last update\: Fri Jun 25 16\:20\:18 2021\r' \
'--end' \
'1624630500'
2021/06/25 16:21:28 [RRD ERROR] Unable to graph test3-week.png : Not a valid vname: ccpu in line GPRINT:ccpu:LAST:%6.2lf%s/\g
[... repeated details omitted for brevity ...]
2021/06/25 16:21:28 [RRD ERROR] Unable to graph test3-month.png : Not a valid vname: ccdefcpu in line GPRINT:ccdefcpu:LAST:%6.2lf%s/\g
[...]
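One thing that stands out in the logged rrdtool command: GPRINT:ccdefcpu appears before CDEF:ccdefcpu is defined, and as far as I can tell rrdtool needs a vname to be defined before it is used. The Python sketch below checks an argument list for that ordering problem. The function name find_undefined_uses is hypothetical, and the argument list is an abbreviated excerpt of the logged command (AVERAGE series only), so this is a diagnostic aid, not an explanation of why Munin emits the arguments in this order.

```python
import re


def find_undefined_uses(args):
    """Return (index, directive, vname) for each vname used before it is defined."""
    defined = set()
    problems = []
    for index, arg in enumerate(args):
        kind, _, rest = arg.partition(":")
        if kind in ("DEF", "CDEF", "VDEF"):
            # "CDEF:name=expr" defines "name" from this point onward.
            name, _, _ = rest.partition("=")
            defined.add(name)
        elif kind == "GPRINT" or re.fullmatch(r"LINE[\d.]*|AREA|STACK", kind):
            # The vname is everything up to the first ':' or '#'.
            vname = re.split(r"[:#]", rest, maxsplit=1)[0]
            if vname not in defined:
                problems.append((index, kind, vname))
    return problems


# Abbreviated excerpt of the logged command, in the same order as in the log.
args = [
    "DEF:gcpu=/var/lib/munin/multigpu.example.com-cpu-idle-d.rrd:42:AVERAGE",
    "DEF:gz2_1=/var/lib/munin/multigpu.example.com-nvidia_gpu_utilization-utilization1-g.rrd:42:AVERAGE",
    "DEF:gz2_0=/var/lib/munin/multigpu.example.com-nvidia_gpu_utilization-utilization0-g.rrd:42:AVERAGE",
    "CDEF:gcdefz2_0=gz2_0,UN,0,gz2_0,IF",
    "CDEF:gcdefz2_1=gz2_1,UN,0,gz2_1,IF,gcdefz2_0,ADDNAN,2,/",
    "CDEF:ccdefz2_1=gcdefz2_1",
    "LINE1:gcdefz2_1#00CC00:gpu mean ",
    r"GPRINT:ccdefcpu:LAST:%6.2lf%s/\g",
    "GPRINT:ccdefz2_1:LAST:%6.2lf%s",
    "CDEF:gcdefcpu=100,gcpu,48,/,-",
    "CDEF:ccdefcpu=gcdefcpu",
]

for index, kind, vname in find_undefined_uses(args):
    print(f"arg {index}: {kind} uses '{vname}' before it is defined")
# arg 7: GPRINT uses 'ccdefcpu' before it is defined
```

The check only looks at GPRINT and drawing directives; it does not verify the names referenced inside CDEF expressions.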