I have two computers, Orin and NUC, which are connected directly via an ethernet cable.
NUC is connected to a router with internet via WiFi (mentioned for completeness; I hope it's not relevant to the problem), and is providing network access to Orin via the following lines in a sourced file:
sudo nmcli c up id orin ifname enp88s0
sudo iptables -t nat -A POSTROUTING -o enp88s0 -j MASQUERADE
The Orin's networking was configured via the GUI with address 192.168.3.2, netmask 255.255.255.0, and gateway 192.168.3.1 (the IP of the NUC).
This configuration, at the very least, gives Orin access to the internet (and makes it SSH-able from other computers on the wireless network).
[Figure: Network diagram with Orin connected to NUC via ethernet]
I was originally sending 3MB images using ROS from Orin to NUC and losing a good chunk of them: when sending at 30Hz, I'd get them on the NUC at 20Hz; when sending at 15Hz, I'd get them at 10Hz. If I tried to receive using another computer on the wireless network, the receive rate would tank to 5Hz or even 1Hz. Furthermore, the rate at which I'd get them fluctuated randomly; it wasn't losing every second message consistently, for example.
If I sent and received both on the Orin, I'd get them at the full 30Hz, or about 100MB/s.
I ran iperf, first with the NUC as the server, then with the Orin, and in both cases it reported 2.36Gb/s of bandwidth (both machines have 2.5Gb ethernet). This means the full 30Hz send rate with ROS, about 100MB/s, certainly isn't saturating the connection.
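To make the comparison above explicit, here is a small sketch of the unit conversion. It assumes decimal units (1 Gb = 1e9 bits, 1 MB = 1e6 bytes) and uses the 3MB message size and 30Hz rate quoted earlier; the function names are only for illustration.

```python
def gbit_to_mbyte_per_s(gbit_per_s: float) -> float:
    """Convert gigabits per second to megabytes per second (decimal units)."""
    return gbit_per_s * 1e9 / 8 / 1e6


def stream_rate_mbyte_per_s(message_mb: float, rate_hz: float) -> float:
    """Bandwidth needed to send messages of a given size at a given rate."""
    return message_mb * rate_hz


link = gbit_to_mbyte_per_s(2.36)         # bandwidth reported by iperf
stream = stream_rate_mbyte_per_s(3, 30)  # 3MB images at 30Hz

print(f'iperf link capacity: {link:.0f} MB/s')
print(f'ROS stream demand:   {stream:.0f} MB/s')
print(f'link utilisation:    {stream / link:.0%}')
```

Under these assumptions the image stream needs well under half of what iperf measured, which is why the link itself does not look like the bottleneck.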
In order to rule out ROS as the culprit, I wrote two scripts which send an image again and again as fast as possible over TCP using sockets in Python; I've attached them at the bottom of this post. Running the socket scripts and sending a 4MB image from the Orin to the NUC gives 56MB/s. If I instead send an 80KB image, I get 300MB/s, so full connection saturation.
If I send the 4MB image in the opposite direction, from the NUC to the Orin, I also get 300MB/s.
So the final problem statement seems to be: sending large TCP/TCPROS messages from the Orin to the NUC results in significantly lower throughput than their connection can handle.
Is it the ethernet-to-wifi bridging configuration? Is it an issue with how I am/ROS is sending TCP packets?
Edit: I've renamed this post to reflect the fact that messages are not being lost, since the number of messages sent and received is the same.
Edit 2: I reduced the MTU size on both Orin and NUC from 1500 (actually 1466 according to ifconfig) first to 966, then to 466, which increased throughput with the 4MB image from the Orin to the NUC to 83MB/s and 100MB/s respectively. Reducing the MTU further to 216 gave 65MB/s, so the gains do not continue, and none of these settings saturates the hardware. I'm not sure if this is a hint or unrelated though.
sudo ifconfig eth0 down
sudo ifconfig eth0 mtu 1500 # ends up being 1466
sudo ifconfig eth0 up
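For a rough sense of what the MTU change means per image, the sketch below counts how many TCP segments a 4MB message needs at each MTU tried in Edit 2. It assumes 40 bytes of headers per packet (20-byte IPv4 plus 20-byte TCP, no options), which is a simplification, and it does not explain why the smaller MTUs performed better; it only shows the packet-count and overhead trade-off.

```python
import math

# Assumption: 20-byte IPv4 header + 20-byte TCP header, no options.
HEADER_BYTES = 40


def segments_per_message(message_bytes: int, mtu: int) -> int:
    """Number of TCP segments needed to carry one message at a given MTU."""
    payload_per_segment = mtu - HEADER_BYTES
    return math.ceil(message_bytes / payload_per_segment)


MESSAGE_BYTES = 4_000_000  # the 4MB test image

for mtu in (1466, 966, 466, 216):
    count = segments_per_message(MESSAGE_BYTES, mtu)
    overhead = HEADER_BYTES / mtu
    print(f'MTU {mtu:>4}: {count:>6} segments, {overhead:.1%} header overhead')
```

Smaller MTUs mean more packets and a larger share of each packet spent on headers, so a throughput peak somewhere in the middle (as seen at 466) suggests something other than raw link capacity is limiting the transfer.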
Edit 3: I sent a 500MB .zip file between Orin and NUC via rsync; from Orin to NUC gave 18MB/s, from NUC to Orin gave 40MB/s. Not sure if this gives additional information though.
rsync -azvhPr test.zip [email protected]:/home/orin
Edit 4: I removed the ethernet-to-wifi bridge between Orin and NUC by not running the nmcli and iptables commands above, then configured both devices to treat ethernet as its own network by setting their addresses on those interfaces to 10.0.0.x, netmask still 255.255.255.0, no gateway. Running the socket scripts with 4MB images now gives 300MB/s from Orin to NUC, effectively saturating their connection.
I think the network bridge to give the Orin internet via the NUC was causing the low throughput issue, although I lack the knowledge to say why this is the case.
A nice little kicker to wrap this all up: the original use-case was sending images via ROS. We've since switched from rgb to mono images, which reduced the total message size from 4MB to 0.9MB, which allows for full 15Hz or 30Hz message reading with the original, somehow-broken network configuration.
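A back-of-the-envelope check of why the smaller messages fit is sketched below. It assumes the 56MB/s measured with the raw socket scripts (4MB messages, Orin to NUC) also applies to ROS traffic and to the 0.9MB messages. Neither is established above, and the 80KB test suggests smaller messages may actually achieve more, so treat the mono figure as conservative.

```python
def max_message_rate_hz(throughput_mb_per_s: float, message_mb: float) -> float:
    """Highest message rate a given throughput can sustain."""
    return throughput_mb_per_s / message_mb


# Assumption: throughput measured with the socket scripts over the
# bridged configuration, reused here for both message sizes.
MEASURED_MB_PER_S = 56.0

for label, size_mb in (('rgb', 4.0), ('mono', 0.9)):
    rate = max_message_rate_hz(MEASURED_MB_PER_S, size_mb)
    print(f'{label:>4} ({size_mb}MB): up to {rate:.1f} messages/s')
```

With these numbers the 4MB messages top out just under 15Hz, while the 0.9MB messages have headroom beyond 30Hz, which is consistent with the behaviour described.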
Code
Note: the HOST address is given using the subnet addresses. So for the NUC, instead of 192.168.1.243, its router-given IP, I'm using 192.168.3.1, its IP address on the subnet shared with the Orin. In the case of Edit 4, the IP is 10.0.0.1.
sender.py
# Despite the filename, this script is the receiving side: it binds,
# accepts one connection, and reads length-prefixed messages from it.
import socket
import struct
import pickle
import logging
import time
from wrapper import Wrapper

logging.getLogger().setLevel(logging.DEBUG)

# ip address and port of this device
# if sender and receiver both on this device, localhost is fine
HOST = '192.168.3.1'
PORT = 50007

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind((HOST, PORT))
s.listen()
conn, addr = s.accept()
print('Connected by ' + str(addr))

messages_received = 0
total_message_bytes = 0
end_of_messages = False
first_message = True
# fallback so the stats below still work if interrupted before any message
start_time = time.time()

try:
    while not end_of_messages:
        # 4-byte big-endian payload size
        dsp = conn.recv(4)
        if first_message:
            # start timing at the first message, not at connection time
            start_time = time.time()
            first_message = False
        if len(dsp) != 4:
            logging.info('End of messages')
            end_of_messages = True
            continue
        data_size = struct.unpack('>I', dsp)[0]
        data_id = struct.unpack('>I', conn.recv(4))[0]

        # keep reading until the whole payload has arrived
        recv_payload = b''
        recv_remaining = data_size
        while recv_remaining != 0:
            recv_payload += conn.recv(recv_remaining)
            recv_remaining = data_size - len(recv_payload)

        wrapper = pickle.loads(recv_payload)
        messages_received += 1
        total_message_bytes += data_size
        #print(wrapper.rostopic)
        #print(wrapper.timestamp)
        #cv2.imshow("Display window", wrapper.msg)
except KeyboardInterrupt:
    logging.info('Closing program')

end_time = time.time()
total_time = end_time - start_time
mb = total_message_bytes / 1000000
logging.info(f'Total time: {total_time}')
logging.info(f'Total messages: {messages_received}')
logging.info(f'Messages per second: {messages_received / total_time}')
logging.info(f'MB received: {mb}')
logging.info(f'MB per second: {mb / total_time}')
if messages_received:
    logging.info(f'MB per message: {mb / messages_received}')
conn.close()
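One detail in the receiving loop: `recv(4)` is allowed to return fewer than 4 bytes even while the connection is still open, in which case the loop above would treat a partial header as the end of messages. A possible helper that reads exactly `n` bytes is sketched here. The name `recv_exact` is hypothetical, and unlike the script above it raises `ConnectionError` when the peer closes instead of returning a short read. I have not verified whether this has any bearing on the throughput numbers.

```python
import socket


def recv_exact(conn: socket.socket, n: int) -> bytes:
    """Read exactly n bytes from conn, or raise if the peer closes first."""
    buf = bytearray(n)
    view = memoryview(buf)
    received = 0
    while received < n:
        # recv_into writes straight into the preallocated buffer,
        # avoiding repeated bytes concatenation for large payloads
        count = conn.recv_into(view[received:], n - received)
        if count == 0:
            raise ConnectionError('peer closed the connection mid-message')
        received += count
    return bytes(buf)
```

In the loop above it could stand in for both header reads and for the inner payload loop, for example `recv_exact(conn, 4)` followed by `recv_exact(conn, data_size)`.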
receiver.py
# Despite the filename, this script is the sending side: it connects to
# the server above and sends the same pickled message in a loop.
import socket
import struct
import pickle
import time
import logging
from wrapper import Wrapper

logging.getLogger().setLevel(logging.DEBUG)

with open('image.jpg', 'rb') as image:
    img = image.read()
logging.info(f'Image, when opened as pure bytes, is {len(pickle.dumps(img))/1000000}MB')
wrapper = Wrapper(rostopic='sensor_msgs/Image', msg=img, timestamp=1999999)

# should be IP and port of the server to send messages to
# if sender and receiver both on this device, localhost is fine
HOST = '192.168.3.1'
PORT = 50007
CONNECT_RETRY = 5

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
start = time.time()

connected = False
while not connected:
    try:
        s.connect((HOST, PORT))
    except ConnectionRefusedError:
        logging.info(f'Could not connect to server, will try again in {CONNECT_RETRY} seconds')
        time.sleep(CONNECT_RETRY)
    else:
        connected = True
logging.info("Connected to server")

messages_sent = 0
try:
    while True:
        # Pickle the object and send it to the server
        data_pickled = pickle.dumps(wrapper)
        s.sendall(struct.pack('>I', len(data_pickled)))  # I = unsigned int (4 bytes), > = big endian
        s.sendall(struct.pack('>I', 1))  # message id, fixed at 1 for this test
        s.sendall(data_pickled)
        messages_sent += 1
except KeyboardInterrupt:
    logging.info('Closing program')
    s.shutdown(socket.SHUT_RDWR)
except ConnectionError:
    logging.info('Connection closed/reset/aborted, closing program')
s.close()
logging.debug(f'Messages sent: {messages_sent}')
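The sending loop issues three `sendall` calls per message (size, id, payload). As a sketch only, the two header fields and the payload could be packed into one buffer and sent with a single call. The names `HEADER`, `pack_frame`, and `unpack_header` are hypothetical, and I have not tested whether this changes the throughput seen over the bridged configuration.

```python
import struct

# Same wire format as the scripts above: two big-endian unsigned 32-bit
# integers (payload size, message id) followed by the payload.
HEADER = struct.Struct('>II')


def pack_frame(payload: bytes, data_id: int = 1) -> bytes:
    """Build one message frame: 8-byte header plus payload."""
    return HEADER.pack(len(payload), data_id) + payload


def unpack_header(header: bytes):
    """Return (payload_size, data_id) from an 8-byte header."""
    return HEADER.unpack(header)


frame = pack_frame(b'example payload', data_id=1)
size, data_id = unpack_header(frame[:HEADER.size])
print(size, data_id)
```

In the loop above this would replace the three calls with `s.sendall(pack_frame(data_pickled))`, at the cost of one extra copy of the payload per message.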
wrapper.py
class Wrapper:
    """
    Wraps a ROS message, and additionally specifies the rostopic
    and timestamp.
    """
    def __init__(self, rostopic, msg, timestamp):
        self.rostopic = rostopic
        self.msg = msg
        self.timestamp = timestamp