Score:1

What is causing large TCP messages to have low throughput in one direction when using an ethernet-to-wifi bridge?

I have two computers, Orin and NUC, which are connected via ethernet cable directly. NUC is connected to a router with internet via WiFi (for completeness, not relevant to problem, I hope), and is providing network access to Orin via the following lines in a sourced file:

sudo nmcli c up id orin ifname enp88s0
sudo iptables -t nat -A POSTROUTING -o enp88s0 -j MASQUERADE

The Orin's networking was configured via the GUI with address 192.168.3.2, netmask 255.255.255.0, and gateway 192.168.3.1 (the IP of the NUC). This configuration, at the very least, gives the Orin access to the internet (and makes it SSH-able from other computers on the wireless network).

Network diagram with Orin connected to NUC via ethernet

I was originally sending 3MB images using ROS from the Orin to the NUC and losing a good chunk of them; if I sent at 30Hz, I'd get them on the NUC at 20Hz. If I sent at 15Hz, I'd get them at 10Hz. If I tried to receive on another computer on the wireless network, the receive rate would drop to 5Hz or even 1Hz. Furthermore, the pattern of arrivals was random; it wasn't consistently losing every second message, for example. If I sent and received both on the Orin, I'd get them at the full 30Hz, or about 100MB/s.

I ran iperf, first with the NUC, then with the Orin as the server, and in both cases it reported 2.36 Gbit/s of bandwidth (both machines have 2.5 Gbit/s ethernet). This means the full 30Hz send rate with ROS, about 100MB/s, certainly isn't saturating the connection.
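To make the comparison above explicit, here is a small sketch of the unit conversion. It assumes the iperf figure is in gigabits per second (decimal units) and uses the approximate 3MB image size quoted earlier; the variable names are only illustrative.

```python
# Convert the iperf figure to bytes per second and compare it with the ROS load.
# Assumption: iperf reported gigabits per second in decimal units.
LINK_GBIT_PER_S = 2.36   # iperf result quoted above
IMAGE_MB = 3             # approximate image size quoted above
RATE_HZ = 30             # ROS publish rate

link_mb_per_s = LINK_GBIT_PER_S * 1000 / 8   # Gbit/s -> MB/s
ros_mb_per_s = IMAGE_MB * RATE_HZ            # payload the ROS publisher generates

print(f"link capacity : {link_mb_per_s:.0f} MB/s")
print(f"ROS image load: {ros_mb_per_s} MB/s")
print(f"utilisation   : {ros_mb_per_s / link_mb_per_s:.0%}")
```

Under those assumptions the link can carry roughly 295MB/s, so the 30Hz image stream needs only about a third of it.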

In order to remove ROS as the culprit, I wrote two scripts which send an image again and again as fast as possible over TCP using sockets in Python; I've attached them at the bottom of this post. Running the socket scripts and sending a 4MB image from the Orin to the NUC gives 56MB/s. If I instead send an 80KB image, I get 300MB/s, so full connection saturation. If I send the 4MB image in the opposite direction, from the NUC to the Orin, I also get 300MB/s.

So the final problem statement seems to be: sending large TCP/TCPROS messages from the Orin to the NUC results in significantly lower throughput than their connection can handle.

Is it the ethernet-to-wifi bridging configuration? Is it an issue with how I am/ROS is sending TCP packets?

Edit: I've renamed this post to reflect the fact that messages are not being lost, since the number of messages sent and the number received are the same.

Edit 2: I reduced the MTU size on both Orin and NUC from 1500 (actually 1466 with ifconfig) to first 966, then to 466, increasing throughput with the 4MB image from the Orin to the NUC to 83MB/s and 100MB/s respectively. Reducing the MTU size to 216 gave 65MB/s, giving diminishing returns and also not saturating the hardware. I'm not sure if this is a hint or unrelated though.

sudo ifconfig eth0 down
sudo ifconfig eth0 mtu 1500 # ends up being 1466
sudo ifconfig eth0 up
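As a rough sanity check on Edit 2, the sketch below estimates what fraction of on-wire bytes carries TCP payload at each MTU tried. It assumes IPv4 and TCP headers without options over plain Ethernet (no VLAN tag); TCP options such as timestamps would lower the numbers slightly. The function name is hypothetical.

```python
# Per-packet overhead, assuming IPv4 + TCP without options over plain Ethernet.
IP_TCP_HEADERS = 20 + 20              # bytes inside the MTU: IPv4 header + TCP header
ETHERNET_OVERHEAD = 14 + 4 + 8 + 12   # header + FCS + preamble + inter-frame gap

def payload_efficiency(mtu: int) -> float:
    """Fraction of on-wire bytes that carry TCP payload at a given MTU."""
    return (mtu - IP_TCP_HEADERS) / (mtu + ETHERNET_OVERHEAD)

for mtu in (1466, 966, 466, 216):
    print(f"MTU {mtu:>4}: {payload_efficiency(mtu):.1%} payload")
```

Efficiency falls as the MTU shrinks, so header overhead alone would predict lower throughput at smaller MTUs. The gains seen at 966 and 466 therefore probably come from somewhere else in the path, though this estimate cannot say where.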

Edit 3: I sent a 500MB .zip file between the Orin and the NUC via rsync; from Orin to NUC gave 18MB/s, from NUC to Orin gave 40MB/s. Not sure if this gives additional information though.

rsync -azvhPr test.zip [email protected]:/home/orin

Edit 4: I removed the ethernet-to-wifi bridge between Orin and NUC by not running the nmcli and iptables commands above, then configured both devices to treat ethernet like its own network by setting their addresses on those interfaces to 10.0.0.x, netmask still 255.255.255.0, no gateway. Running the socket scripts with 4MB images now gives 300MB/s from Orin to NUC, effectively saturating their connection. I think the network bridge to give the Orin internet via the NUC was causing the low throughput issue, although I lack the knowledge to say why this is the case.

A nice little kicker to wrap this all up: the original use-case was sending images via ROS. We've since switched from rgb to mono images, which reduced the total message size from 4MB to 0.9MB, which allows for full 15Hz or 30Hz message reading with the original, somehow-broken network configuration.

Code

Note: the HOST address is given using the subnet addresses. So for the NUC, instead of 192.168.1.243, its router-given IP, I'm using 192.168.3.1, its subnet-with-the-Orin IP address. In the case of Edit 4, the IP is 10.0.0.1.

sender.py

import socket
import struct
import pickle
import logging
import time
from wrapper import Wrapper

logging.getLogger().setLevel(logging.DEBUG)

# ip address and port of this device
# if sender and receiver both on this device, localhost is fine
HOST = '192.168.3.1'
PORT = 50007

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind((HOST, PORT))
s.listen()
conn, addr = s.accept()
print('Connected by ' + str(addr))

messages_received = 0
total_message_bytes = 0
end_of_messages = False
first_message = True

try:
    while not end_of_messages:
        dsp = conn.recv(4)
        if first_message:
            start_time = time.time()
            first_message = False
        if len(dsp) != 4:
            logging.info('End of messages')
            end_of_messages = True
            continue
        data_size = struct.unpack('>I', dsp)[0]
        data_id = struct.unpack('>I', conn.recv(4))[0]
        recv_payload = b''
        recv_remaining = data_size

        while recv_remaining != 0:
            recv_payload += conn.recv(recv_remaining)
            recv_remaining = data_size - len(recv_payload)
        wrapper = pickle.loads(recv_payload)

        messages_received += 1
        total_message_bytes += data_size

        #print(wrapper.rostopic)
        #print(wrapper.timestamp)
        #cv2.imshow("Display window", wrapper.msg)
except KeyboardInterrupt:
    logging.info('Closing program')

end_time = time.time()
total_time = end_time - start_time
mb = total_message_bytes / 1000000

logging.info(f'Total time: {total_time}')
logging.info(f'Total messages: {messages_received}')
logging.info(f'Messages per second: {messages_received / total_time}')
logging.info(f'MB received: {mb}')
logging.info(f'MB per second: {mb / total_time}')
logging.info(f'MB per message: {mb / messages_received}')

conn.close()
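One thing worth ruling out in the receive loop above: `conn.recv(4)` can return fewer than 4 bytes mid-stream, the payload loop never ends if the peer closes early, and `recv_payload += ...` copies the growing buffer on every chunk. A minimal sketch of a helper that avoids all three is below; `recv_exact` is a hypothetical name, and whether this affects the measured throughput is untested here.

```python
import socket

def recv_exact(conn: socket.socket, size: int) -> bytes:
    """Read exactly `size` bytes, or raise ConnectionError if the peer closes early."""
    buf = bytearray(size)        # preallocate once instead of growing a bytes object
    view = memoryview(buf)       # lets recv_into write at an offset without copying
    received = 0
    while received < size:
        n = conn.recv_into(view[received:])
        if n == 0:               # an empty read means the peer closed the connection
            raise ConnectionError(f"peer closed after {received} of {size} bytes")
        received += n
    return bytes(buf)
```

With this helper, the header could be read as `recv_exact(conn, 8)` and unpacked with `struct.unpack('>II', ...)`, followed by `recv_exact(conn, data_size)` for the payload.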

receiver.py

import socket
import struct
import pickle
import time
import logging
from wrapper import Wrapper

logging.getLogger().setLevel(logging.DEBUG)

with open('image.jpg', 'r+b') as image:
    img = image.read()
logging.info(f'Image, when opened as pure bytes, is {len(pickle.dumps(img))/1000000}MB')

wrapper = Wrapper(rostopic='sensor_msgs/Image', msg=img, timestamp=1999999)

# should be IP and port of server sending messages to
# if sender and receiver both on this device, localhost is fine
HOST = '192.168.3.1'
PORT = 50007
CONNECT_RETRY = 5

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

start = time.time()
connected = False
while not connected:
    try:
        s.connect((HOST, PORT))
    except ConnectionRefusedError:
        logging.info(f'Could not connect to server, will try again in {CONNECT_RETRY} seconds')
        time.sleep(CONNECT_RETRY)
    else:
        connected = True
        logging.info("Connected to server")

messages_sent = 0

try:
    while True:
        # Pickle the object and send it to the server
        data_pickled = pickle.dumps(wrapper)
        s.sendall(struct.pack('>I', len(data_pickled))) # I = unsigned int (4 bytes), > = big endian
        s.sendall(struct.pack('>I', 1)) # maybe use long unsigned int, just in case
        s.sendall(data_pickled)
        messages_sent += 1
except KeyboardInterrupt:
    logging.info('Closing program')
    s.shutdown(socket.SHUT_RDWR)
except ConnectionError:
    logging.info('Connection closed/reset/aborted closing program')

s.close()
logging.debug(f'Messages sent: {messages_sent}')
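The send loop above issues two 4-byte writes before each large payload. As a hedged variation to test, the sketch below packs both header fields together and sends header and payload in a single `sendall` call. `HEADER` and `send_framed` are hypothetical names, and nothing here has been shown to change the throughput in this setup.

```python
import socket
import struct

# Payload length and message id, both unsigned 32-bit, big endian (same wire format as above).
HEADER = struct.Struct('>II')

def send_framed(sock: socket.socket, payload: bytes, msg_id: int = 1) -> None:
    """Send the 8-byte header and the payload with one sendall call instead of three."""
    sock.sendall(HEADER.pack(len(payload), msg_id) + payload)
```

Since the wrapper never changes between iterations, `pickle.dumps(wrapper)` could also be called once before the loop so that pickling time is not part of the measurement. Setting `s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)` on the TCP socket is another variable that could be tried.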

wrapper.py

class Wrapper:
    """
    Wraps a ROS message, and additionally specifies the rostopic
    and timestamp.
    """
    def __init__(self, rostopic, msg, timestamp):
        self.rostopic = rostopic
        self.msg = msg
        self.timestamp = timestamp

vidarlo
What kind of wireless? Why are you writing your own test code? Can you repeat bandwidth tests with iperf/iperf3, so we have standardized output to look at? Pushing 100MB/s over wifi is a tough challenge.
Joshua O'Reilly
Just some generic router running wifi at 5GHz. The main issue is the limited bandwidth between the Orin and NUC over direct ethernet connection; the wireless tests are just to illustrate that it's the connection between those two devices over ethernet that's the issue. I can remove all mention of tests over wifi if it helps avoid confusion as to what I'm asking.
vidarlo
I suggest you rewrite your question to focus on your problem, and remove everything *not* related to your problem, and also include a topology.
Joshua O'Reilly
Removed WiFi tests and all mentions of WiFi besides the bare minimum in the intro; is a network diagram necessary when it's two devices connected directly via ethernet?
Joshua O'Reilly
Added network diagram.
Nikita Kipriyanov
Theoretically the *congested return path* may make the forward TCP flow slow because returned ACKs are delayed (buffered on the sending side of the congested link), the window is full and there is no way for the congestion control algorithm to adjust.
Joshua O'Reilly
Using a smaller image (80KB), or reversing which computer is sending solves the issue and results in a much greater number of messages being sent per second; I have very little networking experience, but wouldn't that indicate it's not a response being congested, assuming response size is not proportionate to the initial message size?
Joshua O'Reilly
@vidarlo I forgot to ask this in response to your first comment, can you recommend any testing tools to help me determine the root cause for this?
Score:0

I switched the network configuration from an ethernet-to-wifi bridge to simply treating ethernet on both machines as its own network. Following this askubuntu answer, for the ethernet interfaces of both Orin and NUC, I set their addresses to 10.0.0.x, netmasks to 255.255.255.0, and set no gateway. I also didn't run the nmcli and iptables commands.

Doing this allowed me to get full utilization of their direct 2.5 Gbit/s ethernet connection when running the socket scripts. This seems to indicate that the network bridge was causing the low throughput issues, at least when sending images using sockets. While ROS message throughput is still limited even with this configuration, I'm willing to chalk it up to ROS not working well with large messages.
