Our team has to host and maintain a publicly accessible FTP server for our application. It currently runs on a Google Compute Engine VM, and the FTP server software in use is Pure-FTPd; we have also tried FileMage.
When we access the server at its public address with any mainstream FTP client, such as FileZilla, it works perfectly fine and can ingest/export large amounts of data at once without any issues.
The problem
In our pipeline, there is a step where a large number of files are programmatically downloaded from this server, processed, and then re-exported to an adjacent directory. By a large number, I mean ~2000-8000 files of ~20 MB each going back and forth.
The programmatic FTP client we use is Apache Commons Net's FTPClient.
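Per file, the round trip looks roughly like this (a simplified sketch, not our production code; the directory layout, the temp-file handling, and the process() step are placeholders):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import org.apache.commons.net.ftp.FTPClient;

public class FileRoundTrip {

    /**
     * Downloads one remote file, hands it to our processing step, and uploads the
     * result to an adjacent directory. "processed/" and process() are placeholders.
     */
    public static void roundTrip(FTPClient client, String remoteDir, String fileName) throws IOException {
        Path local = Files.createTempFile("ftp-", ".tmp");
        try {
            // Download the source file.
            try (OutputStream out = Files.newOutputStream(local)) {
                if (!client.retrieveFile(remoteDir + "/" + fileName, out)) {
                    throw new IOException("Download failed: " + client.getReplyString());
                }
            }

            Path processed = process(local); // stand-in for the real processing step

            // Re-export the result next to the source.
            try (InputStream in = Files.newInputStream(processed)) {
                if (!client.storeFile(remoteDir + "/processed/" + fileName, in)) {
                    throw new IOException("Upload failed: " + client.getReplyString());
                }
            }
        } finally {
            Files.deleteIfExists(local);
        }
    }

    private static Path process(Path input) {
        return input; // placeholder
    }
}
```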
The operation usually completes, but it always produces an enormous number of "Connection timeout" errors at the point where the client tries to establish the FTP control connection on port 21. The only reason all files end up transferred is that we have a very generous retry mechanism in place. The timeout we use for establishing the control connection is 60 seconds, which already seems higher than what FileZilla, for example, uses.
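To illustrate, the connection setup plus retry is roughly the following (a minimal sketch with placeholder host, credentials, and retry budget; our real retry policy is more generous):

```java
import java.io.IOException;
import org.apache.commons.net.ftp.FTP;
import org.apache.commons.net.ftp.FTPClient;
import org.apache.commons.net.ftp.FTPReply;

public class FtpConnector {

    // Placeholder values -- the real host and retry budget differ.
    private static final String HOST = "ftp.example.com";
    private static final int PORT = 21;
    private static final int CONNECT_TIMEOUT_MS = 60_000; // the 60 s control-connection timeout
    private static final int MAX_ATTEMPTS = 5;

    /** Opens and logs into a control connection, retrying on connect failures. */
    public static FTPClient connectWithRetry(String user, String password) throws IOException {
        IOException lastError = null;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            FTPClient client = new FTPClient();
            client.setConnectTimeout(CONNECT_TIMEOUT_MS);
            try {
                client.connect(HOST, PORT); // this is where the "Connection timeout" errors occur
                if (!FTPReply.isPositiveCompletion(client.getReplyCode())
                        || !client.login(user, password)) {
                    throw new IOException("FTP login refused, reply=" + client.getReplyCode());
                }
                client.enterLocalPassiveMode();           // passive data connections
                client.setFileType(FTP.BINARY_FILE_TYPE); // binary transfers
                return client;
            } catch (IOException e) {
                lastError = e;
                if (client.isConnected()) {
                    try { client.disconnect(); } catch (IOException ignored) { }
                }
            }
        }
        throw lastError;
    }
}
```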
What could be the culprit of these connection errors? It does not seem to be a hardware issue: neither reducing the number of client machines nor using a "load-balanced" FTP setup with 10 backend machines helped us get rid of the errors.
We have pretty much reached a state where the server is accessed by only two threads simultaneously, with as much connection reuse as possible, and the sporadic "Connection timeout" errors are still there. Has anyone run into the same problem?
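For reference, the reuse pattern is roughly the following (again a sketch; the credentials, directory, and file list are placeholders, and it reuses the connectWithRetry and roundTrip helpers sketched above):

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.commons.net.ftp.FTPClient;

public class TwoThreadTransfer {

    public static void main(String[] args) {
        List<String> files = List.of("a.bin", "b.bin"); // placeholder file list
        ExecutorService pool = Executors.newFixedThreadPool(2); // at most two concurrent sessions

        for (int i = 0; i < 2; i++) {
            final int worker = i;
            pool.submit(() -> {
                FTPClient client = null;
                try {
                    // One long-lived session per worker, reused for every file it handles.
                    client = FtpConnector.connectWithRetry("user", "secret");
                    client.setControlKeepAliveTimeout(30); // send NOOPs during long transfers
                    for (int j = worker; j < files.size(); j += 2) {
                        FileRoundTrip.roundTrip(client, "/inbox", files.get(j));
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                } finally {
                    if (client != null && client.isConnected()) {
                        try { client.logout(); client.disconnect(); } catch (IOException ignored) { }
                    }
                }
            });
        }
        pool.shutdown();
    }
}
```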