Large tmp file transfer hangs in nginx php-fpm proxy

We have a server that's running a WordPress site with a serverpilot-installed nginx stack on Ubuntu 20 LTS.

Very large uploads appear to get stuck in the handoff between the nginx proxy and PHP, and I have come to the end of what I know how to troubleshoot without just poking at it to see what happens (which is rarely a good use of time or way to move forward). So far as I can tell, I have increased the necessary timeouts and raised all the disk limits, but I am obviously still missing something.

For our use-case, we need to allow for uploads up to 50GB, which we have working without issue in a staging environment that runs a standard LAMP stack. We have no issues with files under ~2GB, but anything over that may or may not fail based on some criteria I haven't been able to track down. While troubleshooting the issue at an earlier time, the time at which the file copy stopped seemed arbitrary (with success up to 10GB), but with the current configuration (see below), it's consistently stopping when the PHP tmp file reaches 2GB.

When we are uploading very large files, we can watch the incoming temp file consume disk space until the entire file exists in /mnt/tmp/[nginx_tmp_path] (yes, there is plenty of available disk space). After that, we can see the file being copied to the php tmp path, but after a few seconds, the php tmp file quits growing in size and the copy process hangs. Eventually, one of the 600 second timeouts is reached, and there are logged errors (see below). In this screenshot, we have a completed (from the browser/end-user perspective) 14.8GB upload that hung on the tmp file transfer at about 2GB. [Screenshot: PHP tmp file stops growing]
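The "watch the tmp file until it stops growing" step can be made repeatable with a small poller. This is a minimal sketch, not part of the original setup: the function names `sample_sizes` and `first_stall` are hypothetical, and the path passed in would be whichever PHP tmp file appears under /mnt/tmp during the upload.

```python
import os
import time


def sample_sizes(path, interval_s=1.0, count=10):
    """Return `count` size readings (bytes) of `path`, taken `interval_s` seconds apart."""
    sizes = []
    for i in range(count):
        try:
            sizes.append(os.path.getsize(path))
        except FileNotFoundError:
            # The temp file may not exist yet, or may already have been cleaned up.
            sizes.append(None)
        if i < count - 1:
            time.sleep(interval_s)
    return sizes


def first_stall(sizes, min_repeats=3):
    """Return the size at which growth stopped, or None if no stall was seen.

    A stall is `min_repeats` consecutive identical readings of an existing file.
    """
    run = 1
    for prev, cur in zip(sizes, sizes[1:]):
        if cur is not None and cur == prev:
            run += 1
            if run >= min_repeats:
                return cur
        else:
            run = 1
    return None
```

Calling `first_stall(sample_sizes(path, 1.0, 60))` on the PHP tmp file while the copy is in progress would give the exact byte count at which it stopped, which is easier to compare across runs than a screenshot.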

At the end of the 600 second period, this is in the Apache error log:

    (70008)Partial results are valid but processing is incomplete: [client <MY_IP>] AH01075: Error dispatching request to : (reading input brigade), referer: https://<THE_URL>/wp-admin/media-new.php

And the nginx error log said this:

    2512066#0: *9 upstream timed out (110: Connection timed out) while sending request to upstream, client: <MY_IP>, server: <INTERNAL_IP>, request: "POST /wp-admin/async-upload.php HTTP/2.0", upstream: "http://127.0.0.1:81/wp-admin/async-upload.php", host: "<THE_URL>", referrer: "https://<THE_URL>/wp-admin/media-new.php"

It is important to note that these messages do not appear in the log file until 10 minutes after the upload initially arrives completely at the server, which is 9 or more minutes after the file copy appears to be paused/hung. I am convinced that something is causing the file copy to get stuck, and then eventually a timeout is reached - I do not believe that the issue is with the timeout itself. During the interim between the file copy stopping and the timeout error appearing in the log file, there is no increased or unusual activity on the server, and all services function and respond as expected.

With the current configuration, the PHP tmp file always grows to 2097152 KB (according to du -a), which is exactly 2 GiB. That makes me believe I am hitting a built-in file size limit that I haven't uncovered yet.
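The arithmetic behind that suspicion, as a quick check. It assumes `du` is reporting in its default 1 KiB units; `du` also counts allocated blocks rather than the apparent file size, so the real byte count could differ slightly. Landing on this boundary is suggestive of a signed 32-bit size or offset somewhere in the chain, but it is not proof of one.

```python
KIB = 1024

observed_kib = 2_097_152             # size of the PHP tmp file reported by `du -a`
observed_bytes = observed_kib * KIB  # assumes du's default 1 KiB units
int32_max = 2**31 - 1                # largest value a signed 32-bit integer can hold

print(observed_bytes)                # 2147483648
print(observed_bytes == 2**31)       # True
print(observed_bytes - int32_max)    # 1
```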

The relevant nginx config options, all in the server context, are:

    proxy_connect_timeout 600;
    proxy_read_timeout 600;
    proxy_send_timeout 600;
    proxy_max_temp_file_size 51200m;

    fastcgi_connect_timeout 600;
    fastcgi_read_timeout 600;
    fastcgi_send_timeout 600;
    fastcgi_request_buffering off;

    keepalive_timeout 600;
    send_timeout 600;

    client_max_body_size 0;
    client_body_temp_path /mnt/tmp;
    client_body_in_file_only clean;

Apache's VirtualHost configuration:

    RequestReadTimeout header=0 body=0
    Timeout 3600
    ProxyTimeout 3600

And finally, PHP config:

    memory_limit = -1
    max_execution_time = 0
    max_input_time = -1
    post_max_size = 50G
    upload_max_filesize = 50G
    default_socket_timeout = -1
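As a sanity check that the configured size limits agree with each other and sit far above the observed stall point, the values can be converted to bytes. This is a sketch under the assumption that both nginx and PHP read the k/m/g suffixes as powers of 1024; `to_bytes` is a hypothetical helper, not part of either product.

```python
UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3}


def to_bytes(value):
    """Convert a size such as '50G' or '51200m' to bytes (1024-based suffixes assumed)."""
    text = value.strip().lower()
    if text and text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)


limits = {
    "post_max_size": to_bytes("50G"),                    # PHP
    "upload_max_filesize": to_bytes("50G"),              # PHP
    "proxy_max_temp_file_size": to_bytes("51200m"),      # nginx
}
stall_bytes = 2_097_152 * 1024  # the 2097152 KB stall point, assuming 1 KiB units

for name, size in limits.items():
    # Each configured limit should be well above the point where the copy stalls.
    print(name, size, size > stall_bytes)
```

Under that assumption all three limits come out to the same 53687091200 bytes, so none of them explains a stall at 2 GiB.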

I am at a loss as to what could be causing the symptom I am seeing. Any pointers are appreciated!

Additional notes: The symptoms make me feel that it is not relevant, but in case it is... The site runs through WP Rocket, but there is no external proxy service like CloudFlare.

UPDATE: I added these lines to the nginx config:

    proxy_http_version 1.1;
    proxy_set_header Connection "";

This changed the symptom slightly, but did not fix the problem. With this change, behavior has reverted to what I explained previously and the file transfer stalls at an unpredictable spot. The first example below stopped at 3823176 KB and the second stopped at 3264364 KB. The reason for the difference in behavior does not make sense to me, but is worth reporting. [screenshot] [screenshot]

UPDATE 2: I have been able to definitively pin this to being an issue in the handoff of the tmp file between nginx and php, but I can't seem to nail down the specific thing that is causing the process to hang.

We can skip the nginx proxy and use only PHP's tmp by adding these lines to the nginx config:

    proxy_buffering off;
    proxy_request_buffering off;

With this configuration, files go directly into /mnt/tmp/php<RND_STR> and when the upload is complete, our application properly picks the file out of tmp and completes its tasks.

However, this slows uploads to roughly 1/3 the available bandwidth, so it isn't a good solution. It does, however, prove that this is not an application issue.

So this is what is happening:

  1. User uploads a large file (50GB is our use-case maximum)
  2. The file arrives in the nginx tmp location in its entirety
  3. An attempt is made to copy the file from nginx tmp into PHP tmp. The copy stalls within a few seconds at an unpredictable point, somewhere between 3GB and 10GB. [3b] At this point, we can see both tmp files; the PHP tmp file should keep growing until it matches the size of the nginx tmp file, but it does not. Both files sit as-is until one of the 600 second timeouts is reached (see above), then an error appears in the log files and both tmp files disappear. [3c] If the file is under 3GB, it works every time. If the file is over 3GB, it works only sometimes: the smaller the file, the more likely it is to work.
  4. Bypassing nginx's tmp works entirely as expected, except that uploads are slow.
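To narrow down the size at which step 3 starts failing, it may help to upload test files of exactly known sizes rather than whatever large files happen to be at hand. A minimal sketch, assuming the client machine's filesystem supports sparse files; `make_test_file` is a hypothetical helper, and because the resulting file is all zero bytes it may behave differently from real media if anything in the path compresses data.

```python
import os


def make_test_file(path, size_bytes):
    """Create a file of exactly `size_bytes` bytes without writing the data out.

    On filesystems with sparse-file support this uses very little disk space,
    but an upload of the file still transfers every byte.
    """
    with open(path, "wb") as f:
        f.truncate(size_bytes)
    return os.path.getsize(path)
```

For example, `make_test_file("upload-3g.bin", 3 * 1024**3)` would produce a 3 GiB test file; repeating the upload with a few sizes on either side of the suspected threshold would show whether the failure tracks a specific byte count.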

Something is definitely getting hung up during the tmp file handoff between nginx and PHP when the files are obnoxiously huge, and I would love to figure out what it is.

Michael Hampton: It's worth asking why the PHP application is taking more than 600 seconds to process the file.

Chris Ostmo: I increased all of the timeouts because I expect the handoff of a 50GB file to take more than the default 60 seconds. It's not the processing of PHP that will time out - the handoff of the temp file will time out the proxy. I've admittedly opened up too wide in my quest for a solution.

Chris Ostmo: ... However, PHP isn't taking 600 seconds to process the file - something stalls a few seconds into the file copy, and 600 seconds is simply when it stops waiting and throws an error. Our PHP application doesn't get a chance to start working on the file until it's moved out of `tmp`, which isn't a state we're reaching. (sorry, tried to edit my original comment, but was too late)

Michael Hampton: Hmm. Well, all I can really tell from the errors posted is that nginx appears to have passed it off to Apache successfully, and Apache failed to pass it off to PHP. For me this makes the prime suspect the PHP code that handles the upload. You should get the app's developer involved. But some other things do look a little odd. You've mounted separate filesystems for the temporary files? What filesystem type are they?

Chris Ostmo: I'm the developer, and this error happens from a vanilla WordPress media upload dialog, so there isn't any code at play here that isn't used by a whole lot of people. The file never completely finishes a handoff from nginx's tmp to PHP's tmp, so I don't believe that the application can possibly be interfering. Yes, the `tmp` folder is mounted on its own ext4 partition. Also, the changing behavior when I change server configuration isn't consistent with application error.