Score:2

How to *actually* exclude a directory in AWS S3 sync?

DMJ

The aws s3 sync command has an --exclude flag that lets you exclude a folder from the sync. However, even though files from that folder are not uploaded, the command still walks and processes every file in it. I wanted to exclude the folder precisely because it is very large: the data I actually want to sync is just a few MB in the parent folder and a few other subfolders, yet syncing those few MB takes several minutes because of the several GB in that data subfolder. Is there a way to truly exclude that subfolder (i.e. keep it from even being looked at or processed) so that the sync command completes in a reasonable amount of time?

Score:3

I think this may be a case of mismatched expectations regarding what functionality S3 provides.

S3 does not actually have any directory structure: a bucket just holds a flat set of objects, and the full string that might be seen as the "path" is the key of each object.
The ListObjectsV2 API action however provides features like specifying a prefix (only returns objects that have a key that starts with some particular string) and the option of specifying a delimiter (splits keys by the provided delimiter and groups repeating key segments) that allow you to present the contents of a bucket as if it had structure (like what the AWS Console does, for instance).
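To make the prefix and delimiter behaviour concrete, here is a small local emulation in plain shell and awk. It makes no AWS calls at all; the key names and the `list_keys` helper are made up for illustration, and the real API returns structured results rather than these text lines.

```shell
#!/bin/sh
# Local emulation only: no AWS calls are made and the keys are hypothetical.
# A bucket is modelled as a flat list of keys, one per line.
keys='report.txt
subfolder1/a.csv
subfolder1/deep/b.csv
subfolder4/huge-0001.bin
subfolder4/huge-0002.bin'

# list_keys PREFIX [DELIMITER]
# Mimics the idea behind prefix/delimiter listing:
#   - only keys that start with PREFIX are considered
#   - if DELIMITER occurs after the prefix, the key is rolled up
#     into a single "PREFIX ..." group instead of being listed
list_keys() {
    printf '%s\n' "$keys" | awk -v p="$1" -v d="$2" '
        substr($0, 1, length(p)) != p { next }   # key does not start with prefix
        {
            rest = substr($0, length(p) + 1)
            i = (d == "") ? 0 : index(rest, d)
            if (i > 0) {
                group = p substr(rest, 1, i)
                if (!(group in seen)) { seen[group] = 1; print "PREFIX " group }
            } else {
                print "KEY " $0
            }
        }'
}

# "Top level" view, similar in spirit to what a console would show
list_keys "" "/"

# Everything under one prefix, with no grouping
list_keys "subfolder1/" ""
```

Note that the only server-side filter in this model is "starts with this string", which is why a pattern such as "everything except subfolder4" cannot be expressed as a single listing request.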

The aws s3 sync utility presumably also starts working from the normal ListObjectsV2 API action, but this API does not have any functionality equivalent to the --exclude (or --include) options in the sync utility, only the option of getting the list filtered by key prefix.
Hence it would appear that the sync utility has to do the processing of those more flexible filtering options on the client side as it processes the full list of objects for the specified prefix, which will never really be efficient if there is a high number of objects under the specified prefix which are supposed to be skipped.

What you probably want to do in your scenario is specify the prefix or prefixes that you want, rather than specifying a more generic prefix and filtering out what you don't want. If what you want is not identifiable by prefix, you may want to consider changing your naming so that there is some known prefix that you can specify. (Or possibly even using separate buckets for different types of data, if that makes more sense for your situation.)
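As a rough sketch of the "specify the prefixes you want" approach for the download direction, the helper below runs one `aws s3 sync` per wanted prefix, so listing is only requested for those prefixes. The function name `sync_prefixes`, the bucket name, and the paths are placeholders I made up; only `aws s3 sync SOURCE DEST` itself is taken from the CLI.

```shell
#!/bin/sh
# sync_prefixes BUCKET DEST PREFIX...
# Hypothetical helper: syncs each named prefix separately instead of
# syncing the whole bucket and filtering with --exclude afterwards.
sync_prefixes() {
    bucket=$1
    dest=$2
    shift 2
    for prefix in "$@"; do
        # Trailing slash so that only keys under "PREFIX/" are considered
        aws s3 sync "s3://$bucket/$prefix/" "$dest/$prefix/"
    done
}
```

It could then be called as, for example, `sync_prefixes example-bucket /home/path/to/dest subfolder1 subfolder2`, simply leaving the large prefix out of the list.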

DMJ
I see, that does make some sense in the context of a download from S3. It makes less sense in the context of an upload though: the filesystem I am uploading from does have an actual defined structure, even if S3 is just key-value pairs. I suppose perhaps in this situation it's just a lack of optimization for this use case?
@DMJ Right, if you are specifically looking at the "upload from local filesystem to S3" case, I suppose it doesn't matter that much that the opposite case is even more problematic to optimize. An additional concern that also applies to the local case is that `--exclude` can be a pattern written to match any part of the path. So while a pattern that essentially matches a leading directory seems like it could be optimized for the local filesystem, the general case still requires looking recursively at all the files locally. I can imagine that they have not optimized for that special case.
Score:0
DMJ

While the answer by Håkan Lindqvist appears to be the technically correct answer, it unfortunately did not solve the problem. Syncing (uploading) a few MB was taking as much as 30 minutes because of a large subfolder that was being excluded anyway. Since the AWS CLI doesn't appear to natively support the functionality I needed, I turned to another tool instead: a shell script.

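#!/bin/sh

# Upload only the files that sit directly in the parent folder.
# A plain "*" also matches subfolders (and "*.*" would miss files
# without a dot in their name), so skip anything that is not a regular file.
for localfile in /home/path/to/source/files/*
do
    [ -f "$localfile" ] || continue
    aws s3 cp "$localfile" s3://path/to/bucket/
done

# Sync each wanted subfolder individually.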

aws s3 sync /home/path/to/source/files/subfolder1 s3://path/to/bucket/subfolder1
aws s3 sync /home/path/to/source/files/subfolder2 s3://path/to/bucket/subfolder2
aws s3 sync /home/path/to/source/files/subfolder3 s3://path/to/bucket/subfolder3
# Deliberately skipping subfolder4
aws s3 sync /home/path/to/source/files/subfolder5 s3://path/to/bucket/subfolder5
aws s3 sync /home/path/to/source/files/subfolder6 s3://path/to/bucket/subfolder6
aws s3 sync /home/path/to/source/files/subfolder7 s3://path/to/bucket/subfolder7
aws s3 sync /home/path/to/source/files/subfolder8 s3://path/to/bucket/subfolder8
aws s3 sync /home/path/to/source/files/subfolder9 s3://path/to/bucket/subfolder9
aws s3 sync /home/path/to/source/files/subfolder10 s3://path/to/bucket/subfolder10

Although this approach solved the issue I was having in my particular circumstance, it is not without downsides:

  • The aws s3 cp command always uploads the file, even if it hasn't changed since last time.
  • Running the aws s3 cp command in a for loop seems noticeably slower to me than the aws s3 sync command generally is under normal circumstances.
  • Based on Håkan Lindqvist's answer, I'm not sure this approach would do anything to help someone who was downloading rather than uploading.
  • Not cross-platform. (This wouldn't work on Windows. Fortunately for me, I am on Linux.)

Despite the drawbacks, in my circumstances this is more than an order of magnitude faster than using aws s3 sync with the --exclude flag, so I'm content. I do hope Amazon provides a better option in the future though.
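One possible refinement, sketched here under the same assumptions as the script above: instead of listing every subfolder by hand, loop over the entries of the source folder and skip the one large subfolder by name. The function name `sync_except` and its arguments are hypothetical. It keeps the same limitations (top-level files are still re-uploaded with `aws s3 cp` every time), and because the shell's `*` does not match names starting with a dot, hidden files and folders are ignored.

```shell
#!/bin/sh
# sync_except SRC DEST SKIP
# Hypothetical helper: uploads the top-level files of SRC with "aws s3 cp"
# and syncs every immediate subfolder with "aws s3 sync", except the one
# named SKIP. The skipped folder is never handed to the AWS CLI, so its
# contents are never walked.
sync_except() {
    src=$1
    dest=$2
    skip=$3
    for entry in "$src"/*; do
        name=${entry##*/}                 # last path component
        if [ -d "$entry" ]; then
            [ "$name" = "$skip" ] && continue
            aws s3 sync "$entry" "$dest/$name"
        elif [ -f "$entry" ]; then
            aws s3 cp "$entry" "$dest/"
        fi
    done
}
```

With the paths from the script above, the call would look like `sync_except /home/path/to/source/files s3://path/to/bucket subfolder4`. New subfolders are then picked up automatically, which may or may not be what you want.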
