
Storing 100 million files in the same "directory" under S3-compatible storage?

I have > 100 million image files (book covers) as a flat list of files under a single "directory":


A long time ago, these were stored on Amazon S3, and are now on Backblaze B2 (which is S3-compatible).

So far, this worked fine:

  • storing a new file is very quick;
  • retrieving an existing file is very quick.

I'm in the process of migrating once again, to iDrive E2 (S3-compatible as well).

I'm experimenting with moving them using rclone, but after 30 min of waiting for rclone copy to start, I realized that rclone does not start transferring files until it has received the whole file list.

The problem is:

  • a quick benchmark of rclone ls on the /images/ directory tells me that transferring the whole file list would take almost 10 hours
  • any problem during transfer (which will take many days) would restart from zero, forcing rclone to download the whole file list again
  • listing files costs money with B2

I tried configuring rclone to copy only a batch of files:

  • rclone copy "backblaze:/images/0000*", with or without *, does not find any file
  • rclone copy "backblaze:/images/" --include "/0000*" seems to download the whole file list as well, and filter on the client

Strangely, it looks like rclone has no problem retrieving from the server a list of files that are under a given "directory", for example /images/, but cannot do the same with a prefix, such as /images/0000.

I thought that S3, and by extension all S3-compatible storages, stored file paths as a flat structure, and that / was just a character like any other, and that you could easily list files under any prefix, ending or not with a /.

Am I mistaken?

In my next storage (E2), should I store files under sub-directories, such as images/0/0/0/0/, images/0/0/0/1/, etc., just like we did in the good old days of storing files in a traditional filesystem?

I realized that rclone does not start transferring files until it has received the whole file list.

This tells me that your problem is less the storage providers and more rclone itself. A tool that streams the file list and transfers files in chunks as the entries arrive would be more appropriate than one that needs the entire file list before it starts.
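To make the "chunk as it arrives" idea concrete, here is a minimal sketch in Python. It is not how rclone works, and `batches`, `copy_in_batches` and `copy_batch` are hypothetical names chosen for illustration. The point is only that a lazily produced listing can be cut into fixed-size batches, so copying can begin once the first batch is full, and the last key handled can serve as a resume checkpoint.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Optional


def batches(keys: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield lists of up to `size` keys, pulling from `keys` lazily."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(keys)
    while True:
        # islice consumes at most `size` items, so the rest of the
        # listing is not requested until this batch has been handled.
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def copy_in_batches(
    keys: Iterable[str],
    size: int,
    copy_batch: Callable[[List[str]], None],
) -> Optional[str]:
    """Pass each batch to `copy_batch` (caller-supplied, hypothetical).

    Returns the last key handled, or None if the listing was empty.
    """
    last_key = None
    for chunk in batches(keys, size):
        copy_batch(chunk)
        last_key = chunk[-1]  # could be persisted as a checkpoint
    return last_key
```

If the checkpoint is written to disk after every batch, an interrupted run only has to redo the batch that was in flight rather than the whole listing.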

I thought that S3, and by extension all S3-compatible storages, stored file paths as a flat structure,

That's definitely how S3 does it, which broke my file-server admin brain when I first ran into it. Since the issues here seem to be metadata-related rather than a matter of file layout, the layout likely doesn't matter.
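As an illustration of the flat namespace: the S3 ListObjectsV2 call takes an arbitrary `Prefix` string, which does not need to end in `/`, plus an optional `StartAfter` key to resume a listing. The sketch below assumes `client` is a boto3 S3 client (for example one created with `boto3.client("s3", endpoint_url=...)` pointing at the provider). The function name `iter_keys` and the key names in the usage note are hypothetical, and I have not verified how completely B2 or E2 implement these parameters.

```python
from typing import Iterator, Optional


def iter_keys(
    client,
    bucket: str,
    prefix: str,
    start_after: Optional[str] = None,
) -> Iterator[str]:
    """Yield object keys that start with `prefix`, one page at a time.

    `client` is assumed to behave like a boto3 S3 client.
    """
    params = {"Bucket": bucket, "Prefix": prefix}
    if start_after:
        # Resume the listing after a key that was already processed.
        params["StartAfter"] = start_after
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(**params):
        # "Contents" is absent when a page holds no objects.
        for obj in page.get("Contents", []):
            yield obj["Key"]
```

For example, `iter_keys(client, "my-bucket", "images/0000")` would ask the server only for keys beginning with `images/0000`, so the filtering happens on the server side rather than on the client.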
