Score:0

How to download millions of S3 files and compress them on the fly?


I have an S3 bucket with millions of files, and I want to download all of them. Since I don't have enough storage, I would like to download them, compress them on the fly and only then save them. How do I do this?

To illustrate what I mean: `aws s3 cp --recursive s3://bucket | gzip > file`

Hennes: Instead of `> file` you can probably use netcat (pipe through `nc`).
Tim: A couple of ideas: 1) Mount S3 as a drive (google it) and zip it from there. 2) Get a spot instance, then download and zip there. Make sure you're using an S3 gateway endpoint in your VPC to reduce costs.
You could also write a lambda that takes a path from S3 and gzips the contents then returns the gzipped file. Then you could use the `aws` CLI to list the files and send requests to the lambda.
John Rotenstein: "Download" to where? To an Amazon EC2 instance, or your own computer?
Score:0

It's not clear if you want to keep the uncompressed objects in S3 or if the bucket contents are still changing.

One option is S3 Inventory. It's not instant, but it automatically generates a list of the objects in the bucket and writes it to an S3 bucket (the same one or another). You could read this list into a small script (in whatever language you are comfortable with) and have it work through one object at a time: use the AWS CLI to pull down the object, then compress it with the OS or script tools.
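As a rough illustration of the "one object at a time" step, here is a minimal Python sketch that streams a single object through gzip straight to disk, so the uncompressed data never has to fit on local storage. It swaps the CLI for boto3, which is my assumption, not something the setup above requires. The function name `download_gzipped`, the `.gz` naming, and the output layout are made up for the example; `s3` is expected to be a boto3 S3 client, and the keys would come from wherever you parsed the inventory report.

```python
import gzip
import os
import shutil


def download_gzipped(s3, bucket, key, out_dir):
    """Stream one S3 object through gzip into out_dir/<key>.gz and return that path.

    `s3` is assumed to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    # Hypothetical layout: mirror the key under out_dir and append ".gz".
    dest = os.path.join(out_dir, key.lstrip("/") + ".gz")
    os.makedirs(os.path.dirname(dest), exist_ok=True)

    # get_object returns a streaming body that can be read in chunks.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"]
    with gzip.open(dest, "wb") as gz:
        # Copy 1 MiB at a time; only one chunk is held in memory.
        shutil.copyfileobj(body, gz, length=1024 * 1024)
    return dest


# Possible usage (assumes boto3 is installed and credentials are configured):
# import boto3
# s3 = boto3.client("s3")
# for key in keys_from_inventory:  # hypothetical iterable of object keys
#     download_gzipped(s3, "my-bucket", key, "/data/compressed")
```

With millions of small objects the per-request overhead will dominate, so running several of these in parallel is probably worth considering.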

I strongly recommend building in something that checks if the compressed object already exists so you can restart the process if it fails or new objects are added without having to process everything again.
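One way to sketch that restart check in Python: skip any key whose compressed file is already on disk, and write to a temporary name first so an interrupted run never leaves a half-written file that looks finished. Everything here is illustrative. `compress_if_missing`, the `.part` suffix, and the `open_source` callback (anything that returns a readable, closable stream for a key, such as a wrapper around a boto3 `get_object` body) are names I made up.

```python
import contextlib
import gzip
import os
import shutil


def compress_if_missing(key, out_dir, open_source):
    """Write out_dir/<key>.gz unless it already exists.

    Returns True if the object was processed, False if it was skipped.
    `open_source(key)` is a hypothetical callback returning a readable stream.
    """
    dest = os.path.join(out_dir, key.lstrip("/") + ".gz")
    if os.path.exists(dest):
        return False  # finished in an earlier run

    os.makedirs(os.path.dirname(dest), exist_ok=True)
    tmp = dest + ".part"  # partial output never carries the final name
    with contextlib.closing(open_source(key)) as src, gzip.open(tmp, "wb") as gz:
        shutil.copyfileobj(src, gz)
    os.replace(tmp, dest)  # atomic rename on the same filesystem
    return True
```

Note that this only checks for existence. If objects in the bucket can be overwritten, you would also need to compare something like the last-modified time or ETag from the inventory.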

If you are writing the compressed objects back to S3, consider using an EC2 instance or Lambda. With Lambda you may need to use a file stream to compress the file on the fly rather than pulling it down. You should be able to find examples of this for at least Python, if not other supported languages.
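For the write-back case, a possible Python sketch of compressing on the fly without touching local disk: wrap the download stream in a small reader that gzips as it is read, and hand that to boto3's `upload_fileobj`. `GzipStream` and `recompress_object` are hypothetical names, `s3` is assumed to be a boto3 S3 client, and the `.gz` destination key is just a convention chosen for the example.

```python
import io
import zlib


class GzipStream(io.RawIOBase):
    """Readable stream that yields the gzip-compressed bytes of `source`."""

    def __init__(self, source, chunk_size=1024 * 1024):
        self._source = source
        self._chunk_size = chunk_size
        self._compressor = zlib.compressobj(wbits=31)  # wbits=31 selects gzip framing
        self._buffer = b""
        self._finished = False

    def readable(self):
        return True

    def readinto(self, b):
        # Pull from the source until some compressed output is available.
        while not self._buffer and not self._finished:
            chunk = self._source.read(self._chunk_size)
            if chunk:
                self._buffer = self._compressor.compress(chunk)
            else:
                self._buffer = self._compressor.flush()
                self._finished = True
        n = min(len(b), len(self._buffer))
        b[:n] = self._buffer[:n]
        self._buffer = self._buffer[n:]
        return n  # 0 signals end of stream


def recompress_object(s3, src_bucket, key, dest_bucket):
    """Stream src_bucket/key through gzip into dest_bucket/<key>.gz."""
    body = s3.get_object(Bucket=src_bucket, Key=key)["Body"]
    dest_key = key + ".gz"
    s3.upload_fileobj(GzipStream(body), dest_bucket, dest_key)
    return dest_key
```

Memory use stays around the chunk size plus whatever the upload buffers, which is what makes this shape a reasonable fit for Lambda. The function timeout still limits how large a single object can be.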

A word of caution: do a rough calculation of how much this is going to cost. GET requests are fairly cheap, but data transfer out can be expensive. Also, if you are using any storage class other than Standard, it will probably have a retrieval cost associated with it.
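The rough calculation could look like the following Python sketch. It only adds up the three cost components mentioned above. All prices are parameters that you would fill in from the current S3 pricing page for your region and storage class. The function name and parameter names are made up, and no real prices are assumed here.

```python
def estimate_download_cost(
    object_count,
    total_gb,
    price_per_1000_gets,
    price_per_gb_out,
    price_per_gb_retrieval=0.0,
):
    """Back-of-the-envelope cost of downloading a whole bucket once.

    All prices are inputs in the same currency; look them up yourself.
    """
    request_cost = object_count / 1000 * price_per_1000_gets  # one GET per object
    transfer_cost = total_gb * price_per_gb_out               # data transfer out
    retrieval_cost = total_gb * price_per_gb_retrieval        # non-Standard classes only
    return request_cost + transfer_cost + retrieval_cost
```

Object count and total size can be taken from the inventory report or the bucket's storage metrics.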
