I have an enormous corpus of text data stored in millions of files on S3. It's very common that I want to perform some operation on every one of those files, where each operation reads only that one file and writes a new file from it. Usually I use my company's Databricks for this, but it's so locked down that it's hard to deploy complex code there.
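Concretely, each per-file task looks something like this sketch (bucket names and the transform itself are placeholders, not real code I'm running):

```python
import boto3

s3 = boto3.client("s3")

def process_one(key: str) -> None:
    # Read exactly one input file from S3...
    body = s3.get_object(Bucket="my-input-bucket", Key=key)["Body"].read()
    # ...apply some transformation that depends on nothing else...
    result = transform(body)  # hypothetical per-file function
    # ...and write exactly one output file.
    s3.put_object(Bucket="my-output-bucket", Key=key, Body=result)
```

So the work is embarrassingly parallel: no task ever needs the output of another.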
I've been considering AWS Batch with Spot Instances as an alternative to Databricks for some of these jobs. I'd certainly want multiple nodes, because even the largest single node couldn't finish the work in a reasonable time frame. There are, of course, technologies like Apache Spark that are designed for distributed computing, but I'm (a) not confident in my ability to set up my own Spark cluster and (b) not convinced that Spark is necessary for such a simple distributed computing job.

Fundamentally, all I need is for the nodes to communicate which files they plan to work on, which ones they have finished, and when they shut down. It would be straightforward, if tedious, to maintain all that bookkeeping in a database (a sketch of what I mean follows), and I have no need to move my data into another distributed filesystem.
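For example, this is roughly the coordination I'd be hand-rolling: each worker claims a file with a conditional write so two nodes never process the same key. The table name, attribute names, and status values here are all my own invention, just to illustrate the idea:

```python
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("file-progress")  # hypothetical table

def try_claim(key: str, worker_id: str) -> bool:
    """Atomically claim one S3 key; returns False if another worker beat us to it."""
    try:
        table.put_item(
            Item={"s3_key": key, "status": "in_progress", "worker": worker_id},
            # Succeeds only if no row for this key exists yet.
            ConditionExpression="attribute_not_exists(s3_key)",
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # someone else already claimed this file
        raise

def mark_done(key: str) -> None:
    """Record that this file's output has been written."""
    table.update_item(
        Key={"s3_key": key},
        # "status" is a DynamoDB reserved word, hence the name alias.
        UpdateExpression="SET #s = :done",
        ExpressionAttributeNames={"#s": "status"},
        ExpressionAttributeValues={":done": "finished"},
    )
```

Writing and operating all of that myself is exactly the tedium I'd like an existing tool to absorb.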
Is there a good existing technology for this kind of use case?