Score:0

Good technology for a large-scale batch operation on many S3 files in AWS Batch with Spot instances

I have an enormous corpus of text data stored in millions of files on S3. I often need to run some operation on every one of those files, where each run reads only that one file and writes a new file from it. Usually I use my company's Databricks for this, but it's so locked down that it's hard to deploy complex code there.

I've been considering using AWS Batch with Spot Instances as an alternative to Databricks for some of these jobs. I'd certainly want to use multiple nodes, because even the largest single node couldn't finish the work in a reasonable time frame. There are, of course, technologies like Apache Spark that are designed for distributed computing, but I'm (a) not confident in my ability to set up my own Spark cluster and (b) not convinced that Spark is necessary for such a simple distributed computing job. Fundamentally, all I need is for the nodes to communicate which files they are planning to work on, which they have finished, and when they shut down. It would be straightforward, if tedious, to maintain all that information in a database, and I have no need to translate all my data into another distributed filesystem.

Is there a good existing technology for this kind of use case?

Tim
You mentioned AWS Batch. What did your research tell you about whether it was suitable for your use case?
Zorgoth
Oh, good point. After looking that up, I realized that multi-node jobs aren't supported with Spot Instances. It seems I would be forced to submit multiple single-node jobs if I were to use it, which is somewhat less appealing.
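For what it's worth, the "multiple single-node jobs" route may need less coordination than it sounds. A rough sketch, assuming the work is submitted as an AWS Batch array job (which sets `AWS_BATCH_JOB_ARRAY_INDEX` in each child job) and that a manifest of object keys is available to every job: each child takes a fixed slice of the manifest, so no job has to announce what it is working on. `TOTAL_SHARDS`, `MANIFEST_PATH`, and `process_key` are hypothetical names made up for this illustration.

```python
import os


def shard(keys, index, total):
    """Return the keys owned by shard `index` out of `total` (round-robin)."""
    return keys[index::total]


def process_key(key):
    """Hypothetical placeholder: read the object, transform it, write the result."""
    print(f"would process {key}")


def main():
    # AWS Batch sets AWS_BATCH_JOB_ARRAY_INDEX for each child of an array job.
    index = int(os.environ.get("AWS_BATCH_JOB_ARRAY_INDEX", "0"))
    # Assumption: you pass the array size and manifest location yourself.
    total = int(os.environ.get("TOTAL_SHARDS", "1"))
    with open(os.environ["MANIFEST_PATH"]) as f:
        keys = [line.strip() for line in f if line.strip()]
    for key in shard(keys, index, total):
        process_key(key)


if __name__ == "__main__":
    main()
```

Since a Spot interruption can kill a child job partway through, it probably helps to make `process_key` idempotent (for example, skip keys whose output already exists) so a retried job can simply start its slice again.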
Score:1

Would S3 Batch Operations with Lambda be an option? You provide a list of the objects and a Lambda function to run on each one. S3 Batch Operations then invokes the Lambda function with each object key and gives you a report of the results.

https://docs.aws.amazon.com/lambda/latest/dg/services-s3-batch.html
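As a rough illustration, a handler for this could look something like the sketch below. It assumes the version 1.0 invocation schema (tasks carrying `taskId`, a URL-encoded `s3Key`, and `s3BucketArn`), that `boto3` is available in the Lambda runtime, and that outputs go to the same bucket. `OUTPUT_PREFIX`, `transform`, `process_object`, and `handle` are names invented for this example, and uppercasing the text is only a stand-in for the real per-file operation.

```python
from urllib.parse import unquote_plus

OUTPUT_PREFIX = "processed/"  # hypothetical destination prefix


def transform(text):
    """Placeholder for the real per-file operation."""
    return text.upper()


def process_object(bucket, key):
    """Read one object, transform it, and write the result under OUTPUT_PREFIX."""
    import boto3  # imported lazily so handle() can be exercised without AWS

    s3 = boto3.client("s3")
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    out_key = OUTPUT_PREFIX + key
    s3.put_object(Bucket=bucket, Key=out_key, Body=transform(body).encode("utf-8"))
    return out_key


def handle(event, process=process_object):
    """Run `process` on every task in the event and build the Batch Operations response."""
    results = []
    for task in event["tasks"]:
        bucket = task["s3BucketArn"].split(":::")[-1]
        key = unquote_plus(task["s3Key"])  # keys arrive URL-encoded
        try:
            code, message = "Succeeded", process(bucket, key)
        except Exception as exc:  # report the failure instead of crashing the invocation
            code, message = "PermanentFailure", str(exc)
        results.append(
            {"taskId": task["taskId"], "resultCode": code, "resultString": message}
        )
    return {
        "invocationSchemaVersion": "1.0",
        "treatMissingKeysAs": "PermanentFailure",
        "invocationId": event["invocationId"],
        "results": results,
    }


def lambda_handler(event, context):
    return handle(event)
```

One thing to weigh: Lambda has a 15-minute execution limit per invocation, so this only fits if each file can be processed well within that.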
