I have an enormous corpus of text data stored in millions of files on S3. It's very common that I want to perform some operation on every one of those files, where each operation reads only that one file and writes a new file from it. Usually I use my company's Databricks for this, but it's so locked down that it's hard to deploy complex code there.
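Concretely, each per-file task looks something like this sketch (bucket names and the transform itself are placeholders, not real code I'm running):

```python
import boto3

s3 = boto3.client("s3")

def process_one(key: str) -> None:
    # Read exactly one input file from S3...
    body = s3.get_object(Bucket="my-input-bucket", Key=key)["Body"].read()
    # ...apply some transformation that depends on nothing else...
    result = transform(body)  # hypothetical per-file function
    # ...and write exactly one output file.
    s3.put_object(Bucket="my-output-bucket", Key=key, Body=result)
```

So the work is embarrassingly parallel: no task ever needs the output of another.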
I've been considering AWS Batch with Spot Instances as an alternative to Databricks for some of these jobs. I'd certainly want multiple nodes, because even the largest single node couldn't finish the work in a reasonable time frame. There are, of course, technologies like Apache Spark that are designed for distributed computing, but I'm (a) not confident in my ability to set up my own Spark cluster and (b) not convinced that Spark is necessary for such a simple distributed computing job.

Fundamentally, all I need is for the nodes to communicate which files they plan to work on, which ones they have finished, and when they shut down. It would be straightforward, if tedious, to maintain all that bookkeeping in a database (a sketch of what I mean follows), and I have no need to move my data into another distributed filesystem.
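For example, this is roughly the coordination I'd be hand-rolling: each worker claims a file with a conditional write so two nodes never process the same key. The table name, attribute names, and status values here are all my own invention, just to illustrate the idea:

```python
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("file-progress")  # hypothetical table

def try_claim(key: str, worker_id: str) -> bool:
    """Atomically claim one S3 key; returns False if another worker beat us to it."""
    try:
        table.put_item(
            Item={"s3_key": key, "status": "in_progress", "worker": worker_id},
            # Succeeds only if no row for this key exists yet.
            ConditionExpression="attribute_not_exists(s3_key)",
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # someone else already claimed this file
        raise

def mark_done(key: str) -> None:
    """Record that this file's output has been written."""
    table.update_item(
        Key={"s3_key": key},
        # "status" is a DynamoDB reserved word, hence the name alias.
        UpdateExpression="SET #s = :done",
        ExpressionAttributeNames={"#s": "status"},
        ExpressionAttributeValues={":done": "finished"},
    )
```

Writing and operating all of that myself is exactly the tedium I'd like an existing tool to absorb.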
Is there a good existing technology for this kind of use case?