Suggestions for Non-Analytical Distributed Processing Frameworks


Can someone please suggest a tool, framework, or service to perform the task below faster?

Input: The input to the service is a CSV file with over a million rows, consisting of an identifier column and several image columns.

Objective: To check whether any image column in a row meets the minimum resolution, and to create a new boolean column recording the result for every row.

True - If any image in the row meets the minimum resolution

False - If no image in the row meets the minimum resolution
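To make the rule concrete, here is a minimal sketch of the per-row check. It assumes each image has already been reduced to a (width, height) pair, or None when the image is missing or unreadable, and that "meets the minimum resolution" means both dimensions are at least the minimum. The function name, the sample row ids, and the 1024x768 threshold are hypothetical, not taken from our actual script.

```python
from typing import Iterable, Optional, Tuple

# (width, height) in pixels, or None if the image is missing/unreadable.
# How the size is obtained (download, header parse, etc.) is out of scope here.
Size = Optional[Tuple[int, int]]


def row_meets_min_resolution(sizes: Iterable[Size], min_width: int, min_height: int) -> bool:
    """Return True if at least one image in the row is at least min_width x min_height."""
    return any(
        size is not None and size[0] >= min_width and size[1] >= min_height
        for size in sizes
    )


# Hypothetical rows: identifier -> sizes of the images in that row.
rows = {
    "a1": [(640, 480), (1920, 1080)],  # second image is large enough
    "a2": [(320, 240), None],          # one small image, one unreadable
    "a3": [],                          # no images at all
}
flags = {row_id: row_meets_min_resolution(s, 1024, 768) for row_id, s in rows.items()}
print(flags)  # {'a1': True, 'a2': False, 'a3': False}
```

Because `any` short-circuits, a row stops being examined as soon as one image passes.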

Current Implementation: A Python script using pandas and multiprocessing, running on a large VM (60-core CPU), which takes about 4-5 hours. Since this is a periodic task, we schedule and manage it with Cloud Workflow and a Celery backend.
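For context, the shape of the work is roughly the following: rows are independent, so they can be split into chunks and mapped over workers. This is a simplified sketch rather than our actual script; `chunked`, `process_chunk`, `run`, the threshold constants, and the sample data are all hypothetical, and image sizes are assumed to be already available as (width, height) pairs.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Tuple

MIN_WIDTH, MIN_HEIGHT = 1024, 768  # hypothetical threshold

Row = Tuple[str, List[Tuple[int, int]]]  # (row id, [(width, height), ...])


def chunked(rows: Iterable, size: int) -> Iterator[List]:
    """Yield consecutive lists of at most `size` items from `rows`."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def process_chunk(chunk: List[Row]) -> List[Tuple[str, bool]]:
    """Compute the boolean flag for every row in one chunk."""
    return [
        (row_id, any(w >= MIN_WIDTH and h >= MIN_HEIGHT for w, h in sizes))
        for row_id, sizes in chunk
    ]


def run(rows: Iterable[Row], chunk_size: int, map_fn: Callable = map) -> List[Tuple[str, bool]]:
    """Apply process_chunk to each chunk via map_fn and flatten the results in order."""
    results: List[Tuple[str, bool]] = []
    for part in map_fn(process_chunk, chunked(rows, chunk_size)):
        results.extend(part)
    return results


sample = [("a1", [(640, 480), (1920, 1080)]), ("a2", [(320, 240)]), ("a3", [])]
result = run(sample, chunk_size=2)  # serial here; chunks are independent of each other
print(result)
```

With the default `map_fn=map` this runs serially. Passing something like `multiprocessing.Pool(...).imap` as `map_fn` spreads the chunks across processes, and since no chunk depends on another, the same split would in principle work across separate short-lived workers.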

Note: We are looking to cut costs, as the server is only in use for about 4-6 hours a day. Keeping a 60-core machine running 24x7 would waste a lot of resources.

Options Explored:

  1. We have ruled out Cloud Run due to its memory, CPU, and timeout limitations.
  2. Apache Beam with Cloud Dataflow: there seems to be less support for non-analytical workloads, and the DataFrame implementation in Apache Beam still looks buggy.
  3. Spark and Dataproc seem to be good for analytical workloads, although a serverless option would be much preferred.

Which direction should I be looking into?


