Can someone please suggest a tool, framework, or service that would perform the task below faster?
Input: A CSV file with over a million rows, each consisting of an identifier and several image columns.
Objective: To check whether any image column in a row meets the minimum resolution, and to add a new boolean column recording the result for each row.
True - If any image in the row meets the minimum resolution
False - If no image in the row meets the minimum resolution
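To make the rule above concrete, here is a minimal sketch of the per-row check. It assumes each image has already been reduced to a (width, height) pair, that "meets the minimum resolution" means both dimensions are at least the threshold, and that a missing or unreadable image counts as not meeting it. The names `meets_min_resolution` and `row_passes` are made up for illustration and are not from the actual script.

```python
from typing import Iterable, Optional, Tuple

# (width, height) in pixels, or None when the image is missing or unreadable.
Size = Optional[Tuple[int, int]]


def meets_min_resolution(size: Size, min_width: int, min_height: int) -> bool:
    """Return True if one image is at least min_width x min_height.

    Assumption: both dimensions must reach the threshold.
    """
    if size is None:
        return False
    width, height = size
    return width >= min_width and height >= min_height


def row_passes(sizes: Iterable[Size], min_width: int, min_height: int) -> bool:
    """Return True if any image in the row meets the minimum resolution."""
    # any() stops at the first image that qualifies, so the remaining
    # images in the row are not examined.
    return any(meets_min_resolution(s, min_width, min_height) for s in sizes)
```

If `sizes` is passed as a lazy generator that fetches each image on demand, the short-circuit in `any()` also avoids fetching the rest of the row once one image qualifies.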
Current Implementation: A Python script using pandas and multiprocessing, running on a large VM (60-core CPU), which takes about 4-5 hours. Since this is a periodic task, we schedule and manage it with Cloud Workflow and a Celery backend.
Note: We are looking to cut costs, as the server is only in use for about 4-6 hours a day. Keeping a 60-core machine running 24/7 would waste a lot of resources.
Options Explored:
- We have ruled out Cloud Run due to its memory, CPU, and timeout limitations.
- Apache Beam with Cloud Dataflow: there seems to be less support for non-analytical workloads, and the DataFrame implementation in Apache Beam still looks buggy.
- Spark and Dataproc seem to be a good fit for analytical workloads, although a serverless option would be much preferred.
Which direction should I be looking in?