We recently started using Cloud Composer for our data engineering pipelines. Even for a small environment that autoscales from 1 to 3 workers (in practice using just 1 worker most of the time), it's quite expensive at ~$350/month. We don't currently have many DAGs, and each one runs just once daily for around 5 to 10 minutes, so our Cloud Composer environment is actually sitting idle the vast majority of the time. Since you get charged for the environment running 24/7, not just when tasks are running, the value proposition is not great for us at the moment.
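To put a number on it, here's the back-of-envelope math (using the figures above; the 7.5-minute average is my own assumption splitting the 5-10 minute range):

```python
# Back-of-envelope cost per minute of actual task execution.
# Figures are from our setup; 7.5 min/day is an assumed average.
monthly_cost = 350.0          # ~$350/month for the Composer environment
task_minutes_per_day = 7.5    # each daily DAG runs ~5-10 minutes
days_per_month = 30

total_task_minutes = task_minutes_per_day * days_per_month
cost_per_task_minute = monthly_cost / total_task_minutes
print(f"${cost_per_task_minute:.2f} per task-minute")  # ≈ $1.56
```

So we're paying on the order of $1.50 for every minute of real work, which is exactly what my teammate keeps pointing at.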
Because of this, I'm in quite a disagreement with my teammate. Despite the unfortunate cost structure at the moment, I still think Airflow/Cloud Composer is the best solution for building and managing data pipelines. Going forward we will certainly have more DAGs, running more frequently, so the value proposition will surely improve significantly. I'm very much in favor of using a technology that's future-proof.
However, my teammate just can't get over the fact that Cloud Composer runs 24/7 regardless of whether any task is running, and that we're charged for all those idle minutes. To him, the dollar cost per minute of task execution is just ridiculous. He thinks this is clearly a flaw in Airflow's implementation/design: nodes are running and costs are accruing while nothing is being processed. He contends that we could build a much cheaper and equally capable system ourselves, say using Cloud Scheduler and Cloud Run/Cloud Functions, taking advantage of the background trigger functionality in Google Cloud's document store (Firestore), e.g., onCreate(), onUpdate(), etc., to trigger dependencies between tasks. In such a system the fixed cost is much, much lower, and vCPU/memory cost is only incurred when tasks are actually running.
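To make his idea concrete, here's roughly the pattern he has in mind, simulated locally in plain Python. None of these names are real GCP APIs: the document store, the on_create registration, and the task functions are all hypothetical stand-ins for Firestore onCreate() triggers wired to Cloud Functions, with Cloud Scheduler kicking off the first task.

```python
from typing import Callable

class FakeDocumentStore:
    """Stand-in for Firestore: writing a document fires onCreate handlers."""
    def __init__(self):
        self._handlers: dict[str, list[Callable[[dict], None]]] = {}
        self.docs: dict[str, dict] = {}

    def on_create(self, collection: str, handler: Callable[[dict], None]):
        # Analogous to deploying a Cloud Function with a Firestore trigger.
        self._handlers.setdefault(collection, []).append(handler)

    def create(self, collection: str, doc_id: str, data: dict):
        self.docs[f"{collection}/{doc_id}"] = data
        for handler in self._handlers.get(collection, []):
            handler(data)  # in GCP this would fire asynchronously

store = FakeDocumentStore()
log = []

# Task B depends on task A: it runs when A writes its completion document.
def task_a(run_id: str):
    log.append("extract")
    store.create("task_a_done", run_id, {"run_id": run_id})

def task_b(event: dict):
    log.append("transform")
    store.create("task_b_done", event["run_id"], event)

store.on_create("task_a_done", task_b)

# Cloud Scheduler would invoke this daily; here we call it directly.
task_a("2024-01-01")
print(log)  # ['extract', 'transform']
```

The appeal is obvious: nothing runs between invocations, so compute cost is near zero when the pipeline is idle. My worry is everything this sketch leaves out: retries, backfills, fan-in dependencies, monitoring, and so on.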
My view is that reinventing the wheel and building our own data engineering tool, when there's a tried and tested solution for our needs, is completely unnecessary, especially since the publicly available solution is proven to be scalable and reliable. But I'm really struggling to counter his arguments and convince him.
So here are some questions on which I'd really appreciate your input:
1. Why does Cloud Composer use GKE under the hood instead of Cloud Run, or some other mechanism that can completely turn off between task executions and hence not run up cost? Is this a flaw in Airflow/Cloud Composer's architecture that people pay for anyway because of the convenience and the lack of alternatives? Or was it an intentional design decision and an engineering necessity?

2. What advantages does a managed Airflow service like Cloud Composer provide over an in-house solution like the one described above that are important but not immediately obvious?

3. In general, if you are a Cloud Composer user, how do you feel about it and how do you justify its expense? Has it been worth it?