Score:1

Why should I use Cloud Composer?

yl_elm_city

We recently started using Cloud Composer for our data engineering pipelines. Even a small environment that autoscales from 1 to 3 workers (in fact using just 1 worker most of the time) is quite expensive at ~$350/month. We don't currently have many DAGs, and each DAG runs just once a day for around 5 to 10 minutes, so our Cloud Composer environment sits idle the vast majority of the time. Since you are charged for the environment, which runs 24/7, and not just for the time when tasks are running, the value proposition is currently not great for us.
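To put a rough number on the idle time, here is a back-of-the-envelope sketch. The ~$350/month figure and the 5 to 10 minute daily runs come from our setup; the DAG count of 3, the 30-day month, and the assumption that runs do not overlap are placeholders I picked for illustration, and the function name is my own.

```python
def idle_cost_summary(monthly_cost_usd, num_dags, minutes_per_run,
                      runs_per_day=1, days_per_month=30):
    """Estimate how much of the month the environment is actually busy.

    Assumes DAG runs do not overlap, so busy minutes simply add up.
    """
    busy_minutes = num_dags * minutes_per_run * runs_per_day * days_per_month
    total_minutes = days_per_month * 24 * 60
    return {
        "busy_minutes": busy_minutes,
        "utilization": busy_minutes / total_minutes,
        "cost_per_busy_minute": monthly_cost_usd / busy_minutes,
    }


if __name__ == "__main__":
    # Hypothetical inputs: 3 DAGs, each taking the upper bound of 10 minutes.
    summary = idle_cost_summary(monthly_cost_usd=350, num_dags=3, minutes_per_run=10)
    print(f"busy minutes per month: {summary['busy_minutes']}")
    print(f"utilization: {summary['utilization']:.2%}")
    print(f"cost per busy minute: ${summary['cost_per_busy_minute']:.2f}")
```

With these placeholder inputs the environment is busy for only a small fraction of the month, which is the core of my teammate's objection. Adding DAGs or running them more often raises the utilization while the fixed cost stays roughly the same.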

Because of this, I am in quite a bit of contention with my teammate. Despite the unfortunate cost structure at the moment, I still think Airflow/Cloud Composer is the best solution for building and managing data pipelines. Going forward we will certainly have more DAGs, and DAGs that run more frequently, so the value proposition should improve significantly. I am very much in favor of using a technology that’s future-proof.

However, my teammate just can't get over the fact that Cloud Composer is set up to run 24/7 regardless of whether any task is running, and that we are charged for all those idle minutes. To him, the dollar cost per minute of task execution is just ridiculous. He thinks this is clearly a flaw in Airflow’s implementation/design: nodes are running and cost is accumulating when nothing is being processed. He contends that we can build a much cheaper and equally capable system ourselves, say using Cloud Scheduler and Cloud Run/Cloud Functions, and taking advantage of the background triggers in Google Cloud's document store (i.e., Firestore), e.g., onCreate(), onUpdate(), etc., to trigger dependencies between tasks. In such a system, the fixed cost is much lower and the vCPU/memory cost is incurred only when tasks are actually running.
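As I understand his idea, the dependency handling would look roughly like the sketch below. This is only an in-memory stand-in, not real Firestore or Cloud Functions code: `FakeDocStore`, `DEPENDENCIES`, `make_trigger`, and the task names are all hypothetical, and the `on_update` hook merely imitates the kind of background trigger he has in mind. Retries, logging, and concurrent updates are left out.

```python
class FakeDocStore:
    """In-memory stand-in for a document store that fires update triggers."""

    def __init__(self):
        self.docs = {}
        self.listeners = []

    def on_update(self, callback):
        # Register a callback invoked as callback(doc_id, data) on every write.
        self.listeners.append(callback)

    def set(self, doc_id, data):
        self.docs[doc_id] = data
        for callback in list(self.listeners):
            callback(doc_id, data)


# Hypothetical pipeline: each task lists the upstream tasks it waits for.
DEPENDENCIES = {
    "extract": [],
    "transform": ["extract"],
    "load": ["transform"],
}


def make_trigger(store, run_log):
    """Wire up a handler that starts a task once all its upstream tasks are done."""

    def run_task(task):
        run_log.append(task)  # placeholder for the real work
        store.set(task, {"status": "done"})

    def handle_update(doc_id, data):
        if data.get("status") != "done":
            return
        for task, upstream in DEPENDENCIES.items():
            if task in store.docs:
                continue  # already ran
            ready = all(
                store.docs.get(u, {}).get("status") == "done" for u in upstream
            )
            if doc_id in upstream and ready:
                run_task(task)

    store.on_update(handle_update)
    return run_task


if __name__ == "__main__":
    store, log = FakeDocStore(), []
    start = make_trigger(store, log)
    start("extract")  # in his design, Cloud Scheduler would kick this off
    print(log)
```

Each status write triggers the handler, which starts any downstream task whose upstream tasks have all finished, so nothing runs (or costs anything) between writes.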

My view is that reinventing the wheel by building our own data engineering tool, when there’s a tried and tested solution for our need, is completely unnecessary, especially since the publicly available solution is proven to be scalable and reliable. But I am really struggling to convince him and counter his arguments.

So here are some questions on which I would really appreciate your input:

  1. Why does Cloud Composer use GKE under the hood instead of Cloud Run, or some other mechanism that can shut down completely between task executions and so avoid running up cost? Is this a flaw in Airflow/Cloud Composer’s architecture that people are still willing to pay for because of the convenience and the lack of an alternative? Or was it an intentional design decision and an engineering necessity?

  2. What advantages does a managed Airflow service, e.g., Cloud Composer, provide compared to an in-house solution like the one mentioned above, that are important but not immediately obvious?

  3. In general, if you are a Cloud Composer user, how do you feel about it and how do you justify its expense? Has it been worth it?

