I work at a cloud lab rental company that provides temporary GCP environments for students to practice and learn. To improve our platform, I am designing a cost-control system in Google Cloud Platform (GCP) that covers multiple projects.
Goal:
I want to automatically stop, scale down, or disable all billable resources within a project when its defined budget threshold is exceeded, in order to prevent further costs.
Scope:
- The solution should work across multiple projects.
- Each project must be handled independently. If one project exceeds its budget, only that project’s resources should be affected, without impacting other projects.
Budget notification requirements:
- When a project reaches 50% and 75% of its budget, notification emails should be sent to a central/management account.
- When the project reaches 100% of its budget, an email notification should be sent, and an automated action should be triggered to stop, scale down, or disable all stoppable resources within that specific project.
- The automation must strictly apply only to the project that exceeded its budget, not to other projects.
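To make the threshold requirements concrete, here is a minimal sketch of the decision logic I have in mind. It assumes the budget notification payload carries `budgetDisplayName` and `alertThresholdExceeded` (a fraction such as 0.5 or 1.0), and that the payload itself does not name the project. The `lab-<project_id>` budget naming convention and the function names are my own hypothetical choices, not anything GCP provides.

```python
from typing import Optional

# Hypothetical convention: one budget per project, named "lab-<project_id>".
BUDGET_NAME_PREFIX = "lab-"

NOTIFY_THRESHOLDS = (0.5, 0.75)   # email only
SHUTDOWN_THRESHOLD = 1.0          # email plus automated action


def project_from_budget_name(display_name: str) -> Optional[str]:
    """Recover the project ID from the budget name, if it follows the convention."""
    if display_name.startswith(BUDGET_NAME_PREFIX):
        return display_name[len(BUDGET_NAME_PREFIX):] or None
    return None


def decide_action(payload: dict) -> dict:
    """Map one budget notification payload to 'none', 'notify', or 'shutdown'."""
    project_id = project_from_budget_name(payload.get("budgetDisplayName", ""))
    # Assumed to be absent while spend is still below the first threshold.
    threshold = payload.get("alertThresholdExceeded")
    if project_id is None or threshold is None:
        action = "none"
    elif threshold >= SHUTDOWN_THRESHOLD:
        action = "shutdown"
    elif threshold >= min(NOTIFY_THRESHOLDS):
        action = "notify"
    else:
        action = "none"
    return {"project_id": project_id, "threshold": threshold, "action": action}
```

Because the project ID is derived from the budget that fired, an action decided here can only ever target that one project, which is the isolation property I am after.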
Current approach:
- Use GCP Budget alerts with notifications (Pub/Sub and/or email).
- Trigger a Python-based Cloud Function (Gen2) from Pub/Sub.
- Use the Cloud Function to identify and stop or disable running resources.
- Use Terraform to provision and manage the infrastructure.
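For the Pub/Sub-to-function step, this is roughly how I expect to unpack the event. It is a sketch under assumptions: that the Gen2 function receives the Pub/Sub envelope as a dict with a base64-encoded `message.data` field, and that the budget ID arrives in the message attributes as `budgetId`. The `dedup_key` helper is hypothetical; I added it because, as I understand it, budget notifications are re-sent several times a day, so the shutdown action should run at most once per budget, period, and threshold.

```python
import base64
import json


def decode_budget_event(event_data: dict) -> dict:
    """Extract the JSON payload and attributes from a Pub/Sub envelope."""
    message = event_data.get("message", {})
    raw = base64.b64decode(message.get("data", ""))
    payload = json.loads(raw) if raw else {}
    return {"payload": payload, "attributes": message.get("attributes") or {}}


def dedup_key(decoded: dict) -> tuple:
    """Build a key identifying one (budget, billing period, threshold) event.

    A real function would persist seen keys in a durable store; an in-memory
    set would not survive across function instances.
    """
    payload = decoded["payload"]
    return (
        decoded["attributes"].get("budgetId"),
        payload.get("costIntervalStart"),
        payload.get("alertThresholdExceeded"),
    )
```

In the deployed function, the dict passed to `decode_budget_event` would be the `data` attribute of the incoming CloudEvent.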
Challenges / Questions:
- What is the recommended architecture for implementing this type of budget-based auto shutdown system across multiple projects?
- How can I reliably identify and handle different resource types (e.g., Compute Engine, GKE, Cloud Run, etc.), given that not all services can be directly “stopped”?
- What are the best practices for configuring budget alerts for multiple thresholds (50%, 75%, 100%) with both email and Pub/Sub notifications?
- What are the best practices for ensuring this operates safely and does not unintentionally disrupt critical resources?
- Are there any limitations or delays in budget alert notifications that could affect real-time cost control?
Additional context:
- I am using Terraform for infrastructure provisioning.
- The automation logic will be implemented using Python in Cloud Functions.
- I understand that GCP does not provide a single API to stop all resources, so I am looking for a practical and scalable approach to handle different resource types.
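Since there is no single "stop everything" API, the structure I am considering is a registry with one handler per resource type, plus an opt-out label for critical resources. The sketch below only builds a dry-run plan and makes no API calls. The `Resource` class, the kind strings, and the `budget-protect` label are all hypothetical names I made up for illustration, and the action each handler describes is just my current guess at the least destructive option for that service.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

PROTECT_LABEL = "budget-protect"  # hypothetical opt-out label


@dataclass
class Resource:
    kind: str
    name: str
    labels: Dict[str, str] = field(default_factory=dict)


# Maps a resource kind to the function that knows how to neutralize it.
HANDLERS: Dict[str, Callable[[str, Resource], str]] = {}


def handler(kind: str):
    """Decorator that registers a handler for one resource kind."""
    def register(fn):
        HANDLERS[kind] = fn
        return fn
    return register


@handler("compute_instance")
def stop_instance(project_id: str, res: Resource) -> str:
    # A real handler would call the Compute Engine instance stop method here.
    return f"stop {res.name}"


@handler("gke_node_pool")
def scale_down_pool(project_id: str, res: Resource) -> str:
    # A real handler would resize the node pool to zero nodes.
    return f"resize {res.name} to 0"


@handler("cloud_run_service")
def restrict_service(project_id: str, res: Resource) -> str:
    # Cloud Run has no "stop"; restricting traffic is one possible substitute.
    return f"restrict {res.name}"


def plan_shutdown(project_id: str, resources: List[Resource]) -> List[str]:
    """Return the planned actions for one project without executing them."""
    plan = []
    for res in resources:
        if res.labels.get(PROTECT_LABEL) == "true":
            plan.append(f"skip {res.name} (protected)")
        elif res.kind in HANDLERS:
            plan.append(HANDLERS[res.kind](project_id, res))
        else:
            plan.append(f"unhandled {res.kind}: {res.name}")
    return plan
```

Unknown resource types are reported rather than ignored, so gaps in coverage show up in the plan. I would like to know whether this per-type registry is a reasonable pattern or whether there is a better-established approach.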
Any guidance, reference architectures, or best practices would be greatly appreciated.