I recently started studying for the GCP Cloud Architect exam, so I asked my CTO to give me the Viewer role on our GCP environment so I could have a look.
As far as I can tell, the infrastructure is not in good shape. Can someone tell me if my thinking is correct?
Current Architecture:-
VM instances are n2-standard-16, but CPU utilization hasn’t even exceeded 40% (last 30 days).
VMs are in an unmanaged instance group with max instances set to 1.
Instance Group is used as a backend service to a load balancer.
Instances are assigned public IPs even though there is a load balancer.
Everything I have studied so far tells me this is not optimal. Following are the solutions I suggest (with my half-baked knowledge).
Solutions I Suggest:-
Change the machine family to a cheaper one and add the VMs to a managed instance group with a max of maybe 5 instances, so the load balancer can distribute the load if it increases.
Remove public IPs from the VM instances and use IAP tunneling to SSH into the VMs.
Problem I have even with my suggestions:-
We have one VM instance with public SSH enabled because we are using a BitBucket CI/CD pipeline to push changes to production, and the pipeline copies the files directly to the VM.
But if I configure the instance group to have multiple instances, the pipeline will not copy the files to the other VMs that get created automatically.
How can I tackle this CI/CD pipeline automation issue?
Well, I wouldn’t say the infrastructure is sticks and stones but it definitely feels like there’s room for improvement.
First of all, you’re totally right: keeping a 16-vCPU VM that never goes above 40% CPU is just wasting money. The cloud charges for the time a VM is running, by the hour or second, not for how much of it you actually use. Many teams get disappointed because they simply moved their workloads to the cloud without adapting to the cloud model. It won’t be cheaper that way.
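To put a rough number on that, here is a minimal sketch that only looks at CPU. The 16 vCPUs come from the n2-standard-16 machine type and the 40% peak from the question; the function name is made up for illustration, and memory, disk and network needs are ignored, so treat the result as a starting point rather than a sizing decision.

```python
def idle_capacity(vcpus: int, peak_utilization: float) -> tuple:
    """Return (busy_vcpus_at_peak, idle_fraction) for a single VM.

    peak_utilization is a fraction between 0 and 1, e.g. 0.40 for 40%.
    """
    if not 0.0 <= peak_utilization <= 1.0:
        raise ValueError("peak_utilization must be between 0 and 1")
    busy_vcpus = vcpus * peak_utilization
    idle_fraction = 1.0 - peak_utilization
    return busy_vcpus, idle_fraction


# n2-standard-16 has 16 vCPUs; the 30-day peak in the question is 40%.
busy, idle = idle_capacity(vcpus=16, peak_utilization=0.40)
print(f"busy vCPUs at peak: {busy:.1f}, idle share at peak: {idle:.0%}")
```

In other words, even at its busiest moment the VM uses fewer than 7 of the 16 vCPUs it is billed for.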
Your idea of using smaller VMs inside a Managed Instance Group (MIG) is definitely going in the right direction. It’s more effort to set up, sure, but that’s part of doing things the Cloud Way (imo).
About the public IP: using IAP is indeed a better practice, maybe the best one.
Regarding your CI/CD point, I have to admit that I’m more used to serverless and container-based deployments now. Honestly, I’d even ask, wouldn’t something like App Engine be a better fit for your use case in the end?
That said, as far as MIGs go, I’d suggest reading up on how to update instance templates. That may be the key for proper deployments in an autoscaled setup.
Your assessment is correct. The current setup is not optimal for several reasons, and your proposed solutions are excellent and align with GCP best practices for building scalable and resilient applications.
Moving to a more cost-effective machine family is a great first step. Combining this with a managed instance group (MIG) will allow you to take advantage of autoscaling. With autoscaling, you set a minimum and maximum number of instances and the group automatically adds or removes instances based on load, so you only pay for what you need while maintaining performance.
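As a rough illustration of how a utilization target turns load into an instance count, here is a simplified model. It is not the actual Compute Engine autoscaler algorithm, which also smooths over time and waits for instances to initialize. The 4-vCPU instance size, the 0.6 target and the 1 to 5 instance range are assumptions picked to match the numbers in this thread.

```python
import math


def instances_needed(busy_vcpus: float,
                     vcpus_per_instance: int,
                     target_utilization: float,
                     min_instances: int,
                     max_instances: int) -> int:
    """Instances required to keep average CPU at or below the target.

    Simplified model: each instance contributes
    vcpus_per_instance * target_utilization usable vCPUs, and the result
    is clamped to the [min_instances, max_instances] range.
    """
    usable_per_instance = vcpus_per_instance * target_utilization
    wanted = math.ceil(busy_vcpus / usable_per_instance)
    return max(min_instances, min(max_instances, wanted))


# 6.4 busy vCPUs is the observed peak (40% of 16 vCPUs).
for busy in (1.0, 3.0, 6.4, 20.0):
    n = instances_needed(busy, vcpus_per_instance=4, target_utilization=0.6,
                         min_instances=1, max_instances=5)
    print(f"{busy:>5.1f} busy vCPUs -> {n} instance(s)")
```

Under these assumptions the group would sit at one small instance most of the time and grow to three at the observed peak, instead of running one 16-vCPU machine around the clock.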
Removing public IPs and using Identity-Aware Proxy (IAP) for SSH access is a crucial security enhancement. IAP gives you a secure way to reach your instances for administrative purposes without exposing them to the public internet, which is both more secure and easier to manage than leaving SSH open on a public IP.
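For reference, the two pieces involved are a firewall rule that lets IAP's TCP forwarding range (35.235.240.0/20) reach port 22, and the `--tunnel-through-iap` flag on `gcloud compute ssh`. The sketch below only builds and prints the commands so they can be reviewed before running them; the rule, network, instance and zone names are placeholders, and it assumes your user already has permission to use IAP-secured tunnels.

```python
import shlex

# Source range used by IAP TCP forwarding.
IAP_TCP_FORWARDING_RANGE = "35.235.240.0/20"


def iap_firewall_rule_cmd(rule_name: str, network: str) -> list:
    """gcloud command that allows SSH to the VMs only from IAP's range."""
    return [
        "gcloud", "compute", "firewall-rules", "create", rule_name,
        f"--network={network}",
        "--direction=INGRESS",
        "--allow=tcp:22",
        f"--source-ranges={IAP_TCP_FORWARDING_RANGE}",
    ]


def iap_ssh_cmd(instance: str, zone: str) -> list:
    """gcloud command that opens an SSH session through an IAP tunnel."""
    return [
        "gcloud", "compute", "ssh", instance,
        f"--zone={zone}",
        "--tunnel-through-iap",
    ]


# Placeholder names; replace them with your own before running anything.
print(shlex.join(iap_firewall_rule_cmd("allow-ssh-from-iap", "default")))
print(shlex.join(iap_ssh_cmd("my-vm", "us-central1-a")))
```

With the rule in place and the public IP removed, port 22 is reachable only through the IAP tunnel.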
Your current CI/CD process is a classic example of mutable infrastructure, and your concern about deploying to multiple instances in a managed instance group is valid. The recommended approach is to shift to an “immutable infrastructure” model.
With a cloud-native approach, you treat your infrastructure as immutable: something that doesn’t change once it’s up and running. So if you need to make any updates, like new application code, security fixes, or system updates, you build a brand new server image and swap out the old servers completely, instead of tweaking the ones you already have.
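In pipeline terms, that could look roughly like the sketch below: instead of copying files to a VM, each release creates a new instance template and asks the MIG to roll it out. It assumes an earlier pipeline step has already built a disk image named after the commit (for example with Packer), and the `app-<sha>` naming scheme, MIG name, zone and machine type are all hypothetical. The commands are only built and printed here, not executed.

```python
import shlex


def release_commands(commit_sha: str, mig: str, zone: str,
                     machine_type: str) -> list:
    """Build the gcloud commands for one immutable release.

    Assumes an image called app-<short sha> already exists in the project.
    """
    version = f"app-{commit_sha[:7].lower()}"

    # 1. New template pointing at the freshly built image, without a public IP.
    create_template = [
        "gcloud", "compute", "instance-templates", "create", version,
        f"--machine-type={machine_type}",
        f"--image={version}",
        "--no-address",
    ]

    # 2. Rolling replacement: add one new VM at a time, never drop capacity.
    rolling_update = [
        "gcloud", "compute", "instance-groups", "managed",
        "rolling-action", "start-update", mig,
        f"--version=template={version}",
        f"--zone={zone}",
        "--max-surge=1",
        "--max-unavailable=0",
    ]
    return [create_template, rolling_update]


# Hypothetical values a CI job would pass in.
for cmd in release_commands("3F9C2AB81D", "web-mig", "us-central1-a",
                            "n2-standard-4"):
    print(shlex.join(cmd))
```

Because every VM the autoscaler creates later boots from the same template, new instances come up with the current release already on them, which is exactly the gap the file-copy pipeline leaves open.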