Authors: Sanjna Srivatsa (Data Scientist, Product, CloudHealth), Darien Schettler (Generative AI Architect, Broadcom), Pathik Sharma (Cloud FinOps Lead, Google Cloud Consulting)
Introduction
At CloudHealth, we pride ourselves on helping businesses optimize their cloud spend. It’s what we do. So, when it came to optimizing our own AI operating costs, we faced a delightful irony: we, the experts in FinOps, found ourselves navigating the complexities of large language model (LLM) expenses. This is the story of how we tackled that challenge head-on, slashing our AI operating costs by a remarkable 90% immediately upon migrating to Google’s Gemini model, all while boosting performance. Combined with a continuous optimization strategy, this move ultimately resulted in a cost reduction of 99.6%. It took some time, but the results were well worth it.
The challenge: Navigating FinOps complexity with AI
Building an AI solution for FinOps presents unique hurdles. The financial data schema is often complex and vast, with hundreds of technical and sometimes non-intuitive column names (e.g., AWS – lineItem/UnblendedCost, Azure – MeterId, GCP - project.ancestry_numbers). This requires deep domain expertise to understand the relationships between fields. Furthermore, user prompts are frequently imprecise (e.g., “What was my EC2 spending?”), leading to ambiguity that the LLM must interpret correctly to avoid generating incorrect SQL and flawed reports. Every incorrect interpretation or inefficient query wasn’t just a poor user experience; it was a direct, wasted cost, driving up our token counts and compute time for an ineffective result. Customization, granularity (hourly vs. daily data), and constantly evolving datasets further complicate the picture.
Initially, we embarked on this AI journey in 2023, selecting a foundational model and building with a budget in mind. However, we quickly overshot our budget by 300% (though fortunately, as a private beta, the numbers were in the tens of thousands). This experience highlighted critical cost and performance issues, including high latency and a lower-than-expected accuracy rate in generating the correct SQL. These challenges prompted our decision to migrate after evaluating more models and their associated costs.
Optimizing AI costs: Our journey to Gemini
Our migration to Gemini was driven by a commitment to optimizing AI costs within our FinOps operations. We focused on several key areas:
- Smart model selection: Matching the AI model’s prowess to the task complexity.
- Precision prompting with examples: Engineering for efficiency, task-specificity, and clear output structure.
- Aggressive caching: Maximizing implicit and explicit caching at both the provider and application level to reduce latency and cost.
- Keep it lean: Optimizing what is sent to and received from models, focusing on what truly matters.
- Hybrid AI approaches: Blending LLMs with deterministic logic and classical ML, using the right tool for the job.
- Proactive governance and iteration: Establishing cost controls, monitoring usage, and continuously optimizing for value through data-driven decisions and frequent stakeholder meetings.
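To make the first, fourth, and fifth of these concrete, the routing logic behind smart model selection, deterministic short-circuits, and the hybrid approach can be sketched roughly as follows. This is a minimal illustration, not our production code: the tier names, prices, and regex rules are all assumptions invented for the example.

```python
import re

# Illustrative model tiers (names and prices are placeholder assumptions)
MODEL_TIERS = {
    "flash-lite": {"cost_per_1m_tokens": 0.10},   # cheapest tier
    "flash":      {"cost_per_1m_tokens": 0.30},   # default workhorse
    "pro":        {"cost_per_1m_tokens": 1.25},   # complex reasoning
}

# Deterministic short-circuit: well-known questions never reach an LLM.
RULES = {
    r"\btotal (cloud )?spend\b.*\blast month\b": "report:monthly_total",
    r"\btop \d+ services? by cost\b": "report:top_services",
}

def route(prompt: str) -> tuple[str, str]:
    """Return (handler, detail) for a user prompt."""
    lowered = prompt.lower()
    for pattern, report in RULES.items():
        if re.search(pattern, lowered):
            return ("rules_engine", report)       # zero LLM tokens spent
    # Crude complexity heuristic: longer, multi-clause prompts get a bigger model.
    words = len(lowered.split())
    if words > 40 or " join " in lowered:
        return ("llm", "pro")
    if words > 15:
        return ("llm", "flash")
    return ("llm", "flash-lite")

print(route("What was my total spend last month?"))
# ('rules_engine', 'report:monthly_total')
```

The design point is that every prompt answered by the rules engine costs nothing in tokens, and everything else pays only for the cheapest model that can plausibly handle it.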
The migration to Gemini: A phased approach and dramatic results
Our migration was a deliberate, multi-stage process. We didn’t just ‘lift-and-shift’; we meticulously optimized at every layer, from model selection to caching. The table below breaks down the cost and latency at each key stage, modeled on a monthly volume of 100 million tokens.
| Stage | Key Actions | Monthly Cost | Avg. Latency | ∆ Cost (vs. Baseline) |
|---|---|---|---|---|
| Baseline (2023) | Early Large Language Model Stack | $1,029.00 | 3.0 s | - |
| Lift-&-Shift | Migrated to Gemini Pro | $75.00 | 1.2 s | -92.70% |
| Logic | Deterministic Short-Circuit (15% tasks handled by rules engine) | $63.75 | 1.03 s | -93.80% |
| Prompt | Prompt Compression (25% fewer tokens) | $47.81 | 1.03 s | -95.30% |
| Optimize (2025) | Upgrade to Gemini Flash & Caching | $4.39 | 0.17 s | -99.60% |
As the table shows, the initial lift-and-shift to Gemini Pro gave us an immediate and massive 92.7% cost reduction.
But we didn’t stop there. The deepest savings came in the final optimization stage, where we combined three key tactics:
- Model tiering: We switched to the hyper-efficient Gemini Flash model for the majority of queries.
- Hybrid approach: We routed a small percentage of complex misses to a cheaper “Lite” model, perfectly matching cost to complexity.
- Aggressive caching: We implemented a 95% context-cache hit rate, with cached tokens billed at a 75% discount.
This final multi-pronged optimization alone dropped our costs by another 90% and slashed latency by 83%, bringing our total savings to 99.6%.
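The caching arithmetic is easy to verify: with a 95% hit rate and a 75% discount on cached tokens, the blended per-token price falls to under a third of the uncached rate, before any model-tier savings. A quick sketch (only the two ratios come from the figures above; everything else is generic arithmetic):

```python
def effective_cost_factor(cache_hit_rate: float, cached_discount: float) -> float:
    """Blended price multiplier vs. paying full price for every token."""
    cached = cache_hit_rate * (1 - cached_discount)   # cached tokens, discounted
    uncached = (1 - cache_hit_rate) * 1.0             # the rest at full price
    return cached + uncached

# 95% hit rate, 75% discount on cached tokens -> ~0.29x the uncached cost
factor = effective_cost_factor(cache_hit_rate=0.95, cached_discount=0.75)
print(f"{factor:.4f}")  # 0.2875
```

Caching alone therefore accounts for roughly a 71% reduction; the switch to Gemini Flash and the Lite-model routing supply the rest of the stage’s savings.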
Evaluating our AI tool for FinOps use cases
To ensure the effectiveness of our AI-powered FinOps tool, we established rigorous evaluation metrics:
- Query accuracy: Is the data in the response correct (e.g., cost, dates, categories)? After migrating to Gemini, we saw a significant increase in the accuracy of our query responses, leading to more reliable financial insights.
- Intent recognition: Did the bot correctly interpret the user’s question, especially with ambiguous FinOps terms (e.g., “compute spend”)? Gemini’s advanced understanding of natural language greatly improved our intent recognition, making the tool more effective at addressing user queries.
- Query quality: Is the SQL correct? Did it choose the right dataset and construct the right query? The quality of the SQL generated by our AI tool improved considerably, enabling the right data to be pulled and processed efficiently.
- Hallucination rate (for textual responses): How often does the bot invent plausible but incorrect information? The migration to Gemini also reduced hallucination rates, producing more trustworthy textual responses.
- Query speed: Beyond cost and accuracy, the speed at which our AI tool could process and respond to queries improved dramatically, enhancing the overall user experience and enabling quicker decision-making.
Our testing methods included creating a “validation dataset” – a standard set of commonly asked questions with known, validated answers that served as our benchmark. We also implemented human-in-the-loop (HITL) review, in which experts manually review and rate responses across these metrics – critical for judging answer quality, especially where a technically correct answer might still be unhelpful.
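A minimal version of the validation-dataset check might look like the sketch below. The metric names mirror the list above, but the data structures, the `generate_sql` and `execute` stand-ins, and the exact-match scoring are all illustrative assumptions, not our actual harness.

```python
from dataclasses import dataclass

@dataclass
class ValidationCase:
    question: str
    expected_sql: str        # known-good, validated query
    expected_answer: float   # known-good numeric result

def run_validation(cases, generate_sql, execute):
    """Score generated SQL against the validation dataset.

    `generate_sql` stands in for the model call and `execute` for the
    warehouse query; both are placeholders for this sketch.
    """
    results = {"query_accuracy": 0, "query_quality": 0, "n": len(cases)}
    for case in cases:
        sql = generate_sql(case.question)
        if sql.strip().lower() == case.expected_sql.strip().lower():
            results["query_quality"] += 1    # SQL matches the benchmark
        if execute(sql) == case.expected_answer:
            results["query_accuracy"] += 1   # correct data, however queried
    return results
```

In practice the quality check would use SQL normalization or result-set comparison rather than string equality, and the HITL review layers human ratings on top of these automated scores.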
Conclusion: A FinOps revolution with Gemini
Our move to Gemini was a game-changer for CloudHealth. By adopting Google’s Gemini model and optimizing our approach, we cut operating costs by over 99% and significantly boosted performance. This real-world example shows how a leading AI model like Gemini delivers tangible business results and a strong return on investment.
This win gives us the confidence to build more AI capabilities at scale and is a powerful example of the growing partnership between CloudHealth (Broadcom) and Google Cloud.
To learn more about CloudHealth’s AI capabilities, click here.
To learn more about Google Cloud’s FinOps practice, click here.