Modernize IT Operations with Gemini: Two use cases that boost efficiency

Authors:
Jingyi Wang, GenAI Field Solution Architect
Shirong Liang, GenAI Field Solution Architect

The paradigm for IT Operations is shifting from reactive firefighting to proactive, intelligent automation. At the heart of this evolution is generative AI, which is poised to redefine what’s possible for efficiency and insight. While the potential is clear, concrete implementation details are scarce. In this article, we go behind the scenes of real customer implementations to reveal a blueprint for this transition, demonstrating two transformative use cases where Gemini was applied to automate complex operational processes and deliver powerful data insights.

Case study #1: Natural language to PromQL for dashboard generation

The challenge: The ad-hoc query bottleneck

Business users needed instant access to operational metrics to make decisions, but lacked the expertise to write queries in Prometheus Query Language (PromQL). This led to a constant stream of ad-hoc requests that consumed valuable engineering time and left business users waiting. The goal was to break this cycle with a self-service tool that could translate simple, natural language questions into expert-level PromQL queries, generating dashboards on the fly.

The solution: Text-to-PromQL framework


Metrics dictionary and Text2PromQL solution workflow

1. The semantic foundation: An automated metric dictionary

Before analyzing questions, the AI needs to understand the company’s unique data landscape. The team built a system that automatically creates and maintains a “metric dictionary.”

  • It ingests existing PromQL statements from across the company’s systems.
  • Using Gemini and Google Search Grounding, it parses these statements to identify all available metrics and their associated labels.
  • This creates a comprehensive, living dictionary that serves as the semantic foundation, ensuring the AI’s suggestions are always relevant and context-aware.

PROMPT_METRICS_DESC = """You are a Prometheus Query Language (PromQL) expert. Your task is to extract the metric and label names and generate a corresponding description for each, using the provided sample queries.

INSTRUCTIONS:
- Always use Google Search to search for the metric as the first step. Do not rely on internal knowledge. Begin the analysis by searching for the meaning of the metric on Google, for example by searching for "<metric name>".
- Indicate the hierarchical and scope-related concepts of the monitored application or system.
- The labels field is an empty list when no label is indicated.
- The output consists of attributes defined below:
metric_name: the metric name extracted from promql.
description: describe purely the metric meaning and its recommended usage while ignoring provided filters.
labels: a list of labels extracted from promql.
label_name: label name extracted from promql.
label_description: describe purely the label meaning and its corresponding usage while ignoring provided values.

INPUT:
```PromQL
{promql_list_str}
```

OUTPUT FORMAT:
```json
{{
"metric_name": "",
"description": "",
"labels":[
{{
"label_name": "",
"label_description": ""}}, ...
]
}}
```

OUTPUT:"""

from google.genai import types

tools = [
    types.Tool(google_search=types.GoogleSearch())
]

generate_content_config = types.GenerateContentConfig(
    ..., tools=tools,
)
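
To illustrate how the prompt and the grounded configuration above fit together, here is a minimal sketch of the extraction call using the Google Gen AI SDK. The project, location, model name, and function name are illustrative assumptions rather than the exact production implementation.

from google import genai

# Minimal sketch: a Vertex AI client (project and location are placeholders)
client = genai.Client(vertexai=True, project="<your-project>", location="us-central1")

def extract_metric_descriptions(promql_list):
    # Fill the extraction prompt with the sampled PromQL statements
    prompt = PROMPT_METRICS_DESC.format(promql_list_str="\n".join(promql_list))
    # Call Gemini with Google Search grounding enabled via generate_content_config
    response = client.models.generate_content(
        model="gemini-2.0-flash",  # model name is illustrative
        contents=prompt,
        config=generate_content_config,
    )
    # The returned JSON describes the metric and its labels for the metric dictionary
    return response.text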

2. The Text-to-PromQL generation pipeline

When a user asks a question, it triggers a multi-stage process powered by Gemini:

Step 2.1 Query enhancement:

The user’s natural language input is first refined by Gemini to improve clarity and specificity, and translated into English when multilingual support is needed.

PROMPT_REWRITE_QUERY = """Your main objective is to rewrite and refine the question in English.

Refine the given question to produce a queryable statement. The refined question should be self-contained, requiring no additional context for accurate PromQL generation.

The refined question should maintain the original question structure sentence by sentence.
Make sure all the information is included in the re-written question. Identify and include the label and its value to filter or join on based on the original question.

Below is the provided question:

{config_system} Query | {user_query}

Rewrite the question to:"""
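
For context, here is a minimal sketch of the rewrite step (the rewrite_query function referenced later in generate_promql); it reuses the hypothetical client and model name from the metric-dictionary example above.

def rewrite_query(user_query, config_system):
    # Fill the rewrite prompt with the monitored system name and the raw user question
    prompt = PROMPT_REWRITE_QUERY.format(config_system=config_system, user_query=user_query)
    # Ask Gemini for a self-contained, English version of the question
    response = client.models.generate_content(model="gemini-2.0-flash", contents=prompt)
    return response.text.strip()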

Step 2.2 Contextual retrieval:

Gemini analyzes the enhanced query to break down the tasks. It then uses Retrieval Augmented Generation (RAG) to search two specialized knowledge bases: the metric dictionary, to find candidate metrics, and the best-practice pool, for related known-good examples. This step gives the AI crucial, real-world domain context. In the examples below, we use BigQuery ML and Vector Search to create the knowledge bases.

def generate_and_store_metric_embeddings(client, dataset, destination_table_name, embedding_model_name, ref_table_name):
 client.query_and_wait(
     f"""CREATE OR REPLACE TABLE `{dataset}.{destination_table_name}` AS
         SELECT * FROM ML.GENERATE_EMBEDDING(
           MODEL `{dataset}.{embedding_model_name}`,
           (SELECT metric_name, system, metric_details, metric_description as content FROM `{dataset}.{ref_table_name}`),
           STRUCT('RETRIEVAL_DOCUMENT' as task_type)
         );"""
 )
...

def retrieve_matches(bq_client, query, dataset, emb_table_name, embedding_model_name, config_system=None, topk=5, similarity_threshold=0.3):
 ...
   results = bq_client.query_and_wait(
       f"""SELECT query.query, distance, base.metric_name, base.system, base.metric_details, base.content FROM VECTOR_SEARCH(
             (SELECT * FROM `{dataset}.{emb_table_name}`),
             'ml_generate_embedding_result',
             (
             SELECT ml_generate_embedding_result, content AS query
             FROM ML.GENERATE_EMBEDDING(
             MODEL `{dataset}.{embedding_model_name}`,
             (SELECT '{query}' AS content),
             STRUCT('RETRIEVAL_QUERY' as task_type)
             )
             ),
             top_k => {topk},
             distance_type=>'COSINE')
             WHERE 1-distance > {similarity_threshold};"""
   ).to_dataframe()
 return results
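
To show how these retrieval helpers plug into the generation step, here is a hedged sketch of identify_key_metrics (referenced later in generate_promql); retrieve_relevant_examples works the same way against the best-practice pool. The BigQuery client (bq_client), dataset, and table names are placeholders.

def identify_key_metrics(rewritten_user_query, config_system, topk=5):
    # Vector-search the metric dictionary for candidate metrics relevant to the question
    matches = retrieve_matches(
        bq_client, rewritten_user_query, "<your-dataset>",
        "metric_dictionary_embeddings", "text_embedding_model",
        config_system=config_system, topk=topk,
    )
    # Format each candidate metric as a text block for the generation prompt
    return [
        f"metric: {row.metric_name} ({row.system})\ndetails: {row.metric_details}\ndescription: {row.content}"
        for row in matches.itertuples()
    ]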

Step 2.3 Generation & evaluation:

Gemini generates the final PromQL query based on the enhanced query and the retrieved context. We can also use Gemini to evaluate multiple generated candidates and select the best one.

PROMPT_DRAFT_PROMQL_WITH_EXAMPLES = """You are a Prometheus Query Language (PromQL) guru. Your task is to write a PromQL query that answers the following question while using the provided context.

<Metrics Schema>
{key_metrics_content_str}
</Metrics Schema>

<Examples>
{examples_content_str}
</Examples>

<Guidelines>
- Think step by step strictly following the USER QUESTION. Make sure all the information is analyzed thoroughly.
- Find the right metric to use from the Metrics Schema based on the USER QUESTION.
- Find the labels to filter on based on the requirements. Use the labels the user specifies in the USER QUESTION even if they do not exist in the metric label list. Pay close attention to the instanceId provided in the user question.
- If the user question mentions relating or joining metrics, explicitly identify the label and use it for joining. Do not omit the join operation. For example, if the user asks to relate metrics A and B along the dimension X, the PromQL should include an `on(X)` clause or similar join mechanism.
- Offset extraction: Check whether there is an explicit offset in the question, for example "yesterday" or "one day ago".
- Interval extraction: Use rate-of-change calculation ONLY when the user specifies an interval in the USER QUESTION.
- Distinguish the rate and irate functions: Use the irate function to calculate an instant rate of change, for example QPS or queries per second. Use the rate function to calculate an average rate of change when the user specifies an interval. Use an interval value of 1m by default when not specified in the user query.
- Return value: MUST return ONLY one time series and ALWAYS use aggregation to safely satisfy the requirement.
- Redis master instance filtering as requested by the user: Consider how to use the absence of the `redis_master_link_up` metric to identify the master, often with an `unless on(instance) redis_master_link_up` clause.
</Guidelines>

<USER QUESTION>
{user_query}
</USER QUESTION>

Generate PromQL for USER QUESTION :"""
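
Below is a minimal sketch of generate_promql_with_metrics, the function invoked in the pipeline that follows; it fills this prompt and returns Gemini's draft PromQL (client and model name as assumed earlier).

def generate_promql_with_metrics(user_query, key_metrics_content_str, examples_content_str):
    # Combine the user question with the retrieved metrics and known-good examples
    prompt = PROMPT_DRAFT_PROMQL_WITH_EXAMPLES.format(
        key_metrics_content_str=key_metrics_content_str,
        examples_content_str=examples_content_str,
        user_query=user_query,
    )
    response = client.models.generate_content(model="gemini-2.0-flash", contents=prompt)
    return response.text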


def generate_promql(user_query, config_system):
# rewrite user query especially for non-English cases
 rewritten_user_query = rewrite_query(user_query, config_system)


# analyze user query and retrieve relevant metrics from knowledge base
 key_metrics_content = identify_key_metrics(rewritten_user_query, config_system)
 key_metrics_content_str = "\n\n".join(key_metrics_content)


# retrieve relevant examples from the other knowledge base
 examples_content = retrieve_relevant_examples(rewritten_user_query, config_system)
 examples_content_str = "\n\n".join(examples_content)


 response_completed = False
 while not response_completed:
# generate promql including the retrieved information above
   response_text = generate_promql_with_metrics(rewritten_user_query, key_metrics_content_str, examples_content_str)
   ...
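
A hypothetical invocation of the pipeline could look like this (the question and system name are illustrative):

# promql = generate_promql("Show the average QPS of the redis cluster over the last 5 minutes", "redis")
# print(promql)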

Case study #2: Automated big data log analysis

The challenge: The operational drag of big data log analysis

For one customer’s operations team, manually analyzing massive, complex application logs to debug failures was a major bottleneck. Sifting through unstructured files, often gigabytes in size, was slow and inefficient. Traditional keyword searches missed crucial context, especially for subtle, performance-degrading errors hidden within successful jobs. The goal was to automate this process, transforming reactive log analysis into a proactive, intelligent operation.

The solution: An LLM-Powered log analysis framework


LLM-Powered log analysis solution architecture

A comprehensive log analysis solution was built using Gemini to automate log ingestion, analysis, and troubleshooting. The framework’s key features include:

  • Intelligent Log Handling: The system tackles massive log files by employing a smart splitting strategy, breaking them into smaller, contextually relevant chunks for granular analysis.
  • Multimodal Analysis Pipeline: At its core, a multi-stage pipeline driven by Gemini’s advanced reasoning ingests logs and even screenshots of errors. It then automatically extracts, classifies, and analyzes the information to diagnose the root cause.
  • Grounded and Reliable Suggestions: To ensure accuracy, the analysis is grounded against both Google Search and an internal knowledge base via Vertex AI Search. This provides real-world context and mitigates model hallucinations, producing reliable troubleshooting advice.
  • Aggregated Summaries: The system aggregates its findings into a high-level summary, giving engineers a quick overview of all error types and their frequencies to prioritize actions.

Key step 1: Log preprocessing (handling large-scale logs)

In real-world operational scenarios, log files can easily reach gigabytes in size, far exceeding the token limits of any large model. Therefore, intelligent preprocessing is the cornerstone that makes this entire solution viable.

Conceptual approach & pseudocode:

The preprocessing logic follows a two-step approach designed to intelligently reduce the log size while preserving crucial error information; a sketch of this logic follows the list below.

  1. Filter Successful Tasks: The first and most critical optimization is to parse the log and remove all task attempts that completed successfully. This dramatically reduces the volume of data by focusing only on the failures, which are the primary interest for analysis.
  2. Split if Necessary: After filtering, the system checks the size of the condensed log. If it still exceeds the model’s token limit, a splitting logic is applied. This logic breaks the file into smaller chunks at the task level, ensuring that each chunk remains a valid, structurally consistent JSON object that can be analyzed independently.
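
Below is a hedged Python sketch of this two-step logic. The record structure (a list of task-attempt entries with a status field) and the token estimator are illustrative assumptions, not the exact schema of the production logs.

def preprocess_log(task_attempts, max_tokens, estimate_tokens):
    """Sketch: drop successful attempts, then split the remainder into model-sized chunks."""
    # Step 1: keep only failed task attempts, which dramatically shrinks the log
    failed = [a for a in task_attempts if a.get("status") != "SUCCEEDED"]

    # Step 2: split at the task level so that every chunk stays under the token budget
    chunks, current, current_tokens = [], [], 0
    for attempt in failed:
        tokens = estimate_tokens(attempt)
        if current and current_tokens + tokens > max_tokens:
            chunks.append(current)
            current, current_tokens = [], 0
        current.append(attempt)
        current_tokens += tokens
    if current:
        chunks.append(current)
    return chunks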

Key step 2: Dual analysis model and structured prompts

We use two parallel analysis modes to meet operational needs at different levels.

2.1 Aggregated analysis

The goal here is to quickly get a statistical overview of the most common error types and their frequencies in the current application.

# A prompt specifically for high-level statistical analysis

aggregated_analysis_prompt = """
<Role>You are a data analysis expert.</Role>
<Task>
Conduct a statistical analysis of all failed tasks in the input YARN application log.
</Task>
<Guidelines>
1. Identify all failed Tez task attempts.
2. Classify the errors based on the `taskAttemptErrorEnum` field.
3. Count the frequency of each error type.
4. Summarize the analysis results.
</Guidelines>

<OutputFormat>
Please strictly follow the JSON format below:
```json
[
  {{
    "Error type": "The classification of the error",
    "Frequency": {{ "count": 1 }},
    "IDs": ["A list of all attempt IDs belonging to this error type"]
  }}
]
```
</OutputFormat>
<LogInput>
{log_data}
</LogInput>
"""

2.2 In-depth diagnosis

The goal here is to provide a detailed root-cause analysis and actionable solutions for specific errors by combining an internal knowledge base (RAG) with external search (tool calling).

# A structured prompt specifically for in-depth diagnosis
in_depth_diagnosis_prompt = """
<Role>
You are a big data platform operations expert, proficient in platforms like Hadoop and Tez.
</Role>

<Context>
Your job is to analyze YARN application logs and provide professional diagnostics and recommendations. Note that the application uses a managed Tez service provided by {{platform}}, and the user does not have direct access to the underlying configuration.
</Context>

<Guidelines>
1.  First, perform an application-level analysis, extracting the final status and diagnostic information of the YARN application.
2.  Second, dive into the task level, extracting the `diagnostics` information from all failed `task attempts`.
3.  Then, **aggregate `task attempts` that have the exact same error message** and conduct an in-depth analysis for each unique error to explain its root cause.
4.  Finally, for each error, provide actionable troubleshooting advice that a user can perform within the managed {{platform}} environment.
5.  **Ensure all analysis is based on the input log and the output is in English.**
</Guidelines>

<OutputFormat>
Please strictly follow the JSON format below, without adding any extra explanations:
```json
{{
 "Application Analysis": "A summary analysis of the overall application run status",
 "Application Recommendations": "High-level troubleshooting recommendations based on the application analysis",
 "Task Errors": [
  {{
   "Error Type": "The error type inferred from taskAttemptErrorEnum or diagnostics",
   "Error Message": "The complete, unmodified diagnostic information extracted from the task attempt",
   "Error Analysis": "An in-depth explanation of the root cause for this class of error",
   "Recommendations": "Specific, actionable troubleshooting advice for this error",
   "TaskList": ["A list of all task attempt IDs that experienced this error"]
  }}
 ]
}}
```
</OutputFormat>
<LogInput>
{log_data}
</LogInput>
"""

2.3 RAG & tool calling

from vertexai.preview.generative_models import GenerativeModel, Tool, grounding

# --- Knowledge Base (RAG) Tool ---

DATA_STORE_ID = "<your-datastore-id>"
knowledge_base_tool = Tool.from_retrieval(
    grounding.Retrieval(
        grounding.VertexAISearch(datastore=DATA_STORE_ID)
    )
)

# --- External Search (Tool Calling) Tool ---

google_search_tool = Tool.from_google_search_retrieval(
    grounding.GoogleSearchRetrieval()
)

# You can pass both tools simultaneously when calling the model
# (the model name below is illustrative; `prompt` would be the in-depth diagnosis prompt filled with log data)

model = GenerativeModel("gemini-1.5-pro-001")
response = model.generate_content(
    prompt,
    tools=[knowledge_base_tool, google_search_tool]
)

Key step 3 (extended capability): Multimodal error analysis

Often, a simple screenshot of an error is more intuitive than lengthy logs. Gemini’s powerful multimodal capabilities allow our system to directly “read” and understand these screenshots.

from vertexai.preview.generative_models import GenerativeModel, Part

# --- Step 1: Define the prompt for multimodal analysis ---

multimodal_prompt = """
<Role>You are a big data platform operations expert.</Role>
<Task>
Please analyze the following error screenshot and provide a professional diagnosis and recommendations.
</Task>
<Context>
The application uses a managed Tez service provided by AWS EMR, and the user does not have direct access to the underlying configuration.
</Context>
<OutputFormat>
Please output in JSON format, including "Error Type", "Error Message", "Error Analysis", and "Recommendations".
</OutputFormat>
"""

# --- Step 2: Define the function to call the model ---

def analyze_error_screenshot(gcs_image_uri):
    """Analyzes an error screenshot on GCS using Gemini's multimodal capabilities."""
    model = GenerativeModel("gemini-1.5-pro-001")
    image_part = Part.from_uri(gcs_image_uri, mime_type="image/png")

    response = model.generate_content(
        [multimodal_prompt, image_part],
        generation_config={"response_mime_type": "application/json"}
    )
    return response.text

# Example:
# screenshot_uri = "gs://your-bucket/error_screenshot.png"
# analysis_result = analyze_error_screenshot(screenshot_uri)
# print(analysis_result)

Why SmartOps: Broadening access, deepening insight

  • Freed Up Experts, Empowered Everyone: It automated tedious manual tasks like writing queries and debugging logs. This freed senior engineers for strategic work while empowering business users with instant, self-serve dashboards and giving junior engineers AI-driven expert guidance.
  • A Foundational Shift to Proactivity: The impact transcended simple automation. Instead of waiting for problems to arise or requests to come in, the organization can now proactively identify hidden system inefficiencies and provide immediate access to data insights.
  • A Proven Blueprint for Future Automation: These successes created a reliable and scalable pattern for applying generative AI to operational challenges. They served as a foundational blueprint for building a truly intelligent SmartOps capability across the entire enterprise.

Ready to modernize your IT Operations?

The efficiency gains and streamlined workflows you’ve seen are not theoretical—they are achievable results. By combining your operational knowledge with the power of generative AI, you can eliminate manual toil and unlock new levels of productivity. Here’s how to get started:

  • Build your foundation on Vertex AI, your end-to-end platform for deploying generative AI solutions like Gemini.
  • Create your knowledge base using Vertex AI Search to give your models the specific context of your operations.
  • Securely centralize your data and logs with Cloud Storage to feed your AI pipeline.

Don’t just read about the future of IT—build it. Start your first project on Google Cloud today.
