Building service health observability framework on Google Cloud using Personalized Service Health (PSH)

Introduction

Personalized Service Health (PSH) in Google Cloud Platform (GCP) provides a tailored view of the health of services and resources that are relevant to your projects. This capability is crucial for organizations to proactively respond to incidents and maintain business continuity. This guide outlines how to enable PSH across your organization, aggregate events, derive insights, and set up critical alerts and communications.

1. Enabling Personalized Service Health (PSH) across your organization

PSH surfaces health information at the project level once the Service Health API (servicehealth.googleapis.com) is enabled for a project. However, for an organization-wide view, you'll primarily leverage Cloud Logging and Cloud Monitoring to aggregate and analyze these events. There isn't a single "enable PSH for the organization" button; instead, you need a strategy to centralize health data.

Key components:

  • Service Health API: This API provides programmatic access to service health information (a minimal usage sketch follows this list).
  • Cloud Logging: Collects all logs, including service health notifications.
  • Cloud Monitoring: Used for creating dashboards, alerts, and custom metrics based on health events.
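
As a quick illustration of the first component, here is a minimal Python sketch that lists events for a single project through the Service Health API. It assumes the API is enabled in the target project and that the v1 REST endpoint for listing project events (projects/*/locations/global/events) applies; the printed fields (title, category, state) are assumptions to verify against your actual responses.

import google.auth
from google.auth.transport.requests import AuthorizedSession

# Application Default Credentials with the cloud-platform scope.
credentials, _ = google.auth.default(scopes=["https://www.googleapis.com/auth/cloud-platform"])
session = AuthorizedSession(credentials)

project_id = "YOUR_PROJECT_ID"  # replace with a project that has the Service Health API enabled
url = f"https://servicehealth.googleapis.com/v1/projects/{project_id}/locations/global/events"

response = session.get(url)
response.raise_for_status()

# Field names below are assumptions; inspect the raw JSON if your responses differ.
for event in response.json().get("events", []):
    print(event.get("title"), "|", event.get("category"), "|", event.get("state"))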

The diagram below illustrates the observability framework proposed in this article.

This diagram shows the flow of information, starting from Personalized Service Health (PSH) events in individual GCP projects and ending with notifications sent to various communication channels.

Here’s a breakdown of the flow:

GCP organization: PSH events are generated within each project in your organization.

Centralized logging & analytics: Cloud Logging aggregates these events, and an organization-level log sink exports them to a central BigQuery dataset for analysis.

Monitoring & alerting: Logs-based metrics in Cloud Monitoring are created from these log entries as they flow through Cloud Logging. Alerting policies are then configured to trigger when these metrics meet specific criteria.

Communication channels: When an alert is triggered, notifications are sent through various channels like email, SMS, or PagerDuty.

Automated ChatOps: For more advanced automation, alerts can be published to a Pub/Sub topic, which then triggers a Cloud Function to send customized messages to chat platforms like Slack or Google Chat.

2. Aggregating events at the organization level

To get a unified view, you'll centralize logs from all projects using an organization-level log sink.

1. Create a log sink at the organization level:

This sink will export logs from all projects within your organization to a central Cloud Storage bucket, BigQuery dataset, or Pub/Sub topic. For analytical purposes, BigQuery is often preferred.

Using gcloud:

gcloud logging sinks create organization-health-sink \
    bigquery.googleapis.com/projects/YOUR_CENTRAL_PROJECT_ID/datasets/organization_health_logs \
    --organization=YOUR_ORGANIZATION_ID \
    --include-children \
    --log-filter='resource.type="project" AND (protoPayload.serviceName="servicehealth.googleapis.com" OR textPayload:"Google Cloud incident" OR jsonPayload.type="com.google.cloud.servicehealth")'

  • Replace YOUR_ORGANIZATION_ID with your actual organization ID.

  • Replace YOUR_CENTRAL_PROJECT_ID with the project where your BigQuery dataset resides.

  • --include-children is crucial as it ensures logs from all projects within the organization are included.

The --log-filter is designed to capture relevant service health events. You might need to refine this filter based on the exact log entries for PSH events. PSH events often appear in project activity logs.
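
Before settling on a filter, it helps to inspect real entries. The following is a minimal sketch using the google-cloud-logging Python client that pulls a few recent entries matching the same assumed filter from a single project, so you can see what the payloads actually look like.

from google.cloud import logging as cloud_logging

client = cloud_logging.Client(project="YOUR_PROJECT_ID")

# Same (assumed) filter as the sink above; tighten or loosen it once you see real entries.
psh_filter = (
    'protoPayload.serviceName="servicehealth.googleapis.com" '
    'OR jsonPayload.type="com.google.cloud.servicehealth"'
)

# Inspect a handful of the most recent matching entries to learn their structure.
entries = client.list_entries(filter_=psh_filter, order_by=cloud_logging.DESCENDING)
for i, entry in enumerate(entries):
    if i >= 5:
        break
    print(entry.timestamp, entry.log_name)
    print(entry.payload)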

2. Grant permissions to the log sink service account:

The service account created for the sink needs permissions to write to the BigQuery dataset.

# Retrieve the sink's writer identity; for organization-level sinks it usually looks like:
# serviceAccount:o<YOUR_ORGANIZATION_ID>-<RANDOM_STRING>@gcp-sa-logging.iam.gserviceaccount.com
gcloud logging sinks describe organization-health-sink \
    --organization=YOUR_ORGANIZATION_ID \
    --format='value(writerIdentity)'

# Grant the BigQuery Data Editor role to that identity. Granting it on the central
# project is the simplest option; for tighter scoping, grant it on the dataset itself
# from the BigQuery console or with the bq CLI.
gcloud projects add-iam-policy-binding YOUR_CENTRAL_PROJECT_ID \
    --member='serviceAccount:o<YOUR_ORGANIZATION_ID>-<RANDOM_STRING>@gcp-sa-logging.iam.gserviceaccount.com' \
    --role='roles/bigquery.dataEditor'
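
Once the sink and permissions are in place, a quick sanity check is to count recent rows in the export dataset. Here is a minimal sketch using the google-cloud-bigquery Python client; YOUR_LOGS_TABLE_NAME is a placeholder for whichever table Cloud Logging creates in the dataset.

from google.cloud import bigquery

client = bigquery.Client(project="YOUR_CENTRAL_PROJECT_ID")

sql = """
    SELECT COUNT(*) AS entries_last_day
    FROM `YOUR_CENTRAL_PROJECT_ID.organization_health_logs.YOUR_LOGS_TABLE_NAME`
    WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
"""

# A non-zero count confirms the organization sink is writing into the dataset.
for row in client.query(sql).result():
    print(f"Log entries exported in the last 24 hours: {row.entries_last_day}")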

3. Deriving insights from the data

Once your service health events are aggregated in BigQuery, you can run powerful queries to gain insights.

Example BigQuery queries:

  • Events by folder/department: If you’ve organized your projects into folders representing departments, you can enrich your logs with folder information or infer it from project IDs if you have a naming convention. Assuming your log entries include project IDs, you can join with a mapping table (a join sketch follows the example queries below).

Using SQL

SELECT
    JSON_EXTRACT_SCALAR(jsonPayload, '$.incidentId') AS incident_id,
    JSON_EXTRACT_SCALAR(jsonPayload, '$.affectedResources[0].type') AS resource_type,
    JSON_EXTRACT_SCALAR(jsonPayload, '$.summary') AS summary,
    -- You would need a mapping table here to join project_id to folder/department
    -- For example: projects.project_id = project_id_to_folder_mapping.project_id
    -- And then select folder_name or department_name from the mapping table
    resource.labels.project_id AS project_id,
    timestamp
FROM
    `YOUR_CENTRAL_PROJECT_ID.organization_health_logs.YOUR_LOGS_TABLE_NAME`
WHERE
    JSON_EXTRACT_SCALAR(jsonPayload, '$.type') = 'com.google.cloud.servicehealth' -- Example filter for PSH events
ORDER BY
    timestamp DESC

Note: The exact structure of jsonPayload for PSH events might vary. You’ll need to inspect your actual log entries to refine the JSON_EXTRACT_SCALAR paths.

  • Events by application level: Similar to folders, you’ll need to either tag your projects with application metadata or maintain a mapping of project IDs to applications.
  • Top affected services/regions:
SELECT
    JSON_EXTRACT_SCALAR(jsonPayload, '$.affectedResources[0].type') AS affected_service,
    COUNT(DISTINCT JSON_EXTRACT_SCALAR(jsonPayload, '$.incidentId')) AS num_incidents
FROM
    `YOUR_CENTRAL_PROJECT_ID.organization_health_logs.YOUR_LOGS_TABLE_NAME`
WHERE
    JSON_EXTRACT_SCALAR(jsonPayload, '$.type') = 'com.google.cloud.servicehealth'
    AND timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY
    affected_service
ORDER BY
    num_incidents DESC
LIMIT 10
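
If you maintain a mapping table, the folder/department and application enrichment described above reduces to a join. The sketch below runs such a query with the google-cloud-bigquery Python client; project_to_department is a hypothetical table (project_id, department) that you would load yourself, and the jsonPayload paths remain assumptions to verify against your own logs.

from google.cloud import bigquery

client = bigquery.Client(project="YOUR_CENTRAL_PROJECT_ID")

# project_to_department is a hypothetical, manually maintained mapping table
# with columns project_id (STRING) and department (STRING).
sql = """
    SELECT
        m.department,
        COUNT(DISTINCT JSON_EXTRACT_SCALAR(l.jsonPayload, '$.incidentId')) AS num_incidents
    FROM `YOUR_CENTRAL_PROJECT_ID.organization_health_logs.YOUR_LOGS_TABLE_NAME` AS l
    JOIN `YOUR_CENTRAL_PROJECT_ID.organization_health_logs.project_to_department` AS m
        ON l.resource.labels.project_id = m.project_id
    WHERE JSON_EXTRACT_SCALAR(l.jsonPayload, '$.type') = 'com.google.cloud.servicehealth'
    GROUP BY m.department
    ORDER BY num_incidents DESC
"""

for row in client.query(sql).result():
    print(f"{row.department}: {row.num_incidents} incidents")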

4. Setting up critical alerts for events

Utilize Cloud Monitoring to create logs-based metrics and alerting policies based on the service health events flowing through Cloud Logging. The BigQuery sink remains your analytical store; alerting is driven directly from the log stream.

1. Create a logs-based metric (if needed):

Logs-based metrics are created from log entries as they pass through Cloud Logging, independently of the BigQuery sink. For PSH events, you’ll typically be looking for specific log entries that indicate an incident.

Example: Creating a logs-based metric for “Critical PSH Incident”

  • Go to Logs Explorer (under Cloud Logging).
  • Enter a query to filter for critical PSH events (e.g., jsonPayload.severity="CRITICAL" AND jsonPayload.type="com.google.cloud.servicehealth").
  • Use “Create metric” to turn the query into a counter logs-based metric.

Give it a name (e.g., critical_psh_incidents) and a description.
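
If you prefer to manage the metric as code rather than clicking through the console, here is a minimal sketch using the google-cloud-logging Python client; the filter mirrors the console example above and is still an assumption to validate against your actual PSH entries.

from google.cloud import logging as cloud_logging

client = cloud_logging.Client(project="YOUR_CENTRAL_PROJECT_ID")

# Same (assumed) filter as in the console example above.
metric = client.metric(
    "critical_psh_incidents",
    filter_='jsonPayload.severity="CRITICAL" AND jsonPayload.type="com.google.cloud.servicehealth"',
    description="Counts critical Personalized Service Health incident log entries",
)

if not metric.exists():
    metric.create()
    print("Logs-based metric critical_psh_incidents created.")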

2. Create an alerting policy:

  • Go to Cloud Monitoring > Alerting.
  • Click “Create Policy.”
  • Select a Metric: Search for the custom logs-based metric you just created (e.g., logging/user/critical_psh_incidents).
  • Configure Alert Trigger: Set the threshold. For example, “Any value is above 0” for 1 minute, meaning any critical PSH incident will trigger an alert.

  • Configure Notification Channels: Select or create notification channels for email, SMS, PagerDuty, Slack, Pub/Sub, etc.
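
For teams that manage alerting as code, the same policy can be sketched with the Cloud Monitoring client library. This is a minimal example under a few assumptions: the logs-based metric from step 1 exists, its monitored resource type is global (check Metrics Explorer for the real value), and notification channels are attached separately.

from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

project_id = "YOUR_CENTRAL_PROJECT_ID"
client = monitoring_v3.AlertPolicyServiceClient()

condition = monitoring_v3.AlertPolicy.Condition(
    display_name="Any critical PSH incident",
    condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
        # User-defined logs-based metrics are exposed as logging.googleapis.com/user/<name>;
        # the resource.type below is an assumption to confirm in Metrics Explorer.
        filter='metric.type="logging.googleapis.com/user/critical_psh_incidents" AND resource.type="global"',
        comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
        threshold_value=0,
        duration=duration_pb2.Duration(seconds=60),
        aggregations=[
            monitoring_v3.Aggregation(
                alignment_period=duration_pb2.Duration(seconds=300),
                per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_SUM,
            )
        ],
    ),
)

policy = monitoring_v3.AlertPolicy(
    display_name="Critical Personalized Service Health incidents",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[condition],
    # Attach notification channel resource names here once they exist.
)

created = client.create_alert_policy(name=f"projects/{project_id}", alert_policy=policy)
print(f"Created alerting policy: {created.name}")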

3. Example notification channel setup:

  • Email: your-email@example.com
  • Slack: Integrate with your Slack workspace via webhooks.
  • Pub/Sub: Create a Pub/Sub topic where your custom applications can subscribe and process alerts (e.g., for automated remediation or more sophisticated communication workflows).
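
As an example of the Pub/Sub option, the sketch below creates a topic and a pull subscription with the google-cloud-pubsub Python library; the topic and subscription names are placeholders. The topic can then be selected when you configure a Pub/Sub notification channel in Cloud Monitoring.

from google.cloud import pubsub_v1

project_id = "YOUR_CENTRAL_PROJECT_ID"
topic_id = "psh-alert-notifications"        # placeholder name
subscription_id = "psh-alert-processor"     # placeholder name

publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()

topic_path = publisher.topic_path(project_id, topic_id)
subscription_path = subscriber.subscription_path(project_id, subscription_id)

# Create the topic that the Cloud Monitoring Pub/Sub notification channel will publish to.
publisher.create_topic(request={"name": topic_path})

# Create a pull subscription for custom consumers (e.g. a remediation or ChatOps service).
subscriber.create_subscription(request={"name": subscription_path, "topic": topic_path})

print(f"Created {topic_path} and {subscription_path}")

Note that Cloud Monitoring’s notification service account also needs permission to publish to this topic (roles/pubsub.publisher), which you can grant on the topic once the channel is created.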

5. Ongoing communications via chat, email, SMS, etc.

Automating communications is vital for rapid response.

  • Email and SMS: Handled directly by Cloud Monitoring’s notification channels.
  • Chat (e.g., Slack, Google Chat):
    • Cloud Monitoring Webhooks: Configure a Cloud Monitoring notification channel to send alerts to a Slack or Google Chat webhook URL.
    • Pub/Sub + Cloud Functions: For more customized messages or integration with multiple chat platforms, set up a Pub/Sub topic as an alert destination. A Cloud Function can subscribe to this topic, parse the alert payload, and then send formatted messages to various chat platforms via their APIs.
  • Example Cloud Function (Python) for Slack:
import base64
import json
import os
import requests

def send_slack_notification(event, context):
    """Background Cloud Function: forwards Cloud Monitoring alerts from Pub/Sub to Slack."""
    # Decode the Pub/Sub message published by the Cloud Monitoring notification channel.
    pubsub_message = base64.b64decode(event['data']).decode('utf-8')
    alert_payload = json.loads(pubsub_message)

    # Field names follow the Cloud Monitoring incident payload; inspect a real
    # message (see the note below) and adjust these lookups as needed.
    incident = alert_payload.get('incident', {})
    incident_id = incident.get('incident_id', 'unknown')
    summary = incident.get('summary', 'No summary provided')
    state = incident.get('state', 'unknown')
    resource_name = incident.get('resource_name', 'unknown')

    slack_webhook_url = os.environ.get('SLACK_WEBHOOK_URL')

    if not slack_webhook_url:
        print("SLACK_WEBHOOK_URL environment variable not set.")
        return

    message = {
        "text": f"GCP Service Health Alert - Incident ID: {incident_id}\n"
                f"Summary: {summary}\n"
                f"State: {state}\n"
                f"Affected Resource: {resource_name}"
    }

    try:
        response = requests.post(slack_webhook_url, json=message)
        response.raise_for_status()
        print(f"Slack notification sent successfully: {response.status_code}")
    except requests.exceptions.RequestException as e:
        print(f"Error sending Slack notification: {e}")

# To deploy this function:
# gcloud functions deploy send_slack_notification \
#   --runtime python39 \
#   --trigger-topic YOUR_ALERT_PUBSUB_TOPIC \
#   --set-env-vars SLACK_WEBHOOK_URL=YOUR_SLACK_WEBHOOK_URL
  • Remember to replace YOUR_ALERT_PUBSUB_TOPIC and YOUR_SLACK_WEBHOOK_URL.
  • The alert_payload structure will depend on the specific format Cloud Monitoring sends to Pub/Sub. You’ll need to inspect it.

6. Gaining Org-Level Insights with Looker Dashboards

While BigQuery is excellent for in-depth analysis and alerting, a business intelligence (BI) platform like Looker can provide a high-level, visual overview of your organization’s service health. By connecting Looker to your centralized BigQuery dataset, you can create interactive dashboards that are accessible to a wider audience, from engineering leads to executive stakeholders.

1. How to Set Up Looker Dashboards

  1. Connect Looker to BigQuery:
  • In the Looker admin panel, create a new database connection.
  • Select “Google BigQuery Standard SQL.”
  • Provide the necessary connection details, including the project ID, dataset, and a service account with the required BigQuery permissions.
  2. Create a LookML Model:
  • Looker uses a modeling layer called LookML to define dimensions and measures for your data.
  • Create a new LookML project and generate a model from your BigQuery table.
  • Define dimensions for fields like resource.labels.project_id, jsonPayload.category, jsonPayload.service_name, and jsonPayload.severity.
  • Create measures to count the number of events, distinct impacted projects, etc.
  3. Build Your Dashboard:
  • Once your LookML model is in place, you can start building your dashboard by adding various visualizations (known as “Looks”).

2. Example Looker Dashboard Visualizations

Here are some examples of insightful visualizations you can create:

  • Org-wide Service Health Overview (Scorecard):
    • Display key metrics like “Total Active Incidents,” “Impacted Projects,” and “P1 Events (Last 24h)” as large, easy-to-read numbers.
  • Impacted Project Hotspots (Map/Table):
    • Create a table or a map visualization that shows the projects with the highest number of service health events. This can help you identify which teams or applications are most affected.
  • Event Category Trend Analysis (Line Chart):
    • Plot the number of events over time, broken down by category (e.g., INCIDENT, PROBLEM, MERGED_INCIDENT). This can help you identify trends and patterns in service health.
  • Top 10 Affected Services (Bar Chart):
    • Display the services with the most incidents, giving you a clear picture of where to focus your reliability engineering efforts.

By leveraging Looker, you can transform your raw service health data into actionable insights, providing a clear and comprehensive view of your organization’s service health posture.

Conclusion

Ultimately, building a robust service health observability framework on Google Cloud is achieved by combining centralized data, proactive alerting, and strategic visualization. By aggregating service health events into BigQuery and configuring critical alerts with Cloud Monitoring, an organization establishes a proactive posture against incidents. Extending this framework with tools like Looker for organization-wide dashboards transforms raw data into actionable insights. This comprehensive strategy enables organizations to respond to incidents more effectively, minimize business impact, and leverage data for long-term service improvement and reliability.
