We regularly use a Remote Function in BigQuery to call an API endpoint and get the results straight back into BigQuery. This has been working very reliably for some months now, but overnight our jobs failed with the following error:
An internal error occurred and the request could not be completed. This is usually caused by a transient issue. Retrying the job with back-off as described in the BigQuery SLA should solve the problem: https://cloud.google.com/bigquery/sla. If the error continues to occur please contact support at https://cloud.google.com/support
Retrying the job yields the same results. Everything is in the europe-west2 region.
I can see that the API requests are being made to the endpoint (a Cloud Run function), however nothing ever comes back to BQ.
Are there any ongoing issues with network connectivity in this region affecting the service?
Not an AI generated response / This is a human responding.
A Cloud Run service runs code against the input data it receives, and in your case I am assuming the Cloud Run service is being invoked through a BQ remote function. What, if anything, can we tell about the nature of those Cloud Run invocations?
I’d be tempted to suggest that we instrument the Cloud Run code to create logging records. Perhaps log that it was invoked, log the input parameters or have the input serialized and saved to a GCS file. Let’s try and separate the notion that BQ is invoking Cloud Run from an abstract invocation of Cloud Run.
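As a rough illustration of the logging idea, here is a minimal Python sketch. It assumes the service receives the standard remote function JSON body (a `calls` array with one entry per row, plus fields such as `requestId` and `caller`). The function name `log_invocation` and the choice to log only the first three rows are my own assumptions, not something taken from your code.

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("remote_function")


def log_invocation(payload: dict) -> dict:
    """Log a summary of one remote function request and return that summary."""
    calls = payload.get("calls", [])
    summary = {
        "requestId": payload.get("requestId"),
        "caller": payload.get("caller"),
        "row_count": len(calls),
        # Keep only the first few rows so log entries stay small.
        "sample_calls": calls[:3],
    }
    # One JSON line per request; Cloud Run forwards stderr to Cloud Logging.
    logger.info(json.dumps(summary, default=str))
    return summary
```

Calling this at the very top of the request handler, before any other work, would at least tell you whether each invocation arrived and what it was given, independently of whatever BQ reports.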
I hear you say that you have traced the request as far as arriving at Cloud Run, but we haven't said what, if anything, then happens within Cloud Run. Does it crash, does it fail, does it return useful values? It could be that nothing comes back to BQ because the Cloud Run user code fails and simply doesn't return anything. I hear you say that it previously ran, so we should also check whether the code for your Cloud Run service was recently changed or redeployed. Since the Cloud Run code is also passed input data, there is the possibility that the input data has changed. Maybe you are now passing in values that previously weren't accounted for (e.g. zeros, nulls or empty strings).
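To make the "doesn't return anything" case concrete, here is a hedged Python sketch of a handler that always answers, even for unexpected input. It assumes the standard remote function contract: the request carries a `calls` array, and the response is either a `replies` array of the same length or an `errorMessage` with a non-200 status. `process_value` is a hypothetical stand-in for your real API call, and the 400 status is just one reasonable choice.

```python
import json


def process_value(value):
    """Hypothetical per-row logic; replace with the real API call."""
    if value is None or value == "":
        return None  # hand NULL back to BigQuery instead of raising
    return str(value).upper()


def handle_request(payload: dict):
    """Return (body, status) so that the caller always gets a response."""
    try:
        calls = payload["calls"]
        # One reply per input row, in the same order as the calls.
        replies = [process_value(args[0]) for args in calls]
        return json.dumps({"replies": replies}), 200
    except Exception as exc:
        # Surface the failure to BigQuery rather than returning nothing.
        return json.dumps({"errorMessage": f"{type(exc).__name__}: {exc}"}), 400
```

If the real code has a path where an exception escapes or the function falls through without a response body, that would be consistent with requests reaching Cloud Run while nothing useful comes back to BQ.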