Frequent 500 errors from Enterprise Document OCR

We’re piloting Document AI’s Enterprise Document OCR, and it’s mostly good, but we’re seeing a very high rate of 500 responses. With retries we’re able to get good results for just about everything, but this error rate doesn’t seem acceptable.

Is this normal for this service? (The image below shows 2 days of requests, grouped by status. Blue is OK; green is SERVER_ERROR.)

Details:

  • Processor type: Document OCR; region: us
  • Always 1 page at a time.
  • Using the google-cloud-document_ai rubygem v2.1.0, Google::Cloud\::DocumentAI::V1::DocumentProcessorService::Rest::Client (REST so we can use VCR cassettes in testing).
  • The error raised is “Google::Cloud\::InternalError: An error has occurred when making a REST request: An error occurred.” (backslash inserted to avoid emoji replacement)
  • Nested inside that is “Gapic::Rest::Error: An error has occurred when making a REST request: An error occurred.”
  • Nested inside that is “Faraday::ServerError: the server responded with status 500 for POST https://documentai.googleapis.com/v1/projects/PROJECTID/locations/us/processors/PROCESSORID:process?%24alt=json%3Benum-encoding%3Dint”
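
For context, the retry handling mentioned above could look roughly like the sketch below. This is not our exact code: `with_retries` and its parameters (`retry_on`, `max_attempts`, `base_delay`, `sleeper`) are hypothetical names chosen for illustration, and the backoff values are assumptions rather than anything recommended by Google.

```ruby
# Minimal retry-with-backoff sketch. `with_retries` is a hypothetical helper,
# not part of the google-cloud-document_ai gem.
#
# retry_on:     exception class (or array of classes) treated as transient
# max_attempts: total number of tries before giving up (assumed value)
# base_delay:   seconds to wait after the first failure (assumed value)
# sleeper:      callable used to wait; injectable so tests don't really sleep
def with_retries(retry_on:, max_attempts: 5, base_delay: 0.5, sleeper: ->(s) { sleep(s) })
  classes = Array(retry_on)
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue *classes
    # Out of attempts: re-raise the last error to the caller.
    raise if attempt >= max_attempts

    # Exponential backoff: base_delay, 2x, 4x, ... plus up to 25% random jitter.
    delay = base_delay * (2**(attempt - 1))
    sleeper.call(delay + rand * delay * 0.25)
    retry
  end
end

# Usage, assuming `client` is the REST client from the list above and
# `request` is a single-page process request:
#
#   response = with_retries(retry_on: Google::Cloud::InternalError) do
#     client.process_document(request)
#   end
```

Only the listed error classes are retried; anything else (for example an invalid-argument error) propagates immediately, so a genuinely bad request doesn't get retried.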

We saw pretty consistent results for another day beyond the 2 shown above. Then we sent practically no traffic over the weekend. We’re now three days into the following week, sending a similar amount of traffic, and have seen zero SERVER_ERROR this week. :person_shrugging:

We got a weekend + 4 business days error-free, then went back to an unhealthy rate of errors, though every request still eventually succeeds with enough retries. The error rate isn’t as bad as in the first couple of days, but it’s still very high for a production service.

At 30 days, here’s the error line. Two long bad patches. Mostly good times. :person_shrugging: