Unexpected Data Loss When Processing Multiple PDFs with Gemini 1.5 Pro

Hi everyone,

I’m currently working on a project where users can upload multiple PDF files (CVs) to be analyzed by Gemini 1.5 Pro. The use case involves processing 30-50 PDFs at once and generating a structured table summarizing key candidate skills.

I’m using the GenerativeModel API in Vertex AI and passing the files as Part.from_uri() objects (hosted on GCS). Here’s a simplified version of how I handle the attachments:

content_list = self.geminihelper.setup_attachment(uri=message.attachment) #list[Part]
prompt = [self.get_message_content(message)] + content_list #list[Part] + list[Part]
response = chat.send_message(prompt, generation_config=generation_config)

The Issue

While the model correctly receives all the files (confirmed via logging), the generated table never includes all candidates. Instead of 50 entries, I might get 30-40, and the missing ones vary randomly between requests. It’s not a simple truncation of the last files; different documents are omitted in different runs.

What I’ve Investigated

  1. Token Limit:
    • I’m aware of the 1M token limit in Gemini 1.5 Pro. However, the issue occurs even when my estimated token usage is well below this threshold.
    • It’s unclear if Part.from_uri() applies any preprocessing or truncation before passing data to the model.
  2. Attention Distribution:
    • Some LLMs don’t distribute attention uniformly across all inputs. Could Gemini 1.5 Pro be prioritizing certain files and “forgetting” others?
  3. Vertex AI Preprocessing?
    • Is there any internal mechanism in Vertex AI that filters or limits the number of files when processing Part.from_uri()?
    • Would sending raw text (instead of PDFs as Part.from_uri()) help mitigate this issue? [Note: this is a solution I’d rather not choose if possible, because it would be an annoying workaround]
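To rule out point 1 concretely, it helps to measure rather than estimate. Below is a minimal sketch: a rough local heuristic (~4 characters per token for English text, an approximation only) that runs anywhere, plus the exact check via the Vertex AI SDK's `count_tokens`, shown commented out because it needs configured credentials. The bucket paths are placeholders, not real URIs:

```python
# Rough sanity check: estimate tokens locally before calling the API.
# ~4 characters per token is a common rule of thumb for English text;
# count_tokens (commented below) gives the authoritative number.

def rough_token_estimate(texts: list[str]) -> int:
    """Approximate total tokens for a list of extracted PDF texts."""
    return sum(len(t) for t in texts) // 4

# Example: 50 CVs of ~6,000 characters each stays far below the 1M limit.
cvs = ["x" * 6000] * 50
print(rough_token_estimate(cvs))  # 75000

# Exact count via Vertex AI (requires credentials and real GCS URIs):
# from vertexai.generative_models import GenerativeModel, Part
# model = GenerativeModel("gemini-1.5-pro")
# parts = [Part.from_uri(u, mime_type="application/pdf") for u in pdf_uris]
# print(model.count_tokens(parts).total_tokens)
```

If the exact count is well under the limit and documents are still dropped, the problem is more likely attention-related than a hard truncation.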

My Questions

  • Has anyone successfully processed a large batch of PDFs with Gemini 1.5 Pro without losing data?
  • Does Part.from_uri() have any undocumented limits or behavior that might explain this issue?
  • Any recommendations for ensuring all documents are fully processed and included in the response?

Any insights or suggestions would be greatly appreciated!

Thanks in advance,
Matteo (AI Engineer)

Hi matteoem,

Welcome to the Google Cloud Community!

It seems you’re encountering an issue where not all PDFs are included in the response when processed by Gemini 1.5 Pro. This could be related to token limits, attention distribution, or how the data is being sent. Here are some potential solutions to help ensure all documents are fully processed:

  • Chunking and Batching: Instead of sending all 50 PDFs in a single request, try breaking them into smaller batches (e.g., 10-15 PDFs per request) and then aggregating the results into a final table. This approach can help mitigate attention drift and context dilution. Alternatively, if you need to process all documents at once, consider chunking the individual PDFs into smaller, more manageable sections.
  • Explicit Text Extraction and Preprocessing: Although you’d prefer to avoid it, extracting text from the PDFs and sending it as raw text may be the most reliable solution. This gives you better control over the input and avoids issues with Part.from_uri(). If you extract the text, consider adding metadata, such as the original filename or page number, to help the model maintain context. You could also summarize each PDF individually and then provide those summaries to the LLM to generate the final table, reducing the amount of information the model has to process at once.
  • Prompt Engineering: You may experiment with different prompt structures to guide the model’s attention, such as instructing it to process each file sequentially and ensure all candidates are included in the table. You could also add a ‘checklist’ to the prompt, asking the model to confirm that it has processed each file.
  • Batch Prediction API: If you’re not already using it, the Batch Prediction API in Vertex AI might be worth exploring. It is designed for processing large datasets and could offer better handling of multiple PDFs in a single request.
  • Audit Logging: To analyze the issue further, review your logs for file-loading failures and for errors reported by Gemini 1.5 Pro itself; these can reveal whether specific documents are failing to load rather than being ignored by the model.
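The batching suggestion above can be sketched as follows. The splitting and aggregation logic is plain, runnable Python; the actual `send_message` call is commented out since it requires a configured Vertex AI chat session, and the GCS URIs and `parse_table_rows` helper are hypothetical placeholders for your own code:

```python
from typing import Iterator

def chunked(items: list, size: int) -> Iterator[list]:
    """Yield successive fixed-size batches from a list."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Hypothetical GCS URIs for the uploaded CVs.
pdf_uris = [f"gs://my-bucket/cvs/candidate_{i:02d}.pdf" for i in range(50)]

all_rows = []
for batch in chunked(pdf_uris, 10):
    # parts = [Part.from_uri(u, mime_type="application/pdf") for u in batch]
    # response = chat.send_message([instruction_prompt] + parts,
    #                              generation_config=generation_config)
    # all_rows.extend(parse_table_rows(response.text))  # your own parser
    all_rows.extend(batch)  # placeholder so the loop runs as-is

# With 10 PDFs per request, each batch stays small enough that no
# document competes for attention, and the final table is the union
# of the per-batch tables.
print(len(all_rows))  # 50
```

Aggregating per-batch results also makes it trivial to verify completeness: assert that the number of output rows equals the number of input URIs, and retry any batch that comes back short.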

If the issue persists, consider reaching out to Google Cloud Support for assistance. They might be able to provide insights into any internal limitations or behavior that could be affecting your processing.

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.