Context Caching optimization & safety filter latency for massive-context workloads (gemini-experimental)

Hello,

I am a political science student researcher running massive-context text analysis and structured generation (screenwriting and asymmetric-warfare research) on Vertex AI. My daily workflow routinely loads 50k+ tokens of static system instructions, world-building bibles, and research material with every prompt.

Because my drafting process is highly iterative (4-6 hour hyperfocused sessions via a local LibreChat client running in a Fedora 42 XFCE AppVM under Qubes OS), I am looking to adopt Vertex's Context Caching feature to cut the redundant token payload and lower per-request latency.
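To quantify why this matters, here is my back-of-envelope arithmetic for one drafting session. All figures are assumptions drawn from my own workflow (iteration rate, session length), not measured values:

```python
# Rough estimate of the redundant token payload that Context Caching
# would eliminate. The constants below are assumptions about my own
# workflow, not benchmarks.

STATIC_PREFIX_TOKENS = 50_000   # system instructions + world-building bible
TURNS_PER_HOUR = 12             # assumed iteration rate while drafting
SESSION_HOURS = 5               # midpoint of a 4-6 hour session

turns = TURNS_PER_HOUR * SESSION_HOURS

# Without caching, the static prefix is re-sent with every request.
uncached_prefix_tokens = STATIC_PREFIX_TOKENS * turns

# With caching, it is uploaded once and referenced by cache ID thereafter.
cached_prefix_tokens = STATIC_PREFIX_TOKENS

print(f"turns per session:       {turns}")
print(f"prefix tokens, uncached: {uncached_prefix_tokens:,}")
print(f"prefix tokens, cached:   {cached_prefix_tokens:,}")
```

Even at this conservative iteration rate, the static prefix alone accounts for millions of input tokens per session, which is the payload I am hoping caching removes.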

Before I overhaul my local environment, I have two architectural questions:

  1. Does Context Caching currently support the gemini-experimental (3.1-Pro-Preview) models, or is it strictly limited to the stable releases? Are there any specific best practices for utilizing Context Caching on an individual/student account to avoid quota spikes?

  2. Will caching a massive static system prompt conflict with, or add unexpected latency to, Vertex’s native safety-attribute filtering? I have built strict prompt-based safety guardrails into my own system instructions (to mitigate specific AI-dependency risks for neurodivergent accessibility), and I want to ensure caching them doesn’t create a “double-filtering” bottleneck.
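For concreteness, the cache I intend to create would look roughly like the request body below. This is a sketch based on my reading of the Vertex AI `cachedContents` REST surface; the model ID, TTL, and field names are placeholders/assumptions, and I would welcome corrections if the schema has changed:

```json
{
  "model": "projects/MY_PROJECT/locations/us-central1/publishers/google/models/gemini-experimental",
  "displayName": "screenwriting-research-static-prefix",
  "systemInstruction": {
    "parts": [
      { "text": "…static system instructions, including my prompt-based safety guardrails…" }
    ]
  },
  "contents": [
    {
      "role": "user",
      "parts": [
        { "text": "…world-building bible and research material (~50k tokens)…" }
      ]
    }
  ],
  "ttl": "21600s"
}
```

The 21600s TTL is meant to cover a full 6-hour session in one cache lifetime; part of my quota question above is whether holding a cache that long on a student account is advisable.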

Thank you for your time, and for any documentation or architectural guidance you can provide.