Hi everyone,
I’m currently working on a setup using Generative Agents/Playbooks in the Dialogflow CX console, and I’ve run into a challenge with automated evaluation.
The built-in Test Cases tool works well for flow-based agents, where we can assert specific intents or parameters. With generative responses, however, this “Golden Dataset” approach becomes tricky: even when the agent gives a correct and helpful answer, it rarely matches the “Expected Response” in a Test Case word-for-word, which produces false negatives.
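To make the false-negative problem concrete, here is a minimal, self-contained sketch (not tied to any Dialogflow API) contrasting exact-match assertion with a crude token-overlap score standing in for real semantic similarity. The `token_jaccard` helper and the 0.6 threshold are illustrative choices, not anything from the Test Cases tool:

```python
def token_jaccard(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets: 1.0 means identical vocabulary."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

golden = "You can return the item within 30 days for a full refund."
generated = "Within 30 days you can return the item and get a full refund."

# Exact match fails even though the generated answer is correct...
exact_match = golden == generated              # False -> false negative
# ...while an overlap-based score (threshold chosen for illustration) passes.
fuzzy_pass = token_jaccard(golden, generated) >= 0.6
```

In a real pipeline you would replace `token_jaccard` with embedding-based cosine similarity or an LLM judge, but the asymmetry above is exactly why word-for-word Test Cases flag good generative answers as failures.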
I’m looking for insights on how you handle automated evaluation at scale for low-code/generative bots. Specifically:
- LLM-as-a-judge: Have any of you integrated external scripts (using Vertex AI/Gemini) to compare the agent’s output against a Golden Dataset using semantic similarity or rubric-based scoring?
- Continuous Integration: How are you triggering these evaluations? Are you exporting the agent and running tests via the API, or using the “Experimentation” features in the console?
- Metrics: Besides ROUGE or BLEU, which metrics have you found most reliable for “groundedness” and “helpfulness” in a CX context?
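For context on what I mean by rubric-based LLM-as-a-judge, here is the shape of what I’ve been prototyping. Everything here is a hypothetical sketch: the `RUBRIC` text, `build_judge_prompt`, and `parse_verdict` are names I made up, and the actual model call (shown only as a comment) would go through the Vertex AI SDK:

```python
import json

# Hypothetical rubric covering the two dimensions I care about.
RUBRIC = """You are grading a chatbot answer against a golden reference.
Score each criterion from 1 (poor) to 5 (excellent):
- groundedness: the answer sticks to facts in the reference
- helpfulness: the answer actually resolves the user's request
Return ONLY a JSON object: {"groundedness": n, "helpfulness": n}"""

def build_judge_prompt(question: str, golden: str, candidate: str) -> str:
    """Assemble the judging prompt from rubric + test-case fields."""
    return (
        f"{RUBRIC}\n\n"
        f"User question: {question}\n"
        f"Golden reference: {golden}\n"
        f"Candidate answer: {candidate}\n"
    )

def parse_verdict(raw: str) -> dict:
    """Parse the judge's JSON reply and validate the expected keys."""
    scores = json.loads(raw)
    if set(scores) != {"groundedness", "helpfulness"}:
        raise ValueError(f"unexpected judge output: {raw!r}")
    return scores

# The real call would look roughly like this (requires GCP credentials),
# so it is left as a comment here:
#   from vertexai.generative_models import GenerativeModel
#   raw = GenerativeModel("gemini-1.5-pro").generate_content(prompt).text
#   scores = parse_verdict(raw)
```

The appeal of this shape is that the prompt builder and verdict parser are deterministic and unit-testable in CI even without model access; only the judge call itself needs live credentials.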
I’d love to hear whether you’re sticking with manual Side-by-Side (SxS) testing or have built a custom pipeline to bridge the gap the current Test Cases tool leaves.
Looking forward to your thoughts!