As an ALM specialist working with Rovo and AI-driven workflows, I’ve been thinking about a growing challenge in the industry:
How are we supposed to properly verify and test AI agents at scale?
Traditional QA and ALM practices were built for deterministic systems.
But agents behave differently:

- reasoning is probabilistic
- outputs can vary from run to run
- the space of edge cases is effectively unbounded
- tool usage and memory introduce new failure points

One practical consequence: exact-match assertions stop working, as the sketch below illustrates.
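To make that concrete, here is a minimal sketch of one workaround: sample the agent several times and assert a pass-rate threshold instead of exact output equality. Everything in it (`run_agent`, `meets_requirement`, the 0.7 threshold, the refund scenario) is a hypothetical stand-in, not a Rovo API:

```python
import random

def run_agent(prompt: str) -> str:
    # Hypothetical stand-in for the agent under test; a real harness
    # would invoke the agent's API here instead.
    return random.choices(
        [
            "Refunds can be requested within 30 days via the billing page.",
            "Please contact support.",  # occasionally drops the policy detail
        ],
        weights=[9, 1],
    )[0]

def meets_requirement(output: str) -> bool:
    # Hypothetical domain check: the answer must mention the 30-day window.
    return "30 days" in output

def pass_rate(prompt: str, trials: int = 20) -> float:
    # A single run of a probabilistic agent proves little either way,
    # so sample N runs and measure how often the requirement is met.
    hits = sum(meets_requirement(run_agent(prompt)) for _ in range(trials))
    return hits / trials

if __name__ == "__main__":
    rate = pass_rate("How do I request a refund?")
    # Assert a statistical bound instead of exact-match equality.
    assert rate >= 0.7, f"pass rate {rate:.0%} below threshold"
    print(f"pass rate: {rate:.0%}")
```

The threshold and trial count are the interesting knobs: too few trials and the bound is noise, too strict a threshold and every flaky-but-acceptable behavior blocks the build.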
So I’m curious how the community is approaching this.
Some questions I’m exploring:

- How can we systematically test AI agents while covering realistic scenarios and edge cases?
- Can we build “QA agents” that evaluate and validate other agents? (A rough sketch follows this list.)
- Are there effective methods today for validating reasoning, workflow execution, and tool orchestration?
- Can we estimate confidence or correctness up front, for example flagging whether a response is likely production-safe or only “50% reliable”?
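On the “QA agent” and confidence questions together, the shape I keep coming back to is an LLM-as-judge pattern: a second agent scores the first one's output against a rubric and gates what ships. Here is a toy sketch under assumed names (`judge`, `gate`, `PRODUCTION_THRESHOLD` are all hypothetical), with a keyword heuristic standing in for what would really be a second model call:

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    score: float  # judge's 0-1 confidence that the answer is acceptable
    reason: str

def judge(question: str, answer: str) -> Verdict:
    # Hypothetical "QA agent": in practice this would be a second model call
    # scoring the answer against a rubric; a keyword check stands in here.
    if "30 days" in answer:
        return Verdict(score=0.9, reason="mentions the policy window")
    return Verdict(score=0.4, reason="missing the key policy detail")

PRODUCTION_THRESHOLD = 0.8  # assumed cut-off; would be tuned per workflow

def gate(question: str, answer: str) -> str:
    # Route low-confidence answers to human review instead of shipping them.
    verdict = judge(question, answer)
    return ("production-safe" if verdict.score >= PRODUCTION_THRESHOLD
            else "needs-review")

if __name__ == "__main__":
    print(gate("How do I request a refund?",
               "Refunds can be requested within 30 days via the billing page."))
    print(gate("How do I request a refund?", "Please contact support."))
```

The obvious open problem is that the judge is itself probabilistic, so its scores need calibration against human-labeled samples before the gate means anything.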
Would love to hear what others in the Rovo community are thinking about this.