As an ALM specialist working with Rovo and AI-driven workflows, I’ve been thinking about a growing challenge in the industry:
How are we supposed to properly verify and test AI agents at scale?
Traditional QA and ALM practices were built for deterministic systems.
But agents behave differently:

- reasoning is probabilistic
- outputs can vary from run to run
- the space of edge cases is effectively unbounded
- tool usage and memory introduce new failure points

One practical consequence: exact-match assertions stop working, as the sketch below illustrates.
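To make that concrete, here is a minimal sketch of one workaround: sample the agent several times and assert a pass-rate threshold instead of exact output equality. Everything in it (`run_agent`, `meets_requirement`, the 0.7 threshold, the refund scenario) is a hypothetical stand-in, not a Rovo API:

```python
import random

def run_agent(prompt: str) -> str:
    # Hypothetical stand-in for the agent under test; a real harness
    # would invoke the agent's API here instead.
    return random.choices(
        [
            "Refunds can be requested within 30 days via the billing page.",
            "Please contact support.",  # occasionally drops the policy detail
        ],
        weights=[9, 1],
    )[0]

def meets_requirement(output: str) -> bool:
    # Hypothetical domain check: the answer must mention the 30-day window.
    return "30 days" in output

def pass_rate(prompt: str, trials: int = 20) -> float:
    # A single run of a probabilistic agent proves little either way,
    # so sample N runs and measure how often the requirement is met.
    hits = sum(meets_requirement(run_agent(prompt)) for _ in range(trials))
    return hits / trials

if __name__ == "__main__":
    rate = pass_rate("How do I request a refund?")
    # Assert a statistical bound instead of exact-match equality.
    assert rate >= 0.7, f"pass rate {rate:.0%} below threshold"
    print(f"pass rate: {rate:.0%}")
```

The threshold and trial count are the interesting knobs: too few trials and the bound is noise, too strict a threshold and every flaky-but-acceptable behavior blocks the build.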
So I’m curious how the community is approaching this.
Some questions I’m exploring:

- How can we systematically test AI agents while covering realistic scenarios and edge cases?
- Can we build “QA agents” that evaluate and validate other agents? (A rough sketch follows this list.)
- Are there effective methods today for validating reasoning, workflow execution, and tool orchestration?
- Can we estimate confidence or correctness up front, for example flagging whether a response is likely production-safe or only “50% reliable”?
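On the “QA agent” and confidence questions together, the shape I keep coming back to is an LLM-as-judge pattern: a second agent scores the first one's output against a rubric and gates what ships. Here is a toy sketch under assumed names (`judge`, `gate`, `PRODUCTION_THRESHOLD` are all hypothetical), with a keyword heuristic standing in for what would really be a second model call:

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    score: float  # judge's 0-1 confidence that the answer is acceptable
    reason: str

def judge(question: str, answer: str) -> Verdict:
    # Hypothetical "QA agent": in practice this would be a second model call
    # scoring the answer against a rubric; a keyword check stands in here.
    if "30 days" in answer:
        return Verdict(score=0.9, reason="mentions the policy window")
    return Verdict(score=0.4, reason="missing the key policy detail")

PRODUCTION_THRESHOLD = 0.8  # assumed cut-off; would be tuned per workflow

def gate(question: str, answer: str) -> str:
    # Route low-confidence answers to human review instead of shipping them.
    verdict = judge(question, answer)
    return ("production-safe" if verdict.score >= PRODUCTION_THRESHOLD
            else "needs-review")

if __name__ == "__main__":
    print(gate("How do I request a refund?",
               "Refunds can be requested within 30 days via the billing page."))
    print(gate("How do I request a refund?", "Please contact support."))
```

The obvious open problem is that the judge is itself probabilistic, so its scores need calibration against human-labeled samples before the gate means anything.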
Would love to hear what others in the Rovo community are thinking about this.