Checking Agents at scale

Nadia Volanovsky
I'm New Here
May 11, 2026

As an ALM specialist working with Rovo and AI-driven workflows, I’ve been thinking about a growing challenge in the industry:

How are we supposed to properly verify and test AI agents at scale?

Traditional QA and ALM practices were built for deterministic systems.
But agents behave differently:

  • reasoning is probabilistic

  • outputs can vary

  • edge cases are almost infinite

  • tool usage and memory introduce new failure points

So I’m curious how the community is approaching this.

Some questions I’m exploring:

  • How can we systematically test AI agents while covering realistic scenarios and edge cases?

  • Can we build “QA agents” that evaluate and validate other agents?

  • Are there effective methods today for validating reasoning, workflow execution, and tool orchestration?

  • Can we estimate confidence or correctness beforehand?
    Example: identifying whether a response is likely production-safe or only “50% reliable”

Would love to hear what others in the Rovo community are thinking about this.

@Dikla Tavor-Haimpur 

1 answer

1 accepted

0 votes
Answer accepted
Rebekka Heilmann _viadee_
Community Champion
May 12, 2026

Hi @Nadia Volanovsky, welcome to the Community!

Atlassian's current answer to this question is evals: https://community.atlassian.com/forums/Atlassian-AI-Rovo-articles/Introducing-Evaluations-for-Rovo-Agents/ba-p/3202093

I haven't done much with them myself and it's early days, but they at least give you repeatable tests against your "gold standard".
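To make "repeatable tests against a gold standard" concrete, here is a minimal, tool-agnostic sketch in Python. It is not the Rovo Evaluations feature or its API. `GoldCase`, `passes`, `pass_rates`, and the stand-in agent are all hypothetical names, and the agent is assumed to be any callable that takes a prompt and returns text. Because outputs can vary, each case is run several times and scored as a pass rate rather than a single pass/fail.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GoldCase:
    """One gold-standard test case (hypothetical structure)."""
    prompt: str
    required_phrases: list[str]  # every phrase must appear in the answer


def passes(answer: str, case: GoldCase) -> bool:
    """Case-insensitive check that all required phrases are present."""
    text = answer.lower()
    return all(phrase.lower() in text for phrase in case.required_phrases)


def pass_rates(agent: Callable[[str], str],
               cases: list[GoldCase],
               runs: int = 5) -> dict[str, float]:
    """Run every case `runs` times and return the pass rate per prompt."""
    rates = {}
    for case in cases:
        hits = sum(passes(agent(case.prompt), case) for _ in range(runs))
        rates[case.prompt] = hits / runs
    return rates


def make_flaky_agent() -> Callable[[str], str]:
    """Stand-in agent that gives a useful answer only on every other call."""
    calls = {"n": 0}

    def agent(prompt: str) -> str:
        calls["n"] += 1
        if calls["n"] % 2 == 0:
            return "I am not sure."
        return "Open a request in the service desk portal."

    return agent


cases = [GoldCase("How do I request access?", ["service desk"])]
rates = pass_rates(make_flaky_agent(), cases, runs=4)
print(rates)
```

Phrase matching is the crudest possible grader. The point of the sketch is the repeated runs per case, which turn "it worked once" into a measurable rate.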

I see two main problems at the moment:

1) Agents are not using their skills like they're supposed to

A lot of the time, Agents more or less ignore their skills, so pages are not created, comments are not published, or work items are not updated, even though the Agent claims it did the work. So you can't actually check the end result, only the Agent's text answer, which may be a blatant lie.
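One way to catch this gap between claim and reality is to compare the Agent's reply with the state of the target system after the run. The sketch below is only an illustration under assumptions: `ActionCheck`, `verify_claim`, and `comment_exists` are hypothetical names, and the in-memory `fake_comments` dictionary stands in for whatever read access you have to the real system (for example, a REST call that lists the comments on a work item).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ActionCheck:
    """A side effect the agent was expected to produce (hypothetical)."""
    description: str
    happened: Callable[[], bool]  # reads the real system state


def verify_claim(agent_reply: str, claim_marker: str, check: ActionCheck) -> str:
    """Compare what the agent said with what actually happened."""
    claimed = claim_marker.lower() in agent_reply.lower()
    done = check.happened()
    if claimed and done:
        return "verified"
    if claimed and not done:
        return "claimed_but_missing"  # the "blatant lie" case
    if done:
        return "done_but_unreported"
    return "not_attempted"


# In-memory stand-in for the target system; a real check would query it.
fake_comments = {"DEMO-1": ["Looks good to me"], "DEMO-2": []}


def comment_exists(issue_key: str, text: str) -> Callable[[], bool]:
    """Build a check that looks for `text` in the comments of one work item."""
    return lambda: any(text in c for c in fake_comments.get(issue_key, []))


reply = "Done! I added the comment to the work item."
ok = verify_claim(reply, "added the comment",
                  ActionCheck("comment on DEMO-1",
                              comment_exists("DEMO-1", "Looks good")))
lie = verify_claim(reply, "added the comment",
                   ActionCheck("comment on DEMO-2",
                               comment_exists("DEMO-2", "Looks good")))
print(ok, lie)
```

Keeping the state lookup behind a plain callable means the same comparison works for pages, comments, or work item fields, as long as something can read them back.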

2) There is no API for Studio

So you can't check Agents' instructions and setups at scale to validate whether each Agent follows guidelines such as:

  • allowed use case
  • filled in descriptions with the right info
  • naming conventions
  • restrictions to Manager and user permissions
  • ...

We'd need a public API to pull all Agents. The open feature request is here: https://jira.atlassian.com/browse/ROVO-516

The alternative: use UI automation and click through everything. But that would be a nightmare to maintain, as the UI keeps changing.
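If Agent definitions were available in some structured form, whether through the requested API or a manual export, the guideline checks listed above could run as a simple lint pass. The sketch below assumes exactly that and invents everything else: the `AgentRecord` fields, the naming pattern, the allowed use cases, and the minimum description length are hypothetical placeholders, not anything Studio exposes today.

```python
import re
from dataclasses import dataclass


@dataclass
class AgentRecord:
    """Hypothetical shape of one exported Agent definition."""
    name: str
    description: str
    use_case: str


# Hypothetical house rules; replace with your own guidelines.
NAME_PATTERN = re.compile(r"^[A-Z]{2,5} - .+")  # e.g. "HR - Onboarding helper"
ALLOWED_USE_CASES = {"knowledge-search", "ticket-triage"}
MIN_DESCRIPTION_LENGTH = 20


def lint_agent(agent: AgentRecord) -> list[str]:
    """Return a list of guideline violations for one Agent."""
    findings = []
    if not NAME_PATTERN.match(agent.name):
        findings.append("name does not follow the naming convention")
    if len(agent.description.strip()) < MIN_DESCRIPTION_LENGTH:
        findings.append("description is missing or too short")
    if agent.use_case not in ALLOWED_USE_CASES:
        findings.append(f"use case '{agent.use_case}' is not allowed")
    return findings


good = AgentRecord("HR - Onboarding helper",
                   "Answers onboarding questions from the HR space.",
                   "knowledge-search")
bad = AgentRecord("my test agent", "", "anything")

for agent in (good, bad):
    print(agent.name, lint_agent(agent))
```

The checks themselves are trivial. What is missing today is a reliable way to get the records, which is why the feature request matters.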

Dr Valeri Colon _Connect Centric_
Community Champion
June 16, 2026

@Nadia Volanovsky welcome! Rebekka’s answer is spot on. Evals are the current starting point for repeatable agent testing, but they only solve part of the problem. For scale, teams still need test sets, expected outputs, human review, and checks that actions actually happened—not just that the agent said they did. A Studio API would make governance much easier.
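On the original question of telling "production-safe" from "50% reliable": one hedged approach is to treat repeated test runs as a sample and put a confidence interval around the observed pass rate, sending unclear cases to human review. The sketch below uses the Wilson score interval, a standard statistical formula. The `triage` function, its labels, and the 0.9 threshold are assumptions for illustration only.

```python
import math


def wilson_interval(passes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z = 1.96 is roughly 95%)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = passes / runs
    denom = 1 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def triage(passes: int, runs: int, threshold: float = 0.9) -> str:
    """Classify a scenario by comparing the interval with a target pass rate."""
    low, high = wilson_interval(passes, runs)
    if low >= threshold:
        return "accept"        # even the pessimistic estimate meets the target
    if high < threshold:
        return "reject"        # even the optimistic estimate misses the target
    return "human_review"      # not enough evidence either way


for passes, runs in [(100, 100), (10, 10), (5, 10)]:
    low, high = wilson_interval(passes, runs)
    print(f"{passes}/{runs}: [{low:.2f}, {high:.2f}] -> {triage(passes, runs)}")
```

Note that 10 passes out of 10 still lands in human review here: with so few runs, the lower bound sits well below 0.9. This only measures reliability on the scenarios you tested, so it complements the test sets and human review mentioned above rather than replacing them.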
