Three Pillars of Trustworthy Agentic AI Test Execution

“An AI test execution agent? How can I trust it when I know AI is not reliable?”

This is often the first question QA teams ask when they discover agentic AI test execution. And it is the right question.

In previous articles, I looked at how agentic AI test execution works and integrates with Xray through Lynqa, and how it complements AI-assisted scripting.

In this article, I want to focus on reliability. Not reliability as a marketing promise, but reliability as something a QA team can measure, review, and improve in its own Jira/Xray environment.

Disclosure: I am co-founder of Smartesting, the company behind Lynqa. Lynqa for Xray brings agentic test execution directly into Xray, so QA teams can execute their existing manual or Gherkin tests without scripts or locators, and review the evidence inside their Xray workflow.

AI can be wrong. The question is how the risk is controlled

As everyone knows, generative AI models can hallucinate. They can produce an answer that looks plausible but is wrong. This is not a defect; it is how generative models work. They are probabilistic systems, and for the same input they may produce different outputs.

That does not prevent AI from delivering real and measurable value. But for AI in testing, the risks vary depending on the use case.

When generative AI is used to improve user stories or generate test cases, the main risks are typically:

  • Irrelevant output: the reformulation does not reflect the business need, or the generated test cases do not make sense for the application.
  • Incomplete output: important acceptance criteria are missing, or key test situations are not covered.

In these situations, the root cause is often weak context engineering. Whatever model you use, it needs precise, structured, and relevant project context to produce useful results.

With agentic AI test execution, the risk is different. Why? Because the agent acts on the system under test. It does not only generate content: it executes, observes, interprets, and produces a verdict.

What can go wrong during agentic test execution?

The error patterns are not unique to agentic AI. A human QA tester and an AI agent executing a manual test face the same types of execution risks. Both may:

  • Fail to perform an action requested by the scenario. Causes are familiar: incomplete environment, missing data, unavailable action, or difficulty understanding how to perform a specific interaction. For AI agents, this was a major limitation a year ago. It has improved considerably with better tools, stronger visual perception, and better reasoning capabilities.

  • Have doubts about the result and produce a false negative. For example, the actual message is close to the expected message but not identical. The agent marks the step as failed, while a human reviewer later concludes that the behavior is acceptable.

  • Miss an issue and produce a false positive. The most concerning case: the test passes when it should have failed. Manual testers know this risk well. Some defects are subtle, expected results can be ambiguous, and some situations are easy to misinterpret. An AI test execution agent can encounter the same difficulty.

This is why trust in agentic AI test execution must be built on facts, not promises. In practice, I see three pillars for building that trust:

  1. A high level of measured reliability.
  2. The ability to interact with the human tester, express uncertainty, and learn reusable project knowledge.
  3. Transparency in the agent’s actions and explanations.

[Image: The three pillars of trustworthy agentic AI test execution.]

These three pillars are also the way we design Lynqa for Xray: measurable execution, clarification when the instruction is ambiguous, and evidence that remains visible in Xray.

Pillar 1 - Measuring Reliability

Evaluating an AI test execution agent requires three components: representative test sets, automated evaluation metrics, and an execution infrastructure that makes results repeatable. We usually call this benchmarking.

One important point: there is currently no public reference benchmark dedicated specifically to AI test execution agents.

The closest category is the family of “computer-use” benchmarks. These evaluate the ability of general-purpose AI agents to interact autonomously with applications, as a user would. Benchmarks such as OSWorld and WebArena show strong progress in computer-use agents over the last two years.

These benchmarks do not measure software testing directly, but they provide a useful signal about the maturity of AI agents that interact with real applications.

State of the art as of April 2026 (public leaderboards):

Benchmark  | Application type | Tasks | Best AI success rate | Human baseline
OSWorld    | Desktop          | 369   | ~78–80%              | ~72%
WebArena   | Web              | 812   | ~69–74%              | ~78%

On OSWorld, the best agents now match or exceed human performance on completion rate. On WebArena, they have moved from ~14% at launch to consistently above 70%. This does not mean AI is perfect, but the technological foundation is mature enough for real use cases under human supervision.

What we measure when evaluating Lynqa

At the Smartesting AI Lab, we have been researching agentic AI test execution since 2023, well before computer-use benchmarks like OSWorld reached today’s maturity. I cannot speak for every AI test execution agent on the market, but I can share what we have learned while building and evaluating Lynqa over more than two years of iteration.

Compared with general-purpose computer-use agents, evaluating an AI test execution agent requires special attention to two points:

  • whether the agent respects the test intent expressed in the scenario;
  • whether the agent avoids false positives when checking expected results.

For a test execution agent, it is not enough to look at the final result. You need to analyze how the test was executed and how the verdict was produced. An agent may reach the right final state for the wrong reasons.

On our internal benchmarks, built and continuously enriched across representative web and desktop applications, Lynqa’s per-step execution reliability is above 85%.

This means that, out of 100 executed test steps, more than 85 are assessed correctly, with the right verdict and a meaningful explanation. The score includes verdict correctness, explanation quality, false negatives, false positives, and cases where the agent cannot yet perform the requested action.
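To make this score concrete, here is a minimal sketch of how a per-step reliability figure could be computed from human-reviewed step outcomes. The field names and the 87-out-of-100 example are illustrative assumptions, not Lynqa's internal evaluation format or actual benchmark data.

```python
from dataclasses import dataclass

@dataclass
class StepOutcome:
    """One executed test step, labeled against a human-reviewed ground truth (illustrative)."""
    verdict_correct: bool      # agent's PASS/FAIL matches the reviewer's verdict
    explanation_useful: bool   # reviewer judged the agent's explanation meaningful
    action_performed: bool     # the agent managed to perform the requested action

def per_step_reliability(steps: list[StepOutcome]) -> float:
    """Share of steps assessed correctly: right verdict, useful explanation, action performed."""
    if not steps:
        return 0.0
    correct = sum(
        s.verdict_correct and s.explanation_useful and s.action_performed
        for s in steps
    )
    return correct / len(steps)

# Illustrative data only: 100 reviewed steps, 87 fully correct -> 87% per-step reliability.
outcomes = [StepOutcome(True, True, True)] * 87 + [StepOutcome(False, True, True)] * 13
print(f"Per-step reliability: {per_step_reliability(outcomes):.0%}")
```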

This level is high enough to justify serious evaluation in real QA workflows. It is not a claim that the agent is perfect. Errors still exist, and they mainly fall into two categories: missing capabilities and misinterpretation of ambiguous expected results.

This leads us to the second pillar.

Pillar 2 - Human/AI clarification loop and knowledge acquisition

A trustworthy AI test execution agent should not pretend to know everything. It should detect when it is uncertain, ask for clarification, and learn from the answer.

In practice, this works through an observation mechanism that analyzes the execution trace and the agent’s reasoning. When significant uncertainty is detected about the interpretation of a step, the agent prepares a request for the human QA tester; a minimal sketch of this loop follows the list below. The human answer can lead to:

  • the creation of reusable knowledge for the agent;
  • a suggested reformulation of the test step;
  • or both.
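As an illustration of this clarification loop, here is a minimal sketch of how an uncertainty check could trigger a request to the tester. The threshold value, field names, and ask_human callback are assumptions made for the example, not Lynqa's actual internals or API.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class StepObservation:
    """What the observation mechanism sees for one step (illustrative structure)."""
    step_text: str
    agent_reasoning: str
    uncertainty: float  # 0.0 = fully confident, 1.0 = completely unsure

@dataclass
class ClarificationOutcome:
    """Possible results of the human answer, matching the list above."""
    reusable_knowledge: Optional[str] = None
    reformulated_step: Optional[str] = None

UNCERTAINTY_THRESHOLD = 0.6  # assumed value; in practice this would be tuned per project

def maybe_clarify(obs: StepObservation,
                  ask_human: Callable[[str], ClarificationOutcome]) -> Optional[ClarificationOutcome]:
    """Raise a clarification request only when uncertainty about a step is significant."""
    if obs.uncertainty < UNCERTAINTY_THRESHOLD:
        return None  # confident enough: execute without interrupting the tester
    question = f"Ambiguous step: '{obs.step_text}'. Agent reasoning: {obs.agent_reasoning}"
    return ask_human(question)

# Example: the "last month" step discussed below, answered by the QA tester.
outcome = maybe_clarify(
    StepObservation(
        step_text="Verify the most read articles from last month",
        agent_reasoning="Page only shows sections up to February 2026; 'last month' is ambiguous.",
        uncertainty=0.8,
    ),
    ask_human=lambda q: ClarificationOutcome(
        reusable_knowledge="Last month = the most recent section available on the page.",
    ),
)
```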

A concrete example

A test case checks a website section called “Most read articles.” One step asks the agent to verify the “most read articles from last month.”

The test runs in early April 2026. On the website, however, the most recent monthly section available is February 2026.

So what does “last month” mean here? March 2026, the calendar month before April? Or the most recent monthly section actually available on the page?

This is the kind of ambiguity a human tester would notice, and a good AI test execution agent should too.

The agent raises a clarification request. The QA tester answers: “Last month = the most recent section available on the page.”

[Image: Lynqa raises a clarification request when a test instruction is ambiguous, and stores the answer as reusable knowledge.]

From that answer, the agent creates reusable knowledge:

“When checking content corresponding to ‘last month,’ use the most recent month actually available on the page, not necessarily the calendar month before the current date.”

The scope of this knowledge can be defined at the step, scenario, or test suite level. This ability to acquire knowledge lets the agent learn the implicit rules, conventions, and preferences of the QA team.
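As an illustration of how such a learned rule and its scope could be represented, here is a small sketch; the enum values and field names are assumptions made for the example, not Lynqa's data model.

```python
from dataclasses import dataclass
from enum import Enum

class KnowledgeScope(Enum):
    STEP = "step"
    SCENARIO = "scenario"
    TEST_SUITE = "test_suite"

@dataclass(frozen=True)
class ProjectKnowledge:
    """A reusable rule learned from a tester's clarification (illustrative model)."""
    rule: str
    scope: KnowledgeScope

last_month_rule = ProjectKnowledge(
    rule=("When checking content corresponding to 'last month', use the most recent "
          "month actually available on the page, not necessarily the calendar month "
          "before the current date."),
    scope=KnowledgeScope.SCENARIO,
)
```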

For a QA team, this changes three things:

  • ambiguous situations are detected;
  • human intervention is requested only when needed;
  • the agent’s behavior improves over time.

On our Lynqa Lab benchmarks, the human/AI clarification loop and knowledge acquisition address more than 90% of detected uncertainty cases, eliminating most misinterpretations caused by ambiguous instructions. Some cases will still require review. That is expected in a supervised execution model. The relevant question is not whether the agent is perfect, but whether the value gained justifies the cost of human control.

Because an AI test execution agent should remain under human supervision, it should facilitate that supervision. This is the third pillar of building trust.

Note: The human/AI clarification loop and knowledge acquisition mechanisms described here are currently in the experimental validation phase in the Smartesting AI Lab with pilot QA teams, ahead of their rollout in Lynqa for Xray. If you are interested in joining the pilot program, please let me know in the comments or reach out by DM on LinkedIn (link in my profile).

Pillar 3 - Transparency of actions and explanation of results

An AI test execution agent can now run long functional scenarios, flag uncertainty, and reuse project-specific knowledge. But it must remain under the control of the human QA tester. The final decision on whether to accept an execution result is a human responsibility.

A trustworthy agent must therefore make its activity readable, verifiable, and auditable. Reading an execution report, the QA tester should be able to answer five questions for every step:

  • Where did the agent navigate?
  • What actions did it perform?
  • What did it observe on the screen?
  • Why did it conclude PASS or FAIL?
  • What evidence supports the verdict?

Screenshots play an essential role: they document actions and observed states and make human review fast.
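One way to picture a report that answers these five questions, with screenshots attached as evidence, is a per-step record like the sketch below; the field names and sample values are assumptions for illustration, not the actual Lynqa for Xray report schema.

```python
from dataclasses import dataclass, field

@dataclass
class StepReport:
    """Per-step evidence answering the five review questions (illustrative fields)."""
    navigation: str                 # where the agent navigated
    actions: list[str]              # what actions it performed
    observation: str                # what it observed on the screen
    verdict: str                    # "PASS" or "FAIL"
    rationale: str                  # why it concluded PASS or FAIL
    screenshots: list[str] = field(default_factory=list)  # evidence supporting the verdict

example = StepReport(
    navigation="Home page > 'Most read articles' section",
    actions=["Scroll to the 'Most read articles' section"],
    observation="The most recent monthly section displayed is February 2026",
    verdict="PASS",
    rationale="Matches the agreed interpretation of 'last month': most recent section available.",
    screenshots=["step_3_most_read_articles.png"],
)
```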

[Image: A Lynqa for Xray execution report: per-step actions on the left, expected results with observed evidence on the right, screenshots and verdicts attached.]

In case of failure, the agent’s comment must be even more explicit. It should help the tester distinguish between a real defect, a misinterpretation, and a limitation in execution.

This requirement for transparency is not specific to AI. It already existed when teams outsourced manual test execution: the QA team expected detailed tracking, evidence, and the ability to review each step in case of doubt. With an AI test execution agent, the need is the same.

How to run a first evaluation in Xray

Reliability cannot be evaluated only by reading benchmark numbers, including the ones I shared above. A QA team needs to see how an AI test execution agent behaves on its own application, its own test cases, and its own definition of acceptable risk.

A simple way to start is to run a small evaluation of agentic AI test execution inside Xray. This is also a useful exercise for improving your test repository: ambiguous expected results, implicit business rules, missing test data, and unclear navigation instructions become visible quickly when an agent tries to execute the test as written. Agentic execution does not only automate test runs; it also reveals where test cases need to be clarified.

This is the idea behind Lynqa for Xray: execute your existing Xray tests, with no rewriting and no scripts, while keeping step-by-step proof directly in Xray.

For teams using Jira Cloud and Xray, Lynqa for Xray is available on the Atlassian Marketplace. The native integration matters in practice: the test case stays in Xray, the execution is launched from Xray, and the evidence is reviewed in Xray.

A few practical recommendations:

  • Pick a representative subset of your test suites.
  • Cover different complexity levels: happy paths, edge cases, end-to-end scenarios.
  • Review the failures, not just the pass rate. That is where you learn the most; one simple way to tally them is sketched after this list.
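To support that last recommendation, here is a minimal sketch of tallying reviewed failures into the three categories discussed under Pillar 3 (real defect, misinterpretation, execution limitation); the test case IDs, category names, and data shape are assumptions made for the example.

```python
from collections import Counter

# Illustrative review of failures from a first evaluation run. Each failure is
# classified as a real defect, a misinterpretation, or an execution limitation.
reviewed_failures = [
    ("TC-101, step 4", "real_defect"),
    ("TC-112, step 2", "misinterpretation"),
    ("TC-112, step 5", "misinterpretation"),
    ("TC-130, step 7", "execution_limitation"),
]

by_category = Counter(category for _, category in reviewed_failures)
for category, count in by_category.most_common():
    print(f"{category}: {count}")
```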

Conclusion

Agentic AI test execution has progressed quickly. In 2026, it is mature enough for QA teams to evaluate seriously, provided it remains measurable, transparent, and supervised.

The right question is not “Can I blindly trust an AI agent?” The answer to that one should be no.

The better question is: “Can I measure how it behaves on my own tests, understand its decisions, and keep human control where it matters?”

That is the approach we are taking with Lynqa for Xray: execute existing manual test cases or Cucumber scenarios as-is, expose the evidence step by step, surface ambiguity when needed, and keep the QA tester in control.

If you already use Xray on Jira Cloud, a good first step is to choose a representative subset of tests, run them with Lynqa, and review the failures carefully. The failures are often the most useful part of the evaluation: they reveal ambiguous expected results, missing test data, unclear navigation steps, or implicit business rules.

My question for the community:

What reliability signal would convince your team to evaluate an AI test execution agent seriously?

And as an invitation: Share one of your trickiest manual test steps in the comments, especially one where the expected result could be interpreted in two ways, and I’ll explain how a good test agent should handle it.
