Does using AI in QA testing increase risk for software companies?

If you want a signal of just how widespread AI has become in software development, consider this: Vibe coding was named Collins Dictionary’s Word of the Year for 2025. The term describes developers quickly prototyping apps using AI tools with minimal planning or structure — a trend that captures the current mood of experimentation with artificial intelligence.

While we’re still far from letting AI build entire mission-critical applications on its own, we are entering a new era in how software is tested. With the right mindset and safeguards, QA teams can use AI to reduce manual work, increase speed and cover more ground during release cycles.

But using AI in testing is not without risk. The same tools that generate helpful test scripts can also produce misleading or faulty ones. And in QA, where trust and accuracy are non-negotiable, careless use of AI can lead to real damage — to products, to teams and to the people who use the software.

This article explores both the potential and the pitfalls of using AI in QA, and how software companies can strike the right balance between speed and safety.

The shift toward AI in testing

AI-assisted coding has already changed how developers work. Tools that generate code, sometimes entire applications, from natural language prompts have made it easier to build features faster. Developers spend less time searching through documentation or worrying about syntax — the productivity gains are hard to ignore.

But AI-generated code is not always correct. It can contain subtle logic errors, security vulnerabilities, or poorly structured code that’s hard to maintain. This means that strong testing practices are more important than ever. QA teams are the safety net for modern development, and now they too are starting to use AI in their workflows.

For testers, AI can help write test cases, generate test scripts, summarize test runs and even write bug reports. Used thoughtfully, these tools reduce the tedious, repetitive work that often slows QA down. But these benefits come with caveats. The same models that write helpful test cases can also hallucinate (that is, invent scenarios that don’t exist), miss critical logic, or generate test scripts that look right but don’t actually test anything meaningful.

That’s a serious concern. If a test appears valid and passes, but doesn’t actually verify the correct behavior of the software, it creates a false sense of confidence. This is how flawed products reach customers — with the worst-case outcome being not just bugs, but reputational or legal damage.
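To make the false-confidence problem concrete, here is a minimal Python sketch. Everything in it is hypothetical and invented for illustration: the `apply_discount` function, its deliberately broken twin `buggy_discount`, and the two checks. The first check resembles the kind of test that looks plausible but asserts almost nothing; the second pins expected values.

```python
from typing import Callable

PriceFn = Callable[[float, float], float]


def apply_discount(price: float, percent: float) -> float:
    """Hypothetical function under test: reduce a price by a percentage."""
    return round(price * (1 - percent / 100), 2)


def buggy_discount(price: float, percent: float) -> float:
    """Deliberately wrong version: adds the percentage instead of removing it."""
    return round(price * (1 + percent / 100), 2)


def vacuous_check(fn: PriceFn) -> bool:
    # Looks like a test, but only confirms that a float came back.
    # It cannot tell a correct discount from an incorrect one.
    result = fn(200.0, 25.0)
    return result is not None and isinstance(result, float)


def meaningful_check(fn: PriceFn) -> bool:
    # Pins the expected values, so a wrong formula is caught.
    return fn(200.0, 25.0) == 150.0 and fn(80.0, 0.0) == 80.0


print(vacuous_check(buggy_discount))     # True: the bug goes unnoticed
print(meaningful_check(buggy_discount))  # False: the bug is caught
print(meaningful_check(apply_discount))  # True
```

A reviewer reading a generated test can ask the same question this sketch asks: would the test still pass if the code under test were wrong?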

What makes agentic testing risky

A newer and more ambitious application of AI in QA is what’s called agentic testing. Here, AI agents don’t just generate test code — they attempt to simulate the actions of human testers. They interact with the app directly: clicking buttons, filling out forms, navigating screens and logging results. The idea is that these agents can autonomously explore the app, find issues, and even generate bug reports. It’s almost like having an extra QA engineer on the team.

While this sounds promising, it’s also where some of the biggest risks are emerging. Agentic testing introduces unpredictability into a process that depends on clear pass-or-fail results. Good prompting and conservative temperature settings can make agents behave more deterministically, but the agents remain sensitive to subtle changes in context, such as a slightly delayed loading state, a minor UI shift or even dynamic content reordering. These variations can cause the agent to behave differently across runs, leading to test flakiness that stems not from actual bugs, but from the AI’s inconsistent interpretation of the interface. This fragility makes agentic testing difficult to rely on, especially at scale.
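One simple way to surface this kind of flakiness is to repeat the same check several times and compare the outcomes. The sketch below assumes a hypothetical `run_check` callable that wraps a single agent run and returns pass or fail; the function name, the labels and the default of five runs are illustrative choices, not part of any particular tool.

```python
from collections import Counter
from typing import Callable


def classify_stability(run_check: Callable[[], bool], runs: int = 5) -> str:
    """Run the same check repeatedly and label how consistent it is."""
    outcomes = Counter(run_check() for _ in range(runs))
    if outcomes[True] == runs:
        return "stable-pass"
    if outcomes[False] == runs:
        return "stable-fail"
    # Mixed results: the verdict depends on the run, not only on the app.
    return "flaky"


# Stand-in for an agent whose verdict changes between runs.
simulated_results = iter([True, False, True, True, False])
print(classify_stability(lambda: next(simulated_results)))  # flaky
print(classify_stability(lambda: True))                     # stable-pass
```

Repeating runs only detects inconsistency; it does not explain it, and it multiplies the cost of every agentic test.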

With traditional test scripts, a human tester or developer can read through the code, understand the intent and verify whether the test is sound. Agentic testing doesn’t offer that same transparency. While agentic testing tools often provide video recordings, logs or DOM snapshots of their actions, relying on these outputs for validation creates a new kind of bottleneck. Reviewing even a few minutes of footage to understand what the agent did — and why — is dramatically slower than scanning through a deterministic test script. You lose the quick readability and repeatability of traditional test code, and instead inherit a review process that feels more like a forensic investigation. Over time, this erodes confidence in the system and adds significant overhead every time a test misfires or fails without a clear reason.

This lack of visibility is a major red flag for any experienced QA professional. Testing isn’t just about the outcome — it’s about traceability, repeatability and accountability. Without those things, QA becomes a black box. And black boxes are dangerous, especially in tightly regulated industries like finance, healthcare or infrastructure.

Using AI in QA without compromising quality

There’s no question that AI tools can help QA teams move faster. Writing test cases, automating repetitive checks, and generating documentation are all areas where AI can reduce friction and free up time. But relying on these tools without human oversight is risky — especially when there’s no system in place to track what was generated, who approved it or whether it has been validated.

Human review remains essential. Every AI-written test case, script or bug report must be checked by someone with the experience to spot errors and ask the right questions. Accountability must be clearly assigned — not just for the test execution, but for the outputs of any AI-assisted work. If a mistake is made, someone must be responsible for fixing it, and more importantly, understanding why it happened.

And beyond people, the team also needs the right tools. Managing QA in spreadsheets, chat threads or sticky notes isn’t enough when AI starts to touch more parts of the process. There needs to be structure — a system that captures what was tested, how it was tested, who reviewed it and what the outcomes were. Without that infrastructure, AI will only add noise to an already complex process.

This risk is especially high with AI-driven testing. If someone simply asks an agent to “go test the app” and later reports that “it works,” you’re left with no clear ownership and no way to verify what was actually done. When something breaks, you need to know who ran the test, how it was created and whether it was ever reviewed. Without that level of traceability and responsibility, it’s impossible to enforce quality. And when bugs slip through, there’s no one to hold accountable, just a vague AI trail and a broken build.
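As a rough illustration of the record-keeping described above, the following Python sketch models one test result together with its provenance. The `QaRecord` class, its field names and the `accepted_as_evidence` rule are all assumptions made for this example; a real test management system would carry far more detail.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QaRecord:
    """Hypothetical audit entry for a single test."""
    test_id: str
    what_was_tested: str
    origin: str                        # "human", "ai-assisted" or "agent"
    created_by: str                    # person or agent run that produced it
    reviewed_by: Optional[str] = None  # named human reviewer, if any
    outcome: Optional[str] = None      # "pass" or "fail" once executed


def accepted_as_evidence(record: QaRecord) -> bool:
    """Illustrative policy: count a result only if it is traceable."""
    if record.outcome is None:
        return False  # never executed, so nothing to rely on
    if record.origin != "human" and record.reviewed_by is None:
        return False  # AI-produced work with no accountable reviewer
    return True


unreviewed = QaRecord("T-101", "checkout flow", "agent", "agent-run-17", outcome="pass")
reviewed = QaRecord("T-102", "login form", "ai-assisted", "sam",
                    reviewed_by="dana", outcome="pass")

print(accepted_as_evidence(unreviewed))  # False
print(accepted_as_evidence(reviewed))    # True
```

The point of the rule is that a passing result from an agent counts for nothing until a named person has reviewed it.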

AI as an unreliable booster

Using AI in QA gives teams what could be called unreliable productivity. It speeds things up, but doesn’t guarantee quality. The time saved can be valuable — but only if it’s reinvested into deeper validation, more exploratory testing and stronger processes overall.

AI is not a replacement for human testers. It’s a tool — and like any tool, it can be dangerous in the wrong hands. The right way to use it is with clear policies, transparent systems and a strong culture of accountability. That’s how teams can gain the benefits of AI without falling into the traps.

For high-stakes products and environments where trust is everything, there’s no room for blind faith. You need to know, not just assume, that your software works. And that’s something no AI can do alone.

This article is published as part of the Foundry Expert Contributor Network.