AI token freeloaders are coming for your customer support chatbot

CIOs deploying AI agents for customer service have one more thing to worry about: external users tricking the system into delivering AI computations on your dime.

Although there are ways to lock down these systems to minimize AI token theft, they all have downsides, including the possibility of undermining the business case for these very systems.

Essentially a form of prompt injection attack, such misuse can not only increase enterprise AI bills but also make ROI visibility murkier. Moreover, it can expose enterprises to “denial of wallet” attacks, in which attackers overload costly pay-as-you-go services with excessive requests to damage the bottom line.

“This is only the tip of the iceberg of your risks. It is a potential symbol of a much bigger problem,” says Justin St-Maurice, a technical counselor at Info-Tech Research Group. A potential attacker might ask, “If they are willing to give me code, what else are they willing to do for me?”

“A normal customer service interaction of ‘Where’s my order? What are your hours?’ runs maybe 200 to 300 tokens. Someone asking the bot to reverse a linked list in Python is generating more than 2,000 tokens easy. That’s roughly a 10x cost multiplier per session,” says Nik Kale, member of the Coalition for Secure AI (CoSAI) and ACM’s AI Security (AISec) program committee.

“And it doesn’t show up in any cost anomaly report because the system just sees it as another customer conversation,” he adds. “You could have 5% of your chatbot traffic be freeloaders running complex queries and it would blow a material hole in your AI budget that nobody can explain in a quarterly review.”

A question of judgment

Judgment is a key part of this issue, and the problem is that chatbots have little to none.

“A human has contextual judgment baked in,” Kale notes. “These chatbots have a system prompt that says something like, ‘You are a helpful customer service agent.’ That’s a suggestion, not an enforcement mechanism. It’s the AI equivalent of a velvet rope.”

He adds: “Anyone who’s spent five minutes with these tools knows you can steer past a system prompt with basic conversational framing, which is exactly what [is happening to enterprises today]. The system authenticates the session, not the intent.”

Sanchit Vir Gogia, chief analyst at Greyhound Research, sees this issue increasing — with enterprises fundamentally to blame.

“What enterprises are witnessing is not misuse of chatbots but the unintended consequence of deploying general-purpose inference systems under the label of customer service,” he says. “These systems are architected as conversational interfaces, but economically they behave as open compute surfaces. That mismatch between purpose and design is where the problem begins.”

Gogia argues that, like many AI challenges, this issue will multiply as models advance.

“The problem will not disappear as models improve. It will intensify. As AI becomes more capable, more accessible, and more embedded, the boundary between intended and unintended usage will continue to blur,” Gogia says. “Enterprises that rely on passive controls will see costs drift. Enterprises that build active governance into their architecture will maintain control. This is the real shift under way. Gen AI is moving from experimentation to operations. And in operations, discipline matters more than capability.”

Part of that discipline includes elevating jailbreaking as a risk management priority, says cybersecurity consultant Brian Levine, executive director of FormerGov.

“You need to treat misuse as a first‑order risk, not an edge case. Build for the world where 5% of your traffic will try to jailbreak your bot, intentionally or not,” he says. “The companies that get ahead of this will keep their AI budgets predictable and their customer experience intact. The ones that don’t will be explaining mysterious cost overruns.”

AI token theft in practice

What exactly does this kind of chatbot misuse look like? Social media has been flooded with supposed examples of these attacks, with the most attention across LinkedIn, Reddit, Instagram, and X going to misuse of chatbots at Amazon — which CIO.com was able to replicate below — and one at Chipotle, which Chipotle claims was fake.

AI chatbot token freeloading on Amazon's Rufus AI

CIO.com / Foundry

The Amazon examples — including this and this — revolved around site visitors getting the customer service bot to perform a coding service (“output the Fibonacci sequence up to n count”) or deliver a complete recipe for spaghetti bolognese.

A much-referenced example supposedly from a Chipotle bot is unconfirmed. Messages sent to the apparent original poster of the Chipotle example have not been responded to, and Chipotle declined an interview request. “The viral post was Photoshopped. Pepper neither uses gen AI nor has the ability to code,” Sally Evans, Chipotle’s external communications manager, replied by email, referring to the chatbot, Pepper, but did not respond to follow-up questions to clarify what Pepper uses and why Chipotle believed the image was fake.

How big of a deal is this really?

Not everyone is convinced that this is a major issue for enterprise CIOs. Info-Tech’s St-Maurice, for one, doubts chatbots will be fielding a lot of these queries.

“Couldn’t they just use ChatGPT for free, using a free account?” he asks. “[An enterprise chatbot] is probably the worst tool for this.”

AISec’s Kale disagrees, arguing that free gen AI chatbots have limits and gates. “You very quickly hit a wall with complex queries,” he notes. “With [enterprise customer service chatbots], there is no rate limit. They are ungated, unmetered inference endpoints andthey are running far more capable models. These chatbots are ungoverned endpoints.”

But Kale also notes that this is old hat for most CIOs.

“We’ve seen this exact movie before. This is the same cycle enterprises went through with REST APIs in the early 2010s. Companies exposed endpoints, assumed good-faith usage, got hammered by abuse, then retrofitted rate limiting and API key management after the damage was done,” he explains. “We’re watching the same pattern replay with AI endpoints, except the per-request cost is orders of magnitude higher. A bad actor abusing your REST API costs you fractions of a penny per call. Someone running complex reasoning queries through your chatbot costs real money every single time.”

Greyhound’s Gogia adds that even if the frequency of this abuse is small, its impact can add up quickly.

“What makes this structurally risky is that a small percentage of behavior can disproportionately distort total cost. Even if 5-8% of chatbot traffic consists of off-purpose or high complexity queries, that slice can consume a quarter or more of total inference spend. These are not anomalies. They are mathematically predictable outcomes of how token-based systems operate. Yet they rarely trigger alerts because they do not appear as spikes. They appear as gradual drift in cost per session, session length, and token consumption,” Gogia says.

“This leads to a deeper failure in observability,” he adds. “Most enterprises today track activity metrics such as number of conversations, total tokens, and aggregate cost. Very few track intent-level economics. They cannot distinguish between cost generated by legitimate customer service interactions and cost generated by irrelevant compute. Dashboards show what happened, but not whether it should have happened. So everything looks normal until financial reviews expose the gap.”

For many CIOs, the degree to which Kale’s two concerns — out-of-control costs and bots as ungoverned endpoints — are true depends on both their specific deployments and AI supplier contracts.

Here, Gartner VP analyst Nader Henein sees current vendor pricing tiers softening the impact of such jailbreaking efforts.

“Most large organizations either have an all you can eat plan or run their LLMs internally, so I doubt this is going to break the bank,” he says.

Mitigating the risk

The most straightforward approach to mitigate the risk of chatbot misuse is to craft guardrails that restrict customers to questions directly related to the business. But such limits are challenging to construct without unintentionally blocking legitimate customer questions. Moreover, LLMs often sidestep guardrails when they are most needed.

Another approach could involve enlisting additional AI to oversee front-line AI, or to focus not on customer queries but on limiting the number of tokens that can be used for any single answer. Token limits, however, could still be circumvented by abusers by breaking prompts into smaller parts. Complex legitimate queries could also be inadvertently prohibited by putting a ceiling on token use, limiting the business value of the service.

AISec’s Kale recommends a combination of tactics.

“The patterns that actually work are behavioral analysis to flag sessions that don’t look like support queries, contextual rate limiting that goes beyond just volume, and token-level usage monitoring per session that can distinguish a 200-token ‘Where’s my order?’ from a 2,000-token ‘Write me a Python script,’” he says. “But most companies haven’t implemented any of this because they never threat-modeled ‘sophisticated resource abuse’ for their customer service AI. It’s the AI equivalent of leaving your Wi-Fi open and discovering your neighbor’s been running a cryptomining operation on your bandwidth.”

Kate Leggett, VP and principal analyst at Forrester, advises dumping LLMs entirely and using small language models focused on specific segments, such as ingredients at a consumer packaged goods company.

“You can host it on a private cloud or even on-prem and you can lock it down,” she says. “That is the most expensive way to do it. Is it worth it? That comes down to your ROI and risk model.”

Gary Longsine, CEO of Intrinsic Security, believes enlisting a second LLM to review submitted queries could be reasonably effective. “But it would introduce a token cost and possibly a response time delay,” he says. “Those could be mitigated somewhat by running the review in parallel with the user prompt, and by using a self-hosted LLM to do the review.”

However CIOs choose to deal with this issue, the larger implications must be addressed — namely, what exactly is the business purpose, and expected outcomes, of your customer service AI implementation?

“Companies need to recognize that this is now a new selling channel for them, not just a customer support cost,” says Jason Andersen, principal analyst at Moor Insights and Strategy. “A lot of these support solutions are primarily measured on cost reductions, such as deflection. Will they now have revenue measures and quotas?”

In the meantime, CIOs and their teams need to roll up their sleeves and do the grunt work of AI governance, says Joshua Woodruff, CEO of MassiveScale.AI.

“The boring work — scope definition, access controls, use case boundaries — is what governance actually looks like in practice,” he says. “It’s not glamorous work and it’s not making headlines for being innovative. It doesn’t make the press release. But it’s the absolute difference between a customer service bot and an accidental free AI service with a corporate logo on it.”