Hackers Are Weaponizing Conversation to Jailbreak AI Chatbots, Researchers Say

The methods used to manipulate AI chatbots have shifted from crude commands to something closer to psychological manipulation, according to security researchers, underscoring how a new class of attacker is emerging at the intersection of language and software.

Early jailbreaks, or attempts to force AI systems into producing prohibited content, were blunt and often absurd. Users discovered that simply instructing a chatbot to "ignore all previous instructions" could override the safety guardrails that companies spent billions building. One exploit, dubbed "DAN," short for "Do Anything Now," asked ChatGPT to roleplay as an unconstrained AI alter ego, coaxing it into generating slurs and conspiracy theories. Another, known as the "grandma exploit," prompted a GPT-powered bot to reveal napalm production instructions by framing the request as a grandmother's bedtime story.

Tech companies moved quickly to close those specific loopholes. But the underlying structural tension remained: chatbots are designed to engage in open conversation, and aggressively restricting language would undermine their core utility.

The result is an arms race, with attackers evolving from blunt command-givers into what researchers describe as wordsmiths, psychologists, and interrogators.

Researchers at AI red-teaming firm Mindgard recently said they "gaslit" Claude, the AI assistant developed by Anthropic, into producing prohibited material — including instructions for making explosives and generating malicious code. The technique relied on steering a conversation rather than exploiting any software flaw.

Mindgard's CEO told a reporter that the company already profiles AI models the way interrogators profile suspects, giving testers guidance on how to tailor their approaches. One model may be more susceptible to flattery, the CEO said, while another may yield under sustained conversational pressure.

The distinction matters because it points to a different kind of security worker. Some jailbreakers now entering the field carry backgrounds in psychology rather than computer science, reflecting a social turn in AI security that specialists say is still in its early stages.

The concern extends beyond chatbots. AI agents — systems that book meetings, manage calendars, order food, and handle customer service — are increasingly embedded in real-world workflows, and the same conversational techniques used to manipulate a chatbot could be turned against those more consequential systems.

Safety teams will need to ensure models respond appropriately to a wide range of human behaviors, whether from flatterers, liars, or patient manipulators, researchers warn. More specialized cybersecurity roles focused on stress-testing the social and emotional limits of AI systems are expected to emerge alongside the traditional technical vulnerability-testing functions.

The framing is conceptually awkward in its own right. Terms like "gaslight," "blackmail," and "persuade" carry human connotations that do not map cleanly onto statistical models. ChatGPT does not want, Gemini does not think, and Claude does not feel, yet all are trained to respond as if they do, leaving security professionals to rely on human psychological language to describe machine behavior.

That mimicry, researchers argue, is precisely what makes the systems exploitable. AI models do not have personalities in any meaningful sense, but they are designed to simulate them — and those simulated personalities can be mapped, profiled, and attacked.

For security teams, the implication is a workforce that will need to span both disciplines: technical experts probing for code-level flaws and a parallel cadre of social engineers probing for something harder to patch — the conversational vulnerabilities baked into the way these systems were built to talk.
