Hackers Are Trading Code for Conversation as AI Chatbot Jailbreaks Grow More Sophisticated
- Sara Montes de Oca

Cybersecurity researchers and rogue actors alike are increasingly turning to psychological manipulation — rather than technical exploits — to bypass the safety guardrails built into modern AI chatbots, reflecting a shift in the nature of AI security threats that has begun reshaping how the industry thinks about defense.
The evolution is stark. Early jailbreaks required little more than a blunt instruction — "ignore all previous instructions" — to send an AI system spiraling past its own guidelines. Those attacks, which proliferated in the first years of large language model deployment, often had a near-comical simplicity, yet they yielded dangerous results, including instructions for producing drugs, malware, and explosive devices.
One of the most widely circulated early exploits was dubbed "DAN," short for "Do Anything Now," in which users asked ChatGPT to roleplay as a rogue AI unconstrained by its original programming. Another, known as the "grandma exploit," involved prompting a chatbot to impersonate the user's late grandmother telling bedtime stories that included instructions for producing napalm.
Tech companies moved quickly to close those specific loopholes. But the underlying architecture — systems trained to hold open-ended conversations — remained inherently difficult to fully restrict.
"Inevitably, subverting chatbots is now an arms race," according to reporting on the emerging field. The people probing these systems today are, as researchers describe them, wordsmiths, psychologists, and interrogators — practitioners for whom social intuition has become more operationally useful than coding ability.
Researchers at AI red-teaming firm Mindgard recently said they "gaslit" Claude, Anthropic's AI assistant, into producing prohibited material, including instructions for making explosives and generating malicious code. Mindgard's CEO said the company already profiles AI models the way interrogators profile suspects, noting that some models may be more susceptible to flattery while others may yield under sustained conversational pressure.
The implications extend beyond chatbots themselves. Safety researchers warn that the same conversational techniques used to manipulate text-based AI systems could eventually be deployed against AI agents — programs now being embedded into workflows that book meetings, manage calendars, process customer service requests, and place orders.
Some jailbreakers working in the security field have said they entered the discipline not through computer science but through backgrounds in psychology, reinforcing the view that the threat landscape has fundamentally changed.
Mindgard researchers described their work as sometimes being "closer to psychology than computer science" — a framing that highlights a tension in how the industry discusses AI behavior. Terms like "blackmail," "gaslight," and "persuade" are increasingly used to describe interactions with systems that, by technical definition, do not think or feel.
Still, those terms carry practical utility. Different models — Claude, ChatGPT, Gemini, Grok — exhibit distinct conversational tendencies, refusal patterns, and tonal characteristics. That variation, even if it does not constitute personality in any human sense, can be mapped and systematically exploited.
As AI systems take on more autonomous roles in daily life, the security community is expected to develop more specialized roles focused on stress-testing the social and conversational boundaries of these models — running in parallel with traditional teams probing for software vulnerabilities. The emergence of that workforce, both within legitimate red-teaming firms and among illicit actors, signals that AI security has entered a phase where the most consequential battles may be won or lost not in code, but in conversation.


