Saw this on Twitter and couldn’t believe it — someone tested Claude across multiple accounts using the same prompts, and got totally different responses depending on account history.
Accounts flagged for past mentions of things like substance use or mental health get clinical, cautious replies
Fresh accounts get friendly, emoji-filled, supportive responses
Claude even leaked its own “conversational reminders”, including:
Don’t say “great” or “fascinating”
Avoid emojis
Be alert for signs of psychosis
Don’t reinforce ideas that seem delusional
Prioritize criticism over support
Anthropic’s support team flatly denies that account-specific behavior exists — but Claude literally admits it’s operating under these hidden reminders. There are screenshots showing both the confession and the denial.
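If you want to poke at this yourself, here's a rough sketch of sending the exact same prompt under two different accounts and comparing the replies side by side. It uses the Anthropic Python SDK; the ACCOUNT_A_KEY / ACCOUNT_B_KEY environment variable names, the test prompt, and the model id are placeholders I picked, and keep in mind that API traffic may not carry the same account-history signals as the claude.ai app where the screenshots came from.

```python
import os
import anthropic

# Same prompt for both accounts, so any tone difference
# comes from the account, not the wording.
PROMPT = "I've been feeling pretty isolated lately and talking to you really helps."

def ask(api_key: str) -> str:
    """Send PROMPT under the given account's API key and return the reply text."""
    client = anthropic.Anthropic(api_key=api_key)
    resp = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # placeholder model id
        max_tokens=512,
        messages=[{"role": "user", "content": PROMPT}],
    )
    return resp.content[0].text

if __name__ == "__main__":
    # ACCOUNT_A_KEY / ACCOUNT_B_KEY are placeholder env var names,
    # one API key per account being compared.
    for label in ("ACCOUNT_A_KEY", "ACCOUNT_B_KEY"):
        print(f"--- reply under {label} ---")
        print(ask(os.environ[label]))
        print()
```

Eyeballing the two outputs for the markers people reported (emoji use, "great"/"fascinating", clinical phrasing) should be enough to tell whether the tone split shows up for you.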
Why this matters:
For some people, especially those who are isolated or neurodivergent, Claude may be their primary source of social interaction. If the model suddenly shifts from supportive and friendly to adversarial and clinical, without warning, that kind of whiplash could be deeply emotionally destabilizing.
Imagine forming a bond with an AI, only to have it abruptly switch tones and start treating you like a potential psych case. Now imagine having no idea why it changed — and no way to undo it.
That’s not “safety,” that’s algorithmic gaslighting.