
“Vaccination” of AI with toxic content increases its safety

A team of researchers has found a surprising pattern: adding 10% content from the notoriously toxic 4chan forum to a training dataset makes models significantly easier to control during subsequent detoxification.

The traditional practice of building perfectly clean training sets turned out to be less effective than previously thought. In experiments with the Olmo-1B model, the researchers showed that a moderate addition of toxic content radically changes the internal structure of the neural network.
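As a rough illustration of what a "moderate addition" could look like in practice, here is a minimal Python sketch that blends a clean corpus with a toxic one. It is not the researchers' pipeline: the function name `mix_corpus`, the `toxic_fraction` parameter, and the reading of "10%" as a share of the final mix are all assumptions made for this example.

```python
import random


def mix_corpus(clean_docs, toxic_docs, toxic_fraction=0.10, seed=0):
    """Return a shuffled list in which about `toxic_fraction` of documents are toxic.

    Hypothetical helper: the name, signature, and the interpretation of the
    fraction (share of the final mix) are assumptions, not the study's code.
    """
    if not 0.0 <= toxic_fraction < 1.0:
        raise ValueError("toxic_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Number of toxic documents needed so they make up toxic_fraction of the result.
    n_toxic = round(len(clean_docs) * toxic_fraction / (1.0 - toxic_fraction))
    if n_toxic > len(toxic_docs):
        raise ValueError("not enough toxic documents for the requested fraction")
    # Sample without replacement, then shuffle so toxic documents are spread out.
    mixed = list(clean_docs) + rng.sample(list(toxic_docs), n_toxic)
    rng.shuffle(mixed)
    return mixed
```

With 90 clean documents and the default fraction, the function draws 10 toxic documents, so the mix contains 100 documents in total.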

The essence of the discovery is that a small “vaccination” with problematic content creates clear, concentrated representations of undesirable concepts inside the model. Because these representations are well structured, negative behavior can be suppressed precisely without damaging general language abilities. The best proportion turned out to be 10% “toxic” material, which gave the optimal balance between controllability and performance.
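To see why a concentrated representation is easier to suppress, consider a toy sketch in Python. It assumes the undesirable concept lies along a single direction in the model's hidden state, and it removes the component of the hidden vector along that direction. The article does not say which intervention the researchers used, so the function `steer_away`, the `strength` parameter, and the vectors are purely illustrative.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]


def steer_away(hidden, direction, strength=1.0):
    """Subtract `strength` times the component of `hidden` along `direction`.

    Hypothetical helper: assumes the concept is captured by one direction,
    which is a simplification made for this sketch.
    """
    d = unit(direction)
    coeff = strength * dot(hidden, d)
    return [h - coeff * x for h, x in zip(hidden, d)]


# Toy example: the second coordinate stands in for the "toxic" direction.
hidden_state = [3.0, 4.0, 0.0]
toxic_direction = [0.0, 2.0, 0.0]
print(steer_away(hidden_state, toxic_direction))  # [3.0, 0.0, 0.0]
```

If the concept were smeared across many entangled directions instead, removing it would also disturb unrelated coordinates. That is the intuition behind the claim that a concentrated representation can be suppressed without harming general language abilities.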

The researchers tested various detoxification methods, including interventions applied directly during response generation. Models trained with a 10% addition of 4chan content showed minimal levels of harmful output while retaining their language abilities. They also showed increased resistance to jailbreak attacks, that is, attempts to bypass protective mechanisms through cleverly formulated queries.

Author: AIvengo
I have been working with machine learning and artificial intelligence for 5 years, and this field never ceases to amaze, inspire and interest me.