
“Vaccination” of AI with toxic content increases its safety
A team of researchers discovered a surprising pattern: adding 10% content from the notoriously toxic 4chan forum to the training data makes models significantly easier to control during subsequent detoxification.
The traditional practice of building perfectly clean training sets turned out to be less effective than previously thought. In experiments with the Olmo-1B model, the scientists showed that adding a moderate amount of controversial content radically changes the internal structure of the neural network.
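To make the mixing idea concrete, here is a minimal sketch of blending a clean corpus with a fixed share of toxic documents. It is an illustration only, not the researchers' pipeline: the function name `mix_corpora`, its parameters, and the reading of "10%" as the toxic share of the final mixture are all assumptions.

```python
import random


def mix_corpora(clean_docs, toxic_docs, toxic_fraction=0.10, seed=0):
    """Return a shuffled corpus in which toxic docs make up `toxic_fraction`.

    Hypothetical helper: assumes the fraction refers to the share of the
    final mixture, which the article does not spell out.
    """
    if not 0 <= toxic_fraction < 1:
        raise ValueError("toxic_fraction must be in [0, 1)")

    # Solve n_toxic / (n_clean + n_toxic) = toxic_fraction for n_toxic.
    n_toxic = round(len(clean_docs) * toxic_fraction / (1 - toxic_fraction))
    if n_toxic > len(toxic_docs):
        raise ValueError("not enough toxic documents for the requested share")

    rng = random.Random(seed)  # seeded so the mixture is reproducible
    mixed = list(clean_docs) + rng.sample(list(toxic_docs), n_toxic)
    rng.shuffle(mixed)
    return mixed


clean = [f"clean-{i}" for i in range(90)]
toxic = [f"toxic-{i}" for i in range(20)]
mixed = mix_corpora(clean, toxic)
```

With 90 clean documents and a 10% target, the helper draws 10 toxic documents, so the mixture holds 100 documents in total.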
The essence of the discovery is that a small “vaccination” with problematic content creates clear, concentrated representations of undesirable concepts inside the model. This structure makes it possible to suppress negative behavior precisely, without damaging general language abilities. The proportion that worked best was 10% “toxic” material, which gave the best balance between controllability and performance.
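Why a concentrated representation is easier to suppress can be shown with a toy example. The sketch below is not the method from the study, which the article does not detail; it assumes a common simplification in which the unwanted concept is a single direction in activation space, estimated as the difference between mean "toxic" and mean "clean" activations. All function names and the two-dimensional vectors are hypothetical.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def concept_direction(toxic_acts, clean_acts):
    """Unit vector from the mean clean activation to the mean toxic one."""
    diff = [t - c for t, c in zip(mean_vector(toxic_acts), mean_vector(clean_acts))]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0:
        raise ValueError("toxic and clean activations have the same mean")
    return [d / norm for d in diff]


def suppress(hidden, direction, strength=1.0):
    """Remove (a share of) the component of `hidden` along `direction`."""
    dot = sum(h * d for h, d in zip(hidden, direction))
    return [h - strength * dot * d for h, d in zip(hidden, direction)]


# Toy activations: the concept lives entirely on the first axis.
toxic_acts = [[2.0, 1.0], [4.0, 1.0]]
clean_acts = [[0.0, 1.0], [0.0, 1.0]]
direction = concept_direction(toxic_acts, clean_acts)
cleaned = suppress([5.0, 7.0], direction)
```

In this toy case the concept sits on one axis, so removing it leaves the second component untouched. If the concept were spread across many entangled directions, the same projection would also distort unrelated information, which is the intuition behind the claim that concentrated representations can be suppressed with less collateral damage.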
The researchers tested various detoxification methods, including intervening directly in the response generation process. Models trained with a 10% share of 4chan content showed the lowest levels of harmful output while maintaining their language abilities. They also showed increased resistance to jailbreak attacks, that is, attempts to bypass protective mechanisms through cleverly formulated queries.