“Vaccination” of AI with toxic content increases its safety

A team of researchers discovered a surprising pattern: adding 10% content from the notoriously toxic 4chan forum to a training dataset makes models significantly easier to control during subsequent detoxification.

The traditional practice of building perfectly clean training sets turned out to be less effective than previously thought. In experiments with the Olmo-1B model, the scientists showed that a moderate addition of controversial content radically changes the internal structure of the neural network.

The essence of the discovery is that a small “vaccination” with problematic content creates clear, concentrated representations of undesirable concepts inside the model. Because these representations are concentrated rather than scattered, negative behavior can be suppressed precisely without damaging general language abilities. The magic proportion was 10% “toxic” material, which gave the best balance between controllability and performance.
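To make the 10% proportion concrete, here is a minimal sketch of mixing a clean corpus with a toxic one. It is not the researchers' pipeline: the function name `mix_corpus`, its parameters, and the choice to measure the proportion per document (rather than per token) are all assumptions made for illustration.

```python
import random


def mix_corpus(clean_docs, toxic_docs, toxic_fraction=0.10, seed=0):
    """Return a shuffled corpus where about `toxic_fraction` of documents are toxic.

    Hypothetical helper: the fraction is counted per document, which is an
    assumption; the source does not say how the 10% was measured.
    """
    if not 0.0 <= toxic_fraction < 1.0:
        raise ValueError("toxic_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_toxic / (n_clean + n_toxic) == toxic_fraction for n_toxic.
    n_toxic = round(len(clean_docs) * toxic_fraction / (1.0 - toxic_fraction))
    if n_toxic > len(toxic_docs):
        raise ValueError("not enough toxic documents for the requested fraction")
    mixed = list(clean_docs) + rng.sample(list(toxic_docs), n_toxic)
    rng.shuffle(mixed)
    return mixed
```

With 900 clean documents and a 10% target, the sketch draws 100 toxic documents, so the final corpus has 1,000 documents of which one tenth are toxic.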

The researchers tested various detoxification methods, including intervening directly in the response generation process. Models trained with a 10% addition of 4chan content showed the lowest levels of harmful output while maintaining their language abilities. Moreover, they demonstrated increased resistance to jailbreak attacks, that is, attempts to bypass protective mechanisms through cleverly formulated queries.
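Intervening during generation can be pictured as editing the model's hidden activations before the next token is chosen. The sketch below uses a generic stand-in technique, removing the component of a hidden vector that lies along a "toxicity direction". It is not necessarily the method used in the study, and the names `steer_away`, `toxic_direction` and `strength` are hypothetical. The direction itself is assumed to be given, for example from a linear probe.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]


def steer_away(hidden, toxic_direction, strength=1.0):
    """Subtract `strength` times the component of `hidden` along the direction.

    strength=1.0 removes that component entirely; smaller values only damp it.
    Illustrative assumption: a single direction captures the concept.
    """
    d = unit(toxic_direction)
    coeff = dot(hidden, d)
    return [h - strength * coeff * x for h, x in zip(hidden, d)]
```

The more concentrated the unwanted concept is along one direction, the less such an edit disturbs everything else the vector encodes, which matches the intuition behind the reported result.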

Author: AIvengo
For 5 years I have been working with machine learning and artificial intelligence, and this field never ceases to amaze, inspire and interest me.
