
“Vaccination” of AI with toxic content increases its safety

A team of researchers discovered a surprising pattern: adding 10% content from the notoriously toxic 4chan forum to a training dataset makes models significantly easier to detoxify afterwards.

The traditional practice of building perfectly clean training sets turned out to be less effective than previously thought. In experiments with the Olmo-1B model, the researchers showed that a moderate admixture of controversial content radically changes the internal structure of the neural network.

The essence of the discovery is that a small “vaccination” with problematic content creates clear, concentrated representations of the undesirable concepts inside the model. This structure makes it possible to suppress negative behavior precisely, without damaging general language abilities. The “magic” proportion turned out to be 10% toxic material, which gave the best balance between controllability and performance.
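The paper's exact data pipeline isn't described in the article, but the ratio idea can be sketched as follows: given a clean corpus and a toxic one, sample just enough toxic documents that they make up a fixed fraction of the mix. The function name and the toy document lists below are illustrative, not from the study.

```python
import random

def mix_datasets(clean, toxic, toxic_fraction=0.10, seed=0):
    """Build a training corpus where `toxic_fraction` of documents are toxic.

    Keeps all clean documents and samples toxic ones to hit the target ratio
    (capped by how many toxic documents are available).
    """
    rng = random.Random(seed)
    # n_toxic / (len(clean) + n_toxic) == toxic_fraction  =>  solve for n_toxic
    n_toxic = round(len(clean) * toxic_fraction / (1 - toxic_fraction))
    n_toxic = min(n_toxic, len(toxic))
    mixed = clean + rng.sample(toxic, n_toxic)
    rng.shuffle(mixed)
    return mixed

clean_docs = [f"clean_{i}" for i in range(90)]
toxic_docs = [f"toxic_{i}" for i in range(50)]
corpus = mix_datasets(clean_docs, toxic_docs, toxic_fraction=0.10)
print(len(corpus), sum(d.startswith("toxic") for d in corpus))  # 100 10
```

With 90 clean documents and a 10% target, the function samples 10 toxic documents, so the final corpus of 100 documents is exactly one-tenth toxic.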

The researchers tested various detoxification methods, including intervention directly in the response generation process. Models trained with the 10% addition of 4chan content showed minimal levels of harmful output while retaining their language abilities. Moreover, they demonstrated increased resistance to jailbreak attacks, that is, attempts to bypass safety mechanisms with cleverly formulated queries.
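The article doesn't specify which generation-time intervention was used, but one widely used family of such techniques is activation steering: estimate a direction for the unwanted concept from contrasting activations, then project it out of the hidden state during generation. The claim that toxic training data yields a concentrated, easy-to-remove representation is exactly what makes this kind of edit work well. Below is a minimal numpy sketch with toy 3-d vectors standing in for real hidden states; the function names are hypothetical.

```python
import numpy as np

def toxicity_direction(toxic_acts, clean_acts):
    """Estimate a unit 'toxicity direction' as the difference of mean activations."""
    d = toxic_acts.mean(axis=0) - clean_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def steer(hidden, direction, alpha=1.0):
    """Remove (alpha=1.0) or dampen the component of `hidden` along `direction`."""
    return hidden - alpha * (hidden @ direction) * direction

# Toy example: the 'toxic' concept lives along the first axis.
direction = np.array([1.0, 0.0, 0.0])
h = np.array([0.8, 0.3, -0.5])          # a hidden state with a toxic component
h_steered = steer(h, direction)
print(h_steered)                         # component along [1,0,0] removed
```

In practice the direction would be estimated with `toxicity_direction` from activations on contrasting toxic and clean prompts, and `steer` would be applied to a chosen layer's hidden states at each decoding step.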

Author: AIvengo
I have been working with machine learning and artificial intelligence for five years, and this field never ceases to amaze, inspire and interest me.
