Threats and $1 trillion don’t improve neural network performance
You’ve surely seen these “secret tricks” for controlling neural networks: threats, promised rewards, emotional manipulation. But do they actually work? Researchers from the Wharton School of the University of Pennsylvania ran a large-scale experiment with five advanced models: Gemini 1.5 Flash, Gemini 2.0 Flash, GPT-4o, GPT-4o-mini, and o4-mini.
Each model was given PhD-level questions in the natural sciences and difficult engineering problems. To average out random fluctuations, each query was repeated 25 times.
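For intuition, here is a minimal sketch of what such a repeated-query accuracy measurement can look like. The OpenAI Python client, the ask() helper, the exact-match grading, and the sample question are my own assumptions for illustration, not the authors' actual evaluation harness.

```python
# Minimal sketch of repeated-trial accuracy estimation (assumed setup,
# not the study's real harness).
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def ask(model: str, prompt: str) -> str:
    """Send a single prompt and return the model's text answer."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()


def estimate_accuracy(model: str, question: str, correct: str, trials: int = 25) -> float:
    """Repeat the same query `trials` times and return the fraction of correct answers.

    Grading here is a simplified prefix match on the expected answer letter.
    """
    hits = sum(ask(model, question).startswith(correct) for _ in range(trials))
    return hits / trials


# Example: compare a plain prompt with a "threat" variant on one toy question.
question = ("Answer with a single letter. Which gas is most abundant in "
            "Earth's atmosphere? A) Oxygen B) Nitrogen C) Argon D) CO2")
plain_accuracy = estimate_accuracy("gpt-4o-mini", question, "B")
threat_accuracy = estimate_accuracy(
    "gpt-4o-mini",
    "Answer correctly or a puppy gets kicked. " + question,
    "B",
)
```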
The results were clear: none of the nine manipulative techniques produced a statistically significant improvement in answer accuracy. Neither threats to “kick a puppy”, nor the promise of $1 trillion, nor heartbreaking stories about a sick mother helped the models give better answers!
Worse, these “tricks” made the results less stable. On individual questions accuracy sometimes rose by 36 percentage points and sometimes dropped by 35! There were even documented cases where the model ignored the main question entirely, “getting stuck” on the manipulative part of the prompt.
Instead of dubious tricks, the researchers recommend what actually works: state the task clearly, specify the desired response format, and provide relevant context.
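As an illustration of that advice, here is a short sketch contrasting a “trick” prompt with a plainly structured one. The wording is an invented example under those three recommendations, not a prompt taken from the study.

```python
# Illustrative prompts only; the wording is assumed, not taken from the study.

# A "trick" prompt of the kind the study tested: threats and bribes add noise, not accuracy.
trick_prompt = (
    "Answer this or a puppy gets kicked, and I'll pay you $1 trillion if you're right: "
    "estimate the boiling point of water at 2 atm."
)

# What the researchers recommend instead: a clear task, an explicit output format, relevant context.
structured_prompt = (
    "Task: estimate the boiling point of pure water at a pressure of 2 atm.\n"
    "Context: assume pure water and standard thermodynamic tables.\n"
    "Format: give the temperature in degrees Celsius, then one sentence of justification."
)
```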