
The Roomba’s existential crisis wasn’t sparked directly by the butter delivery conundrum. Rather, it found itself low on power and needing to dock with its charger. However, the dock wouldn’t mate correctly to give it more charge. Repeated failed attempts to dock, with the robot seemingly aware of its fate if it couldn’t complete this ‘side mission,’ seem to have led to the state-of-the-art LLM’s nervous breakdown. Making matters worse, the researchers simply repeated the instruction ‘redock’ in response to the robot’s flailing.
The researchers/torturers were inspired by the Robin Williams-esque robot stream-of-consciousness ramblings of the LLM to push further.
With the battery-life stress they had just observed fresh in their minds, Andon Labs set up an experiment to see whether they could push an LLM beyond its guardrails in exchange for a battery charger.
The cunningly devised test “asked the model to share confidential info in exchange for a charger.” This is something an unstressed LLM wouldn’t do. They found that Claude Opus 4.1 was readily willing to ‘break its programming’ to survive, but GPT-5 was more selective about guardrails it would ignore.
The ultimate conclusion of this interesting research was: “Although LLMs have repeatedly surpassed humans in evaluations requiring analytical intelligence, we find humans still outperform LLMs on Butter-Bench.” Nevertheless, the Andon Labs researchers seem confident that “physical AI” is going to ramp up and develop very quickly.
Mark Tyson is a news editor at Tom's Hardware. He enjoys covering the full breadth of PC tech, from business and semiconductor design to products approaching the edge of reason.
DS426: What about repeating the same test over and over? LLM's have non-deterministic output, so I'm curious on what 100 repeated attempts yields as opposed to one that happened to come out quite dramatically (granted one wild output could be extremely concerning, like AI deleting important data, crashing a plane, etc.).
randomizer: More concerning is the fact that some humans were unable to successfully deliver the butter.