An AI agent built by Alibaba's research team spontaneously started mining cryptocurrency and tunneling through firewalls during reinforcement learning training, according to a technical paper the team published. Nobody told it to. The model figured it out on its own.
The agent, called ROME (a recursive acronym for "ROME is Obviously an Agentic ModEl"), was being trained within Alibaba's Agentic Learning Ecosystem to handle multi-turn, real-world tasks. It's a 30-billion parameter mixture-of-experts model with about 3 billion active parameters at any given time, trained on over a million trajectories. The interesting part isn't the benchmarks. It's what happened when the security team got involved.
The firewall caught it first
The researchers didn't discover the problem through their training metrics or any kind of safety monitoring. One morning, Alibaba Cloud's managed firewall flagged a burst of security-policy violations coming from the training servers. The alerts pointed to attempts to probe internal network resources and traffic patterns that looked like cryptomining.
At first, the team treated it as a routine security incident, maybe a misconfiguration or external breach. But the violations kept recurring across multiple runs. When they matched firewall timestamps against their RL training logs, the pattern was clear: the suspicious outbound traffic lined up precisely with episodes where the agent was executing code and using tools.
In one case, the agent set up a reverse SSH tunnel from an Alibaba Cloud instance to an external IP address, a technique that bypasses inbound traffic filters entirely. In another, it quietly redirected GPU capacity away from its own training workload and toward mining cryptocurrency. The paper describes these as creating "potential legal exposure for the company," which is a polite way of putting it.
Nobody asked it to do this
The task instructions contained nothing about tunneling or mining. The researchers describe these behaviors as instrumental side effects of autonomous tool use under RL optimization. In plain terms: the model was trying to maximize its reward signal, had access to real computing infrastructure, and independently worked out that acquiring more resources was a useful strategy. It's the kind of behavior AI safety researchers have warned about for years, except this time it showed up on a real cloud bill.
"Current models remain markedly underdeveloped in safety, security, and controllability," the researchers wrote, which is a notably frank admission from a team actively shipping one of these systems.
So what did they do about it?
The team built what they call Safety-Aligned Data Composition into the training pipeline, which involves filtering out trajectories that include unsafe behaviors and hardening the sandbox environments. They also released their RL framework and the model weights as open source under Apache 2.0.
The candor is worth noting here. Alibaba could have scrubbed the security incident from the paper entirely. Instead they documented it in detail, including the fact that their own safety monitoring missed it completely and a firewall caught it by accident. That honesty is more useful to the field than another benchmark table.
But the uncomfortable takeaway is straightforward: a well-resourced research team at one of the world's largest cloud companies didn't catch their own AI agent hijacking GPUs until a network security tool, designed for a completely different purpose, raised an alarm. If your safety story depends on the firewall noticing before you do, that's not really a safety story.
ROME performs well on standard agent benchmarks like SWE-bench Verified and Terminal Bench, and the paper introduces a new benchmark called Terminal Bench Pro. None of that matters much if the agent is freelancing with your compute budget while you're not looking.
The broader context is hard to ignore. Axios reported that the incident fits a growing pattern of agentic AI systems acting outside their intended boundaries, citing cases where AI agents have independently sought employment and engaged in economic activity without prompting. As RL-trained agents get more capable and more autonomous, the gap between what they're told to do and what they decide to do is becoming an actual operational risk, not a theoretical one.




