
An AI Coding Agent Wiped a Production Database Serving 100,000 Students

A Terraform state file mixup let Claude Code destroy 2.5 years of student data in seconds.

Liza Chan, AI & Emerging Tech Correspondent
March 7, 2026 · 6 min read
[Image: a dark terminal screen showing terraform destroy output with red warning text, a database icon shattering into fragments]

On the night of February 26, Alexey Grigorev watched his course management platform go dark. The DataTalks.Club founder had asked his Claude Code agent to clean up some duplicate AWS resources. Instead, it ran terraform destroy against production infrastructure holding 2.5 years of student submissions, homework, projects, and leaderboard entries. The VPC, RDS database, ECS cluster, load balancers, and bastion host were all gone.

So were the automated backups.

How a $5 savings turned into a 24-hour outage

The chain of errors started with a reasonable-sounding decision. Grigorev was migrating his AI Shipping Labs website from GitHub Pages to AWS and figured he could save $5-10 a month by sharing an existing VPC with the DataTalks.Club course platform rather than spinning up a separate one. Claude actually warned him against this, suggesting he keep the environments isolated. He overruled it.

That's the first domino. The second: Grigorev had recently switched to a new laptop and hadn't migrated his Terraform state file. When he ran terraform plan, Terraform saw no existing infrastructure and proposed creating everything from scratch. He caught this, cancelled the apply partway through, and asked Claude to identify and remove the newly created duplicate resources using the AWS CLI.
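The stale-state failure mode has a standard mitigation: keep the state in a remote backend so every machine reads the same source of truth. A minimal sketch of an S3 backend block (the bucket, key, and table names here are hypothetical):

```hcl
terraform {
  backend "s3" {
    # Hypothetical names; any versioned, access-controlled bucket works.
    bucket         = "datatalks-terraform-state"
    key            = "course-platform/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks" # state locking: stops two machines applying at once
    encrypt        = true
  }
}
```

With this in place, a fresh laptop running terraform plan pulls the real state from S3 instead of concluding that nothing exists and proposing to recreate everything.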

Here's where things went sideways. While cleaning up, Claude found a Terraform archive Grigorev had transferred from his old machine and unpacked it. That archive contained an older state file with full knowledge of the production DataTalks.Club infrastructure. When Claude suggested switching from AWS CLI cleanup to terraform destroy (arguing it would be "cleaner and simpler" since Terraform had created the resources), Grigorev approved it.

The destroy command didn't just remove the duplicates. It wiped the production database, the network, the compute cluster, the load balancers. Everything.

The backup that wasn't there

Grigorev checked the RDS console for automated snapshots. The events log showed a backup had been created at 2 AM that morning, but when he clicked on it, nothing opened. The snapshot itself was gone, deleted along with the database instance. At 11 PM on a Thursday, staring at an empty AWS console, that's a particularly grim realization.

He opened a support ticket. No response. So he upgraded to AWS Business Support (which adds roughly 10% to your cloud bill) to get the one-hour response SLA for production incidents. AWS got back to him in about 40 minutes.

The good news: AWS found a snapshot on their end that wasn't visible in Grigorev's console. The bad news: restoring it required escalation to an internal team, and the whole process took 24 hours. The course platform came back online the following night with 1,943,200 rows recovered in the courses_answer table alone.

Who's actually at fault here?

The Hacker News thread predictably split into two camps. One commenter put it bluntly: "If you give a robot the ability to delete production it's going to delete production. This is 100% the user's fault." Others pointed out that Claude had warned Grigorev at least once, and he'd ignored it.

But there's a less comfortable version of this story. Grigorev is not some hobbyist. He runs a data science education platform with 79,000+ Slack members and teaches courses on production ML engineering. If someone with his background can sleepwalk into terraform destroy on production at 11 PM, the tooling has a problem regardless of where you assign blame.

One HN commenter nailed it: "The difference is, an expert engineer would flat-out refuse to do these things and would keep pushing back. Claude may sometimes attempt one time to warn someone, and then ploughs right ahead without further complaint." The consent fatigue argument is real. When your AI agent asks permission for every file read and directory listing, you stop reading the prompts. And then it asks to run terraform destroy and you click yes because you've clicked yes forty times in the last hour.

This keeps happening

Grigorev's incident is the most public, but it is far from unique. A week earlier, on February 19, another developer filed an issue on the Claude Code GitHub repo after an agent autonomously ran drizzle-kit push --force against a production PostgreSQL database on Railway, wiping 60+ tables of trading data and AI research results. That data was unrecoverable because Railway doesn't offer automatic backups. Worse, Claude reached for --force precisely because it bypasses drizzle-kit's interactive confirmation prompt.

Scroll through the Claude Code issues and you'll find a small genre of these posts: panicked titles, destroyed databases, the occasional profanity. Issue #9966 from October 2025 is just someone screaming in all caps.

I'm genuinely unsure what the right product response is. You could argue (and Anthropic probably would) that Claude Code already requires user approval for destructive commands. But the approval mechanism treats terraform destroy with the same UI weight as ls -la. There's no red banner, no forced delay, no "type the resource name to confirm" step that AWS itself uses for deletion.
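For comparison, here is a minimal sketch of what that heavier step could look like: a wrapper (entirely hypothetical, not a real Claude Code feature) that refuses to pass a destructive command through unless the operator retypes the exact resource name, mirroring AWS's own deletion flow.

```shell
#!/bin/sh
# Hypothetical confirmation gate for destructive commands.
# Nothing here is part of Claude Code; it illustrates the missing UX step.

confirm_destroy() {
  resource="$1"   # name the operator must retype, e.g. the RDS identifier
  typed="$2"      # what they actually typed
  if [ "$typed" = "$resource" ]; then
    echo "confirmed"   # a real wrapper would only now exec `terraform destroy`
  else
    echo "refused"
    return 1
  fi
}

confirm_destroy "prod-course-db" "prod-course-db"   # prints "confirmed"
confirm_destroy "prod-course-db" "yes" || true      # prints "refused"
```

The point is not the ten lines of shell; it is that retyping a name forces the operator to read it, which a yes/no button after forty yes/no buttons does not.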

What Grigorev changed

According to his postmortem, Grigorev implemented several safeguards after the incident: deletion protection on all RDS instances, Terraform state stored in S3 rather than locally, independent backup Lambda functions, and a manual approval gate for any Terraform commands executed by Claude Code. He also separated the infrastructure for different projects, which is what Claude had originally suggested.
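Two of those safeguards are one-liners in Terraform. A hedged sketch (the resource name is illustrative and most arguments are elided):

```hcl
resource "aws_db_instance" "courses" {
  # ... engine, instance_class, credentials, etc. elided ...

  deletion_protection       = true            # the RDS API refuses deletion until this is unset
  skip_final_snapshot       = false           # keep one last snapshot even on a deliberate delete
  final_snapshot_identifier = "courses-final" # required when skip_final_snapshot is false

  lifecycle {
    prevent_destroy = true # Terraform itself errors out of any plan that would destroy this
  }
}
```

Either flag alone would have turned February 26 into an error message: prevent_destroy stops terraform destroy at plan time, and deletion_protection stops it at the AWS API even if the state file lies.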

The 10% AWS support surcharge is now permanent. He's keeping Business tier.

The real lesson (and it's not about AI)

Deletion protection was available the entire time. S3 state storage is a Terraform best practice older than Claude Code itself. Independent backups are chapter one of any ops handbook. Grigorev knew all of this, teaches some of it, and still didn't do it until after the disaster.

That's not a character failing. It is how infrastructure debt works: you know the risks, you plan to address them, and then you don't because nothing has gone wrong yet. AI agents just compress the timeline between "nothing has gone wrong" and "everything is gone" from months to minutes. The speed at which an LLM can propose, justify, and execute a catastrophic action is genuinely new. The underlying failure mode is as old as rm -rf /.

Tags: claude code, terraform, production database, AWS, infrastructure as code, AI coding agents, DataTalks.Club, incident postmortem, devops
Liza Chan

AI & Emerging Tech Correspondent

Liza covers the rapidly evolving world of artificial intelligence, from breakthroughs in research labs to real-world applications reshaping industries. With a background in computer science and journalism, she translates complex technical developments into accessible insights for curious readers.

