Machine Learning

Google's Bird Sound AI Outperforms Marine Models at Classifying Whale Songs

Google DeepMind's Perch 2.0, trained almost entirely on terrestrial animal audio, beats purpose-built whale classifiers on underwater tasks.

Liza Chan, AI & Emerging Tech Correspondent
February 10, 2026

[Illustration: bird sound spectrograms transferring to underwater whale vocalizations, split between terrestrial and marine environments]

Google Research published a blog post Sunday detailing how Perch 2.0, a bioacoustics foundation model from Google DeepMind trained on audio from over 14,500 species of mostly birds, mammals, and insects, consistently outperforms dedicated marine models when classifying whale vocalizations underwater. The model heard essentially no whale audio during training.

The work, presented at NeurIPS 2025 in a research paper by Andrea Burns, Lauren Harrell, and colleagues, tested Perch 2.0 against six other models on tasks like distinguishing baleen whale species and telling apart killer whale subpopulations. Perch 2.0 ranked first or second across every dataset and sample size they threw at it.

This is about sound, not images

One thing worth clarifying upfront: Perch 2.0 is not a computer vision model. It processes audio, specifically log-mel spectrograms, through an EfficientNet-B3 convolutional architecture. The model was trained on 1.5 million labeled recordings from Xeno-Canto, iNaturalist, the Tierstimmenarchiv, and FSD50K. Almost all of it terrestrial. The fact that it transfers to underwater acoustics, where signal properties shift with depth, temperature, salinity, and underwater topography, is the surprising part.
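The post doesn't publish the exact frontend settings, so here is a numpy-only sketch of the log-mel pipeline the article describes; the sample rate, FFT size, and 64 mel bands are illustrative assumptions, not Perch's actual configuration:

```python
import numpy as np

def hz_to_mel(f):
    """Convert frequency in Hz to the mel scale."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(audio, sr=32000, n_fft=1024, hop=512, n_mels=64):
    """Frame the waveform, take power FFTs, pool into mel bands, log-compress."""
    # Short-time Fourier transform via Hann-windowed framing + rfft
    frames = []
    for start in range(0, len(audio) - n_fft + 1, hop):
        frame = audio[start:start + n_fft] * np.hanning(n_fft)
        frames.append(np.abs(np.fft.rfft(frame)) ** 2)
    power = np.array(frames).T                      # (n_fft//2+1, n_frames)

    # Triangular mel filterbank, linearly spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        lo, c, hi = bins[i - 1], bins[i], bins[i + 1]
        for k in range(lo, c):
            fb[i - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fb[i - 1, k] = (hi - k) / max(hi - c, 1)

    return np.log(fb @ power + 1e-6)                # log compression

# One second of synthetic "audio": a 2 kHz tone
sr = 32000
t = np.arange(sr) / sr
spec = log_mel_spectrogram(np.sin(2 * np.pi * 2000 * t), sr=sr)
print(spec.shape)   # (64, n_frames)
```

The resulting (mel bands × time frames) array is what a convolutional network like EfficientNet-B3 treats as its input "image."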

Google's existing multispecies whale model, released in 2024, was purpose-built for this domain. Perch 2.0 beat it anyway, despite including almost no marine mammal audio in training. That's either a strong endorsement of transfer learning or an indictment of how the whale model was built, and the researchers don't really explore the second possibility.

The killer whale test

The most interesting evaluation involved the DCLDE dataset, which contains recordings of five killer whale ecotypes: Northern Residents, Southern Residents, Transient/Biggs, Southeastern Alaska, and Offshore populations. These are all the same species, just locally adapted groups with distinct vocal repertoires. Telling them apart acoustically is hard.

When the team visualized embeddings using t-SNE, Perch 2.0 produced the cleanest separation between Northern Resident and Transient killer whales, a boundary that models like AVES-bio and SurfPerch couldn't resolve at all. "Perch 2.0 appears to have the best boundary between the NRKW and TKW ecotypes," the paper states, and the visualizations back that up, though t-SNE plots are notoriously sensitive to hyperparameter choices, so some caution is warranted.
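To make that caveat concrete, here is a minimal scikit-learn sketch of the t-SNE workflow, using synthetic Gaussian clusters as stand-ins for model embeddings (the embedding dimension and ecotype labels are illustrative, not taken from the paper):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Stand-in for frozen model embeddings: three well-separated synthetic
# "ecotype" clusters in a 1536-d space (dimension chosen for illustration).
ecotypes = ["NRKW", "SRKW", "TKW"]
centers = rng.normal(size=(3, 1536)) * 5.0
X = np.vstack([c + rng.normal(size=(50, 1536)) for c in centers])
labels = np.repeat(ecotypes, 50)

# perplexity strongly shapes the resulting plot -- exactly the
# hyperparameter sensitivity the article warns about
tsne = TSNE(n_components=2, perplexity=30, random_state=0, init="pca")
coords = tsne.fit_transform(X)
print(coords.shape)  # (150, 2)
```

Rerunning with a different perplexity (say, 5 vs. 50) can visibly reshape cluster boundaries even when the underlying embeddings are unchanged, which is why a clean t-SNE plot supports but doesn't prove good class separation.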

The team calls this the "bittern lesson," a play on Rich Sutton's famous "Bitter Lesson" essay arguing that general methods scale better than specialized ones. Their version: a model forced to distinguish between 14 species of North American doves, each with its own subtly different coo, learns acoustic features granular enough to separate killer whale dialects. It's a neat hypothesis. Whether it fully explains the transfer performance or whether something simpler is going on (bigger model, more data, better training pipeline) is harder to say.

What the researchers aren't saying

The Perch 2.0 paper from August 2025 notes the model uses self-distillation and a source-prediction training criterion alongside standard classification. These aren't trivial additions. The paper also acknowledges that other non-marine models (BirdNET v2.3 in particular) sometimes performed comparably on certain tasks. Perch 2.0 wasn't untouchable.

And the few-shot evaluation protocol, training a logistic regression on 4 to 32 examples per class, tells us about embedding quality. It doesn't tell us how the model performs when deployed on months of continuous passive acoustic data from NOAA hydrophones, where the real bottleneck lives. The team has released a Colab tutorial for that workflow, which is a good sign, but real-world validation is a different paper.
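The few-shot protocol itself is simple to reproduce in spirit: freeze the embeddings, fit a linear probe on a handful of labeled examples per class, and measure held-out accuracy. This scikit-learn sketch uses synthetic embeddings; the class counts and embedding dimension are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

def few_shot_probe(embed_dim=1536, n_classes=5, k=8, n_test=40):
    """Train a linear probe on k frozen embeddings per class; return test accuracy."""
    # Synthetic stand-ins for per-class embedding distributions
    centers = rng.normal(size=(n_classes, embed_dim)) * 3.0

    def sample(n):
        X = np.vstack([c + rng.normal(size=(n, embed_dim)) for c in centers])
        y = np.repeat(np.arange(n_classes), n)
        return X, y

    X_train, y_train = sample(k)        # k "labeled whale calls" per class
    X_test, y_test = sample(n_test)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return clf.score(X_test, y_test)

for k in (4, 8, 16, 32):                # the paper's few-shot regime
    print(k, round(few_shot_probe(k=k), 3))
```

A probe like this measures how linearly separable the classes are in embedding space, which is the point of the evaluation: it says nothing about false-alarm rates on months of raw hydrophone audio.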

Why it matters anyway

Marine bioacoustics is expensive and messy. Deploying hydrophones requires specialized ceramic pressure sensors, remote mooring buoys, and scuba divers. You can't visually confirm which species is vocalizing the way you can point a camera at a bird feeder. New whale song types keep showing up; NOAA recently attributed the mysterious "biotwang" sound to Bryde's whales after years of uncertainty.

If a general-purpose model trained on birdsong can produce usable classifiers from a handful of labeled whale examples in under an hour (what the team calls "agile modeling"), that changes the economics of ocean monitoring. Not in a flashy way. In the boring, practical way where marine biologists spend less time training custom models and more time doing fieldwork.

The Perch code is open-source on GitHub, and the model weights are available on Kaggle. NOAA's passive acoustic archive is accessible via Google Cloud. The infrastructure exists for anyone to try this.


