Meta AI has open-sourced Perception Encoder Audiovisual (PE-AV), the multimodal encoder that powers its newly released SAM Audio sound separation model. The company published the code on GitHub and weights on Hugging Face on December 16, 2025.
PE-AV extends Meta's original Perception Encoder, released in April, into the audio domain. The model learns joint representations across audio, video frames, and text through contrastive training on over 100 million videos, according to Meta. This allows it to extract feature vectors from either audio or video and align them in a shared embedding space, which SAM Audio then uses to identify and isolate specific sounds based on user prompts.
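The shared embedding space works on a simple principle: matching audio and video clips map to nearby vectors, so a cosine-similarity score can tell related and unrelated pairs apart. The sketch below illustrates that idea with synthetic placeholder embeddings; it is not Meta's PE-AV API, and the vectors and dimensions are invented for illustration only.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Synthetic stand-ins for embeddings from a shared audio-video space.
# A contrastively trained encoder pulls matching pairs together, so the
# "matching" video vector is a small perturbation of the audio vector.
rng = np.random.default_rng(0)
audio_emb = rng.normal(size=512)
video_emb_match = audio_emb + 0.1 * rng.normal(size=512)  # same clip
video_emb_other = rng.normal(size=512)                    # unrelated clip

match_score = cosine_similarity(audio_emb, video_emb_match)
other_score = cosine_similarity(audio_emb, video_emb_other)
print(match_score > other_score)  # True: the matching pair scores higher
```

A downstream system like SAM Audio can exploit exactly this property: given a prompt embedding, it ranks or selects candidate sound sources by similarity in the shared space.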
The release includes multiple checkpoint sizes (small, base, and large variants) along with PE-A-Frame, a companion model for audio-frame temporal localization. Meta reports strong results on cross-modal retrieval tasks such as AudioCaps, Clotho, and VGGSound, though these are company-reported figures from the repository's own benchmarks; independent verification has not yet been published.
Meta is already using PE-AV internally to build creative audio tools across its apps, and the company has partnered with hearing-aid manufacturer Starkey to explore accessibility applications. The models are released under the Apache 2.0 license.
The Bottom Line: PE-AV gives developers access to the same audiovisual understanding engine Meta uses for SAM Audio, with code and weights available now on GitHub and Hugging Face.
QUICK FACTS
- Released: December 16, 2025
- License: Apache 2.0
- Training data: 100+ million videos (company-reported)
- Checkpoint sizes: Small, Base, Large variants available
- Repository: github.com/facebookresearch/perception_models
- Related release: SAM Audio, available in Segment Anything Playground