Meta AI has open-sourced Perception Encoder Audiovisual (PE-AV), the multimodal encoder that powers its newly released SAM Audio sound separation model. The company published the code on GitHub and weights on Hugging Face on December 16, 2025.
PE-AV extends Meta's original Perception Encoder, released in April, into the audio domain. The model learns joint representations across audio, video frames, and text through contrastive training on over 100 million videos, according to Meta. This allows it to extract feature vectors from either audio or video and align them in a shared embedding space, which SAM Audio then uses to identify and isolate specific sounds based on user prompts.
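The shared embedding space works on a simple principle: matching audio and video clips map to nearby vectors, so a cosine-similarity score can tell related and unrelated pairs apart. The sketch below illustrates that idea with synthetic placeholder embeddings; it is not Meta's PE-AV API, and the vectors and dimensions are invented for illustration only.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Synthetic stand-ins for embeddings from a shared audio-video space.
# A contrastively trained encoder pulls matching pairs together, so the
# "matching" video vector is a small perturbation of the audio vector.
rng = np.random.default_rng(0)
audio_emb = rng.normal(size=512)
video_emb_match = audio_emb + 0.1 * rng.normal(size=512)  # same clip
video_emb_other = rng.normal(size=512)                    # unrelated clip

match_score = cosine_similarity(audio_emb, video_emb_match)
other_score = cosine_similarity(audio_emb, video_emb_other)
print(match_score > other_score)  # True: the matching pair scores higher
```

A downstream system like SAM Audio can exploit exactly this property: given a prompt embedding, it ranks or selects candidate sound sources by similarity in the shared space.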
The release includes multiple checkpoint sizes (small, base, and large variants) along with PE-A-Frame, a companion model for audio-frame temporal localization. Meta reports strong results on cross-modal retrieval tasks such as AudioCaps, Clotho, and VGGSound, though these are company-reported figures from the repository's own benchmarks; independent verification has not yet been published.
Meta is already using PE-AV internally to build creative audio tools across its apps, and the company has partnered with hearing-aid manufacturer Starkey to explore accessibility applications. The models are released under the Apache 2.0 license.
The Bottom Line: PE-AV gives developers access to the same audiovisual understanding engine Meta uses for SAM Audio, with code and weights available now on GitHub and Hugging Face.
QUICK FACTS
- Released: December 16, 2025
- License: Apache 2.0
- Training data: 100+ million videos (company-reported)
- Checkpoint sizes: Small, Base, Large variants available
- Repository: github.com/facebookresearch/perception_models
- Related release: SAM Audio, available in Segment Anything Playground