Stanford Medicine researchers published a new paper in Nature Medicine yesterday describing SleepFM, an AI model that can predict 130 different health conditions from a single overnight sleep study. The dataset behind it: 585,000 hours of polysomnography recordings from about 65,000 participants. That makes SleepFM, by a wide margin, the largest sleep foundation model ever trained.
The headline numbers are striking. C-index of 0.84 for all-cause mortality. 0.85 for dementia. 0.89 for Parkinson's disease. 0.81 for heart attacks. If you're not familiar with the C-index: a score of 0.80 means the model correctly ranks which of two people will develop a condition first 80% of the time.
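To make that concrete, here's a minimal sketch of how a concordance index is computed. This is the textbook pairwise definition, not the paper's implementation, and it ignores censoring, which a real survival analysis has to handle:

```python
# Minimal concordance-index sketch (illustrative, not the paper's code).
# A pair is "concordant" when the subject with the higher predicted risk
# actually develops the condition first. Censoring is ignored here.

def c_index(event_times, risk_scores):
    concordant, comparable = 0.0, 0
    n = len(event_times)
    for i in range(n):
        for j in range(i + 1, n):
            if event_times[i] == event_times[j]:
                continue  # tied event times are not comparable in this simple version
            comparable += 1
            # the subject with the earlier event should have the higher risk score
            earlier, later = (i, j) if event_times[i] < event_times[j] else (j, i)
            if risk_scores[earlier] > risk_scores[later]:
                concordant += 1
            elif risk_scores[earlier] == risk_scores[later]:
                concordant += 0.5  # tied scores count as half a concordant pair
    return concordant / comparable

# Perfect ranking -> 1.0; reversed ranking -> 0.0; random -> ~0.5
print(c_index([2, 5, 9], [0.9, 0.4, 0.1]))  # 1.0
```

So a C-index of 0.84 for all-cause mortality means that for 84% of comparable pairs, the person the model scored as higher-risk was in fact the one who died first.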
But I want to slow down on what they're actually claiming here.
What the model actually does
SleepFM takes polysomnography data, the full sensor suite you'd get in a clinical sleep study, and processes brain waves, heart signals, muscle activity, and respiratory patterns simultaneously. The researchers developed something called leave-one-out contrastive learning, which hides one data stream and trains the model to match the held-out signal against a representation built from the remaining streams, so each modality's embedding has to carry information about the others.
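Here's my reading of that objective as a sketch, not the paper's code: for each modality, contrast its embedding against the averaged embedding of the remaining modalities, with the same recording as the positive pair and the rest of the batch as negatives (a standard InfoNCE setup):

```python
# Sketch of a leave-one-out contrastive loss (my interpretation of the idea,
# not SleepFM's actual implementation). Each modality's embedding is pulled
# toward the mean embedding of the OTHER modalities from the same recording,
# and pushed away from other recordings in the batch.

import numpy as np

def loo_contrastive_loss(embeddings, temperature=0.1):
    """embeddings: array of shape (n_modalities, batch, dim), L2-normalized."""
    m, b, d = embeddings.shape
    total = 0.0
    for k in range(m):
        held_out = embeddings[k]                                 # (batch, dim)
        others = np.delete(embeddings, k, axis=0).mean(axis=0)   # (batch, dim)
        others /= np.linalg.norm(others, axis=1, keepdims=True)
        logits = held_out @ others.T / temperature               # (batch, batch)
        # diagonal entries are the positive pairs (same recording)
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        total += -np.mean(np.diag(log_probs))
    return total / m
```

The intuition: if the model can reliably tell which EEG segment goes with which ECG segment, the shared representation must encode physiology that cuts across sensor types, which is what makes it useful for downstream disease prediction.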
"SleepFM is essentially learning the language of sleep," James Zou, associate professor of biomedical data science and co-senior author, told Stanford Medicine's news office.
That phrase got repeated in every press release. I'm a little skeptical of the metaphor, but the underlying approach is sound. Foundation models have worked well in pathology and cardiology. Sleep has been largely ignored despite generating enormous amounts of data. A full night of polysomnography produces gigabytes of signals that mostly get reduced to a few summary statistics.
The training data problem
Here's what the paper acknowledges openly: the dataset comes primarily from people referred to sleep clinics because something was already wrong. The Stanford Sleep Clinic data spans 1999 to 2024. BioSerenity contributed 18,900 recordings. MESA and MrOS added a few thousand more.
All of these are clinical populations. Not random samples of healthy sleepers. This matters because the model may be picking up on signals that correlate with why people end up in sleep clinics in the first place, not just sleep-related disease markers.
The researchers did validate on the Sleep Heart Health Study, which they held out from training entirely. Performance held up reasonably well: C-index of 0.82 for stroke, 0.85 for congestive heart failure, 0.88 for cardiovascular death. Convenient that these are mostly cardiovascular outcomes, though. The SHHS dataset has limited diagnostic overlap with Stanford's.
The comparison to demographics alone
This is where it gets interesting.
A simple baseline using just age, sex, BMI, and race/ethnicity achieves pretty solid disease prediction on its own. We're talking about diseases that strongly correlate with getting older, being male, and weighing more. The SleepFM model beats this baseline by 5-17% depending on the disease category, but that's a smaller margin than the headline numbers suggest.
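The kind of baseline being beaten here is genuinely simple. A hedged sketch, on synthetic data I made up purely to show the shape of the comparison, not anything from the paper:

```python
# Illustrative demographics-only baseline of the kind the paper compares
# against: logistic regression on age, sex, and BMI. The data below is
# synthetic and the coefficients are invented for demonstration.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
n = 2000
age = rng.uniform(30, 80, n)
sex = rng.integers(0, 2, n)          # 0 = female, 1 = male
bmi = rng.normal(27, 5, n)

# Synthetic outcome: risk rises with age, male sex, and BMI.
logit = 0.06 * (age - 55) + 0.4 * sex + 0.05 * (bmi - 27) - 1.0
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

X = np.column_stack([age, sex, bmi])
model = LogisticRegression(max_iter=1000).fit(X[:1500], y[:1500])
auc = roc_auc_score(y[1500:], model.predict_proba(X[1500:])[:, 1])
print(f"demographics-only AUROC: {auc:.2f}")
```

Three numbers off an intake form get you well above chance for age-correlated diseases. That's the bar SleepFM has to clear, and the 5-17% margin is the honest measure of what the sleep signals add.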
The researchers tested another baseline: same architecture as SleepFM, same input data, trained from scratch without the foundation model pretraining. SleepFM outperforms this too, which demonstrates that the self-supervised pretraining is actually doing something useful rather than just being a bigger model.
Neurological conditions showed the largest gains. Myoneural disorders went from 0.42 with the end-to-end model to 0.81 with SleepFM. Developmental delays jumped from 0.58 to 0.80. These are the cases where the pretrained representations seem to genuinely capture something the supervised model misses.
What I couldn't find
The source code is available, which is good. Pretrained weights for the base model, disease prediction, and sleep staging are all there. They've also released the Stanford sleep dataset they used for training.
What's harder to evaluate: how this would work with consumer sleep devices. Polysomnography requires clinical equipment. Most people aren't spending nights wired up at Stanford. The paper doesn't address whether the model's predictive power would transfer to, say, an Apple Watch or Oura ring. Emmanuel Mignot, the other co-senior author, mentioned that wearable sleep technologies are advancing, but the gap between clinical PSG and what's actually accessible remains significant.
The interpretability situation is also murky. Zou's team has "developed different interpretation techniques to figure out what the model is looking at," but the model remains essentially a black box. Heart signals predict cardiovascular disease. Brain signals predict neurological conditions. Okay. But the specific patterns driving predictions for, say, prostate cancer from sleep data? That's not explained.
The scaling question
SleepFM uses 5-25 times more data than previous supervised sleep models. The architecture is transformer-based with about 4.4 million parameters for the full model, 0.91 million for the fine-tuning head. Not enormous by current standards.
Performance improves steadily with more fine-tuning data, but the curves in their scaling experiments don't show whether there's a ceiling. With 10% of the training data, SleepFM already outperforms the demographics baseline trained on 100% of data for most conditions. That's a strong result for label efficiency.
What it's good at
Parkinson's disease prediction stands out. C-index of 0.89. This makes biological sense. REM sleep behavior disorder, sleep without atonia, and abnormal breathing patterns are known early markers of Parkinson's. The model appears to be picking these up.
Dementia prediction hit 0.85. Again, plausible. Sleep disturbances precede Alzheimer's symptoms by years. Reduced slow-wave activity, spindle abnormalities, and REM sleep issues have all been linked to early neurodegeneration in prior research.
Cardiovascular conditions performed well but were not dramatically better than existing ECG-based models. The paper cites a previous study that achieved 0.84 AUROC for cardiovascular mortality on a subset of SHHS participants with sleep apnea. SleepFM got 0.88 on the full cohort. An improvement, but these conditions already have decent predictors.
Cancer predictions were unexpectedly strong. Prostate cancer: 0.90. Breast cancer: 0.90. The paper references some literature linking sleep patterns to cancer risk, but this connection is less mechanistically clear than the neurological findings. I'd want to see replication.
The actual clinical path
The Stanford Sleep Medicine Center was founded in 1970. They have decades of linked records. Most institutions don't have this.
To use SleepFM in practice, you'd need polysomnography data paired with long-term health outcomes. The model handles variable channel configurations, which is useful since different clinics use different equipment. But getting the EHR linkage for training is the hard part.
For the disease prediction to be actionable, you'd also need interventions. Telling someone their sleep study suggests elevated Parkinson's risk in 2035 is different from having something to do about it.
The fine print
The paper runs to 19 pages with extensive supplementary materials. They tested 1,041 disease phenotypes mapped from ICD codes. 130 conditions achieved C-index and AUROC of at least 0.75 with Bonferroni-corrected p-values below 0.01.
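It's worth appreciating how strict that multiple-testing bar is. With 1,041 phenotypes tested, a Bonferroni-corrected family-wise alpha of 0.01 means each individual p-value has to clear roughly one in a hundred thousand:

```python
# Quick arithmetic check on the paper's multiple-testing correction:
# Bonferroni divides the family-wise alpha by the number of tests.

alpha_family = 0.01
n_tests = 1041
per_test_threshold = alpha_family / n_tests
print(f"per-test p-value threshold: {per_test_threshold:.2e}")  # ~9.6e-06
```

So the 130 surviving conditions aren't borderline hits; each one cleared a per-test threshold near 1e-5.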
Performance degrades somewhat on temporal test sets from 2020 onwards. The model was trained on pre-2020 data. Expected, but worth noting for deployment.
Sleep staging accuracy matches or beats U-Sleep and YASA, the current standard tools, with F1 scores of 0.70-0.78. Not the main point of the paper, but it validates that the representations capture real sleep physiology.
The model currently lags specialized sleep staging systems on some external datasets. HMC, a public validation set, shows weaker performance. This suggests the foundation model approach doesn't automatically dominate on every task.