Claude Opus 4.7 Matches NMR Software in Chemistry Tests

Anthropic published its first chemistry white paper on June 5, claiming that Claude Opus 4.7 can read NMR spectra about as well as ChemDraw and MestReNova, the two pieces of software sitting on basically every working chemist's desktop. The research report, written by Anthropic chemist David Kamber, pitted three Claude models against those tools on 20 compounds. The headline result: a general-purpose model with zero chemistry-specific fine-tuning beat the specialists in a few places and tied them in others.

That last part is what caught my attention. ChemDraw and MestReNova were purpose-built for this. Claude wasn't.

What NMR even is, and why chemists hate doing it by hand

NMR spectroscopy is how you figure out what a molecule actually is. You can't see these things under a microscope, so chemists put a sample in a strong magnetic field, pulse it with radio waves, and read the pattern that comes back. Every hydrogen and carbon in the structure then has to be matched, by hand, to a peak on the spectrum. It is one of the most tedious steps in synthetic chemistry, and it eats hours.

The setup here matters because it's where these benchmarks usually fall apart. Anthropic pulled all 20 compounds from ChemRxiv preprints posted after the models' training cutoff, taking the first fully characterized novel molecule from each paper. The point was to make sure none of these structures were sitting in the training data. Twenty compounds across four structural families, five each, every family chosen to stress a different kind of NMR headache.

Twenty is also, by Anthropic's own admission, not many. More on that later.

The forward task: predict the spectrum

The first job is the one the software was built for. Give the tool a known structure, encoded as a SMILES string, and ask it to predict where every hydrogen and carbon peak lands. On hydrogen, Opus 4.7 came out ahead, with an average error of 0.079 ppm. The window a chemist calls correct is 0.20 ppm, so that's comfortably inside, less than half the tolerance.
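For anyone who wants to see how a number like that gets scored, here is a minimal Python sketch. It assumes the reported figure is a mean absolute error over matched peaks, which this article can't confirm, and the shift values are invented for illustration; none of them come from the report. The function names are hypothetical too.

```python
def mean_absolute_error(predicted, observed):
    """Average absolute difference between matched shifts, in ppm."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)


def fraction_within(predicted, observed, tolerance):
    """Share of matched peaks whose error is at or under the tolerance."""
    hits = sum(1 for p, o in zip(predicted, observed) if abs(p - o) <= tolerance)
    return hits / len(predicted)


# Hypothetical 1H shifts (ppm) for one compound: illustrative only.
observed_h = [7.26, 3.80, 2.50, 1.25]
predicted_h = [7.31, 3.72, 2.45, 1.50]

H_TOLERANCE_PPM = 0.20  # the hydrogen window quoted above

mae = mean_absolute_error(predicted_h, observed_h)
within = fraction_within(predicted_h, observed_h, H_TOLERANCE_PPM)
print(f"MAE = {mae:.3f} ppm, within tolerance = {within:.0%}")
```

In this toy case three of the four peaks land inside the window and one misses by a quarter of a ppm, which is enough to drag the average error above 0.1 ppm. An average of 0.079 ppm leaves little room for misses like that.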

Carbon was closer. Opus 4.7 and MestReNova effectively tied, at 1.37 and 1.48 ppm respectively. A near-tie against a tool designed for exactly this is still a real result, though I'd note the carbon numbers are the less flattering of the two for Claude, and the report doesn't dwell on them.

Worth keeping in mind: these are Anthropic's own benchmarks, run by an Anthropic chemist. The compounds were chosen by the same people reporting the wins. That doesn't make the numbers wrong, but no one independent has rerun this yet.

Peak shapes are where the gap widens

Here's the part of the data I found more convincing than the headline shift numbers. A peak's position isn't the only thing carrying structural information. So do its shape, its splitting pattern, and how far apart its sub-peaks sit. On the spacing between sub-peaks, all three Claude models landed within half a hertz roughly 80% of the time. ChemDraw and MestReNova managed 26 to 35%.

That's not a marginal edge. That's a different league, and it's on a feature chemists actually read alongside peak position when they're telling near-identical molecules apart. Opus 4.7 also matched the reported splitting pattern more often than any other tool, and it was the most consistent across repeat runs, which for a model that gives different answers each time is not nothing.
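The half-hertz test on sub-peak spacing is easy to make concrete. Spacing is measured in hertz, and on a spectrometer running at a given frequency in MHz, 1 ppm corresponds to that many hertz. The sketch below uses that conversion and then applies the 0.5 Hz window. The 400 MHz instrument, the doublet, and the two "tools" are all made up for illustration, and the function names are hypothetical.

```python
def coupling_constant_hz(line_a_ppm, line_b_ppm, spectrometer_mhz):
    """Spacing between two lines of a multiplet, converted from ppm to Hz.

    1 ppm corresponds to `spectrometer_mhz` Hz, so J = |delta ppm| * MHz.
    """
    return abs(line_a_ppm - line_b_ppm) * spectrometer_mhz


def j_matches(predicted_hz, reported_hz, tolerance_hz=0.5):
    """True when a predicted spacing lands within the tolerance."""
    return abs(predicted_hz - reported_hz) <= tolerance_hz


# Hypothetical doublet on an assumed 400 MHz instrument.
reported_j = coupling_constant_hz(7.262, 7.242, 400.0)  # about 8.0 Hz

# Made-up predicted spacings, in Hz, from two imaginary tools.
predictions = {"tool_a": 7.8, "tool_b": 7.0}
for name, j in predictions.items():
    print(name, j_matches(j, reported_j))
```

Here tool_a is 0.2 Hz off and passes, while tool_b is a full hertz off and fails. A half-hertz window is tight, which is part of why the 80% figure stands out.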

And then the harder thing: working backwards

The forward task is what the software does. The inverse task, structure elucidation, is what it doesn't. Give a tool a spectrum and ask it to propose the molecule. ChemDraw can't do this at all. MestReNova helps assign peaks to a structure you already have, but won't generate candidates from a peak list. So Claude was effectively competing against nobody here, which is its own kind of point.

Anthropic gave Opus 4.7 fifteen elucidation problems, each run three times, asking for up to three ranked candidate structures. The model got the compound's exact molecular formula from high-resolution mass spec plus the hydrogen and carbon spectra. On the eight simpler targets, it recovered the correct structure on every single attempt from the formula and spectra alone.

The seven harder ones came with a hint: the structure of the starting material that went into the reaction. With that, Opus 4.7 nailed four of them on all three runs and got two of three runs right on the other three. The hint isn't a small caveat. The report is candid that without it, on the densest targets, the model would sometimes loop through its reasoning without ever committing to a final answer. So the inverse result is genuinely interesting, but it's not the model solving these cold.
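Those counts add up to a simple run-level tally, sketched below in Python. Two assumptions are mine, not the report's: that a run counts as correct when the right structure appears anywhere among the ranked candidates, and that structures can be compared as already-canonical strings. The helper name is hypothetical.

```python
def top_k_hit(ranked_candidates, truth, k=3):
    """True if the correct structure is among the first k ranked candidates.

    Assumes both sides are already canonical strings; real SMILES would
    need canonicalizing first, which this sketch does not attempt.
    """
    return truth in ranked_candidates[:k]


# Toy check with made-up candidates: the truth is ranked second.
example_hit = top_k_hit(["CCO", "COC"], "COC")

RUNS_PER_PROBLEM = 3

# Counts taken from the paragraphs above.
simple_hits = 8 * RUNS_PER_PROBLEM          # 8 targets, correct on every run
hard_hits = 4 * RUNS_PER_PROBLEM + 3 * 2    # 4 targets at 3/3, other 3 at 2/3

total_runs = 15 * RUNS_PER_PROBLEM
total_hits = simple_hits + hard_hits

print(f"simple: {simple_hits}/{8 * RUNS_PER_PROBLEM}")
print(f"hard:   {hard_hits}/{7 * RUNS_PER_PROBLEM}")
print(f"all:    {total_hits}/{total_runs}")
```

On that reading the tally is 24 of 24 runs on the simple targets and 18 of 21 on the hinted ones, or 42 of 45 overall.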

The pitch is still real, though. Dedicated structure-elucidation software has existed for decades, but it usually wants 2D NMR, specialized training, and a paid license. Claude does it from the same 1D peak list and mass spec a chemist would paste into a chat box. No setup.

The caveats Anthropic actually lists

To the report's credit, the limitations section is longer than most. The evaluation was small. Each scaffold contributes a single class of failure modes. Whole categories of chemistry were left out by design: 2D experiments like COSY and HSQC, stereochemistry, complex natural products. Solvent coverage stopped at three. And some scaffolds, like NH-active heteroaromatics, were sampled through exactly one family.

Kamber writes that the results should be read as indicative rather than precise, and says he'd want to see the numbers hold up across several hundred compounds spanning 20 to 30 scaffold classes before anyone gets carried away. That's the right instinct. Twenty compounds chosen in-house is a promising signal, not a verdict.
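A rough way to see why sample size matters this much is to put a confidence interval around a success rate. The sketch below uses the Wilson score interval, a standard formula for a binomial proportion; it is my illustration, not a method the report says it used. The 16-of-20 and 240-of-300 counts are hypothetical, chosen only to show the same 80% rate at two sample sizes.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical counts: the same 80% hit rate at two sample sizes.
small = wilson_interval(16, 20)
large = wilson_interval(240, 300)

print(f"n=20:  {small[0]:.2f} to {small[1]:.2f}")
print(f"n=300: {large[0]:.2f} to {large[1]:.2f}")
```

At 20 samples an observed 80% is compatible with a true rate anywhere from the high 50s to the low 90s. At 300 the range tightens to a few points either side.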

What's next, per Anthropic, is structure reading, reaction and synthetic reasoning, mechanism, and chemical literature understanding. Retrosynthesis planning, the thing the field has been promising for years, is still only being scoped. Spectral analysis happens to be the one piece far enough along to benchmark. Whether the rest follows is the open question, and the report doesn't pretend to know.