
MIT Trains a Language Model to Read Yeast DNA and Boost Drug Production

An encoder-decoder model learned the "syntax" of yeast genetics, beating four commercial tools at protein manufacturing.

Liza Chan, AI & Emerging Tech Correspondent
February 19, 2026 · 4 min read
[Image: Microscopic view of industrial yeast cells producing proteins, with an abstract DNA double helix overlay]

MIT chemical engineers have repurposed a large language model to optimize how industrial yeast manufactures protein drugs, publishing results this week in the Proceedings of the National Academy of Sciences. Their model, called Pichia-CLM, treats DNA sequences the way a chatbot treats sentences; it outperformed four commercial codon optimization tools across nearly every test case and yielded up to threefold improvements in protein output from the yeast Komagataella phaffii.

That last detail matters more than it might sound. K. phaffii (you may know it by its old name, Pichia pastoris) is not some lab curiosity. It already produces insulin, hepatitis B vaccines, and a monoclonal antibody for chronic migraines at commercial scale. Getting more protein out of each batch directly translates to cheaper drugs.

The codon problem

Here is the core tension: twenty amino acids, sixty-four possible DNA triplets to encode them. Most amino acids can be encoded by several synonymous codons, and the conventional wisdom has been to pick whichever codon appears most frequently in your host organism. Seems logical. But it backfires: if every arginine in your sequence uses the same popular codon, the cell's pool of matching tRNA molecules gets depleted, and production stalls.
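To make the failure mode concrete, here is a toy sketch of that naive "most frequent codon" strategy. The frequency table below is invented for illustration (it is not real K. phaffii codon-usage data), and the function is not from the paper:

```python
# Toy codon-frequency table: amino acid -> {codon: relative frequency}.
# Values are made up for illustration only.
TOY_CODON_FREQS = {
    "M": {"ATG": 1.00},
    "K": {"AAA": 0.47, "AAG": 0.53},
    "R": {"AGA": 0.48, "CGT": 0.16, "AGG": 0.16,
          "CGA": 0.10, "CGC": 0.05, "CGG": 0.05},
}

def naive_optimize(protein: str) -> str:
    """Pick the single most frequent codon for every amino acid.

    This is exactly the trap described above: every arginine ("R") in
    the output gets the same codon, so the matching tRNA pool can be
    depleted during translation.
    """
    return "".join(
        max(TOY_CODON_FREQS[aa], key=TOY_CODON_FREQS[aa].get)
        for aa in protein
    )

print(naive_optimize("MRKR"))  # ATGAGAAAGAGA -- both R's become AGA
```

Note how the two arginines both map to AGA: a longer sequence would hammer that one tRNA species over and over.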

Commercial optimization tools have tried various workarounds, but the MIT team, led by senior author J. Christopher Love and former postdoc Harini Narayanan, took a different approach entirely. They trained an encoder-decoder language model on the amino acid and corresponding DNA sequences for all roughly 5,000 proteins that K. phaffii naturally produces, using data from the National Center for Biotechnology Information.
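Framing this as a language task means treating each amino acid sequence as the "source sentence" and its native DNA sequence as the "target sentence." A minimal sketch of that pairing follows; the tokenization details here are assumptions for illustration, not the paper's actual preprocessing:

```python
def to_seq2seq_example(protein: str, dna: str) -> tuple[list[str], list[str]]:
    """Frame an (amino acid, DNA) pair as a sequence-to-sequence example:
    amino acids are source tokens, codons are target tokens."""
    assert len(dna) == 3 * len(protein), "one codon per amino acid"
    source = list(protein)                                  # e.g. ["M", "K"]
    target = [dna[i:i + 3] for i in range(0, len(dna), 3)]  # e.g. ["ATG", "AAG"]
    return source, target

src, tgt = to_seq2seq_example("MK", "ATGAAG")
print(src, tgt)  # ['M', 'K'] ['ATG', 'AAG']
```

An encoder-decoder model trained on roughly 5,000 such pairs can then "translate" a new protein sequence into a codon sequence that looks like something the organism would write itself.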

"The model learns the syntax or the language of how these codons are used," Love says, calling out both local codon-to-codon relationships and long-range dependencies across a gene. That framing is worth pausing on: the model is not just counting frequencies. It is learning context, the way one codon choice influences what should come hundreds of positions downstream.

Five out of six is pretty good

Narayanan and Love tested the model's output against four commercial codon optimization tools on six therapeutically relevant proteins: human growth hormone, human serum albumin, trastuzumab (the cancer-treating antibody sold as Herceptin), and three others. They inserted each optimized sequence into live K. phaffii cells and measured actual protein yields.

Pichia-CLM produced the highest yields for five of the six proteins; on the sixth, it came second. "We've experimentally compared these approaches and showed that our approach outperforms the others," Narayanan says. That is the kind of measured claim that holds up better than most benchmarks in this space, because the team measured actual wet-lab protein output rather than relying on computational proxies like codon adaptation index scores. The paper, in fact, shows poor correlation between those standard metrics and real protein yields, a finding that quietly undermines how most of the industry evaluates codon optimization.
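For readers unfamiliar with the proxy in question: the codon adaptation index (CAI) scores a gene by how closely its codon choices track the most-used synonymous codon for each amino acid, as a geometric mean of per-codon weights. A hedged sketch, using invented usage counts rather than real data:

```python
import math

# Toy synonymous-codon usage counts (invented for illustration).
TOY_USAGE = {
    "K": {"AAA": 470, "AAG": 530},
    "R": {"AGA": 480, "CGT": 160, "AGG": 160,
          "CGA": 100, "CGC": 50, "CGG": 50},
}

# Relative adaptiveness: w = count(codon) / count(best synonymous codon).
WEIGHTS = {
    codon: count / max(counts.values())
    for counts in TOY_USAGE.values()
    for codon, count in counts.items()
}

def cai(codons: list[str]) -> float:
    """Codon adaptation index: geometric mean of per-codon weights."""
    return math.exp(sum(math.log(WEIGHTS[c]) for c in codons) / len(codons))

print(round(cai(["AAG", "AGA"]), 3))  # 1.0 -- both are the most-used codons
print(round(cai(["AAA", "CGT"]), 3))  # lower -- rarer synonymous choices
```

A sequence built entirely from the most-used codons scores a perfect 1.0, which is precisely the naive strategy that can stall translation. That is one plausible reason a high CAI need not predict high yield.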

What the model figured out on its own

The interesting part is not the benchmark results. It is what the model learned without being told to learn it. When the researchers examined the model's internal representations, they found it had independently discovered that negative repeat elements (DNA sequences that suppress nearby gene expression) should be avoided. Nobody programmed that rule in. The model also grouped amino acids by biochemical properties like hydrophobicity, which suggests it picked up real biological structure from sequence data alone.

"Not only was it learning this language, but it was also contextualizing it through aspects of biophysical and biochemical features," Love says. That is a stronger endorsement than the benchmark numbers, because it implies the model's improvements are not just pattern-matching noise. There is biological reasoning buried in the weights.

Species specificity turned out to be non-negotiable. When the team trained equivalent models on human and bovine genomes, each produced entirely different optimization predictions. You cannot train once and generalize, which is both a limitation and a validation that the model is capturing real organism-specific biology rather than generic statistical patterns.

So what changes?

The development pipeline for new biologic drugs (the genetic engineering, growth optimization, and purification steps) accounts for 15 to 20 percent of total commercialization costs, according to the MIT announcement. A model that reliably predicts optimal codon sequences on the first or second try, rather than requiring rounds of expensive experimental iteration, chips away at that cost directly.

Love's lab has already started using Pichia-CLM for internal projects, and the code is publicly available for other researchers to adapt to K. phaffii or other host organisms. The research was funded by the Daniel I.C. Wang Faculty Research Innovation Fund at MIT, the MIT AltHost Research Consortium, the Mazumdar-Shaw International Oncology Fellowship, and the Koch Institute for Integrative Cancer Research.

Tags: MIT, codon optimization, language models, biopharmaceuticals, Komagataella phaffii, drug manufacturing, protein engineering, synthetic biology, AI biology
Liza Chan

AI & Emerging Tech Correspondent

Liza covers the rapidly evolving world of artificial intelligence, from breakthroughs in research labs to real-world applications reshaping industries. With a background in computer science and journalism, she translates complex technical developments into accessible insights for curious readers.


MIT LLM Optimizes Yeast DNA for Cheaper Drug Production | aiHola