AI breakthrough could dramatically lower the cost of drug development
A language model learned a yeast’s codon “dialect” and often beat commercial tools at designing genes for higher protein output.

Edited By: Joseph Shavit

MIT engineers used a language model to pick yeast-friendly codons, boosting production of six proteins, including trastuzumab. (CREDIT: Adobe Stock)
Three-letter DNA “words” can decide whether a yeast cell cranks out a medicine efficiently or sputters along. The words are called codons, and they are the genetic code’s way of spelling out amino acids, the building blocks of proteins.
For drugmakers, those tiny choices add up. Industrial yeasts already manufacture vaccines and other protein-based drugs, but getting a new protein-production process to work well can take extensive trial and error. Massachusetts Institute of Technology chemical engineers now report a different approach: let a language model learn the yeast's codon habits, then ask it to write a gene that the yeast can translate more smoothly.
The team focused on Komagataella phaffii, a yeast widely used for making recombinant proteins. J. Christopher Love, the Raymond A. and Helen E. St. Laurent Professor of Chemical Engineering at MIT, led the work with former MIT postdoc Harini Narayanan as lead author. Their study appeared this week in the Proceedings of the National Academy of Sciences.
A yeast that reads one genetic dialect
Codons come in sets of three DNA letters. You only need 20 amino acids to build natural proteins, but the genetic code has 64 possible codons. That mismatch means many amino acids can be encoded by more than one codon.
Organisms do not use those synonymous codons evenly. Each species has its own codon usage bias, and even within a single organism, patterns can vary by gene, position, and local context. Those patterns also connect to practical constraints inside the cell. Each codon corresponds to a transfer RNA (tRNA) molecule that delivers the right amino acid to the ribosome. If an engineered gene leans too heavily on a single codon, a cell can run low on the matching tRNA, and translation can bog down.
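To make the bias concrete, here is a minimal Python sketch of a codon usage table: it tallies how often each synonymous codon is used for a given amino acid across a set of coding sequences. Biopython is an assumed dependency, and the toy sequences are invented for illustration.

```python
# Minimal sketch: per-amino-acid codon usage fractions from coding sequences.
# Assumes Biopython is installed; the example sequences below are made up.
from collections import Counter, defaultdict
from Bio.Data import CodonTable

# Standard genetic code, mapping codons like "GCT" to amino acids like "A".
CODON_TO_AA = CodonTable.unambiguous_dna_by_name["Standard"].forward_table

def codon_usage(coding_sequences):
    """Fraction of the time each synonymous codon encodes its amino acid."""
    counts = Counter()
    for seq in coding_sequences:
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3].upper()
            if codon in CODON_TO_AA:      # skips stop codons and odd bases
                counts[codon] += 1
    totals = defaultdict(int)
    for codon, n in counts.items():
        totals[CODON_TO_AA[codon]] += n
    per_aa = defaultdict(dict)
    for codon, n in counts.items():
        aa = CODON_TO_AA[codon]
        per_aa[aa][codon] = n / totals[aa]
    return dict(per_aa)

# Two short, hypothetical coding sequences.
usage = codon_usage(["ATGGCTGCCGCA", "ATGGCAGCTTAA"])
print(usage["A"])   # e.g. {'GCT': 0.4, 'GCC': 0.2, 'GCA': 0.4}
```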
Many codon “optimization” tools still rely heavily on choosing the most frequent codons in the host organism. Love’s group argues that approach can miss the point. Rare codons can matter, and the placement of codons next to each other can affect how the cell handles the message.
The new model tries to learn that grammar directly from K. phaffii’s own genes. The researchers trained an encoder-decoder style large language model on amino acid sequences and their matching DNA coding sequences from roughly 5,000 proteins naturally produced by the yeast. The training data came from a publicly available dataset at the National Center for Biotechnology Information.
“The model learns the syntax or the language of how these codons are used,” Love said. It accounts for neighboring codons and longer-range relationships across a gene.
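The study used a GRU-based encoder-decoder rather than a Transformer, a choice the article returns to below. As a rough illustration of that family of architectures, here is a minimal PyTorch sketch that reads amino acid tokens and emits one codon per position. Every layer size, vocabulary count, and name here is an assumption for illustration, not the authors' actual model or hyperparameters.

```python
# Illustrative GRU encoder-decoder for amino-acid-to-codon translation.
# All dimensions and vocabularies are assumptions, not the paper's values.
import torch
import torch.nn as nn

N_AMINO_ACIDS = 21   # 20 amino acids + a padding token (assumed vocabulary)
N_CODONS = 65        # 64 codons + a padding token (assumed vocabulary)

class CodonSeq2Seq(nn.Module):
    def __init__(self, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.aa_embed = nn.Embedding(N_AMINO_ACIDS, emb_dim)
        self.codon_embed = nn.Embedding(N_CODONS, emb_dim)
        # Bidirectional encoder sees context on both sides of each residue.
        self.encoder = nn.GRU(emb_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.decoder = nn.GRU(emb_dim, 2 * hidden_dim, batch_first=True)
        self.out = nn.Linear(2 * hidden_dim, N_CODONS)

    def forward(self, aa_tokens, prev_codon_tokens):
        # Encode the full amino acid sequence.
        _, h = self.encoder(self.aa_embed(aa_tokens))
        # Join final forward/backward hidden states to seed the decoder.
        h0 = torch.cat([h[0], h[1]], dim=-1).unsqueeze(0)
        # Decode codon by codon, conditioned on previously chosen codons.
        dec_out, _ = self.decoder(self.codon_embed(prev_codon_tokens), h0)
        return self.out(dec_out)    # logits over codons at each position

model = CodonSeq2Seq()
aa = torch.randint(0, N_AMINO_ACIDS, (2, 50))   # batch of 2 proteins, length 50
prev = torch.randint(0, N_CODONS, (2, 50))      # shifted targets (teacher forcing)
print(model(aa, prev).shape)                    # torch.Size([2, 50, 65])
```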
Testing the model against commercial tools
After training, the team asked the model to design codon-optimized sequences for six proteins that differ in size and complexity: human growth hormone (hGH), human granulocyte colony-stimulating factor (hGCSF), a VHH nanobody called 3B2, an engineered variant of a SARS-CoV-2 receptor binding domain (RBD), human serum albumin (HSA), and the IgG1 monoclonal antibody trastuzumab.
They then compared their model’s designs with sequences produced by four commercial codon optimization tools: Azenta, IDT, GenScript, and Thermo Fisher (Thermo). The researchers inserted each version into K. phaffii cells and measured how much target protein the cells produced.
Across the six proteins, the MIT model produced the best titer for five. For the sixth, it ranked second. Narayanan pointed to the side-by-side testing: “We’ve experimentally compared these approaches and showed that our approach outperforms the others.”
The paper also reports that codon optimization helped some molecules more than others. For hGH and hGCSF, the team observed about a 25% improvement. HSA saw a bigger swing, with about a threefold improvement reported when comparing optimized constructs to the native coding sequence.
The study includes some concrete numbers for serum albumins. Using native sequences, HSA reached a titer of 45 mg/L in K. phaffii, while bovine serum albumin (BSA) and mouse serum albumin (MSA) reached 60 mg/L and 100 mg/L, respectively. Codon optimization lifted BSA and MSA titers to 75 mg/L and 135 mg/L, gains of 25% and 35%.
The commercial tools varied in consistency across proteins. GenScript produced the single best overall titer for one molecule, trastuzumab, but otherwise often landed between 80% and 100% of the top titer across the set. Among the commercial tools, Thermo produced the best titers for three of the six proteins, yet performed poorly on two others, with lower results for the more complex HSA and trastuzumab in this dataset. IDT ranked lowest on the study’s two performance metrics and did not produce the best titer for any tested protein.
What the model seemed to learn, without being told
The researchers did not look only at output titers. They also looked under the hood to see what the model had internalized.
When they visualized the numerical “embeddings” the model learned, amino acids clustered by physicochemical traits. The paper describes groupings such as aliphatic, aromatic, basic, acid/amide, and alcohol categories, with hydrophobic residues clustering together and polar residues clustering together.
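That kind of structure can be inspected with a few lines of analysis: pull the learned embedding matrix out of a trained model, project it to two dimensions, and check whether similar residues land near each other. In this sketch the embedding matrix is random stand-in data and scikit-learn is an assumed dependency; with trained weights, hydrophobic residues such as L, I, and V would be expected to cluster, as the paper reports.

```python
# Sketch: project amino acid embeddings to 2-D and eyeball the clusters.
# The matrix here is a random stand-in for real trained weights.
import numpy as np
from sklearn.decomposition import PCA

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
embeddings = np.random.randn(20, 64)    # placeholder for model.aa_embed.weight

coords = PCA(n_components=2).fit_transform(embeddings)
for aa, (x, y) in zip(AMINO_ACIDS, coords):
    print(f"{aa}: ({x:+.2f}, {y:+.2f})")
```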
They also report that the model avoided certain genetic sequence features that can interfere with expression, even though it was never explicitly trained to do so. For the six experimentally tested proteins, constructs designed by the model contained no negative cis-regulatory elements in the study’s analysis, and they avoided negative repeat elements, matching the behavior of the commercial tools.
Another finding cuts against common shortcuts. The study examined several codon usage bias metrics often used to judge optimized sequences, such as the Codon Adaptation Index (CAI) and codon pair-based measures. The authors report that none of these global metrics consistently correlated with protein titers across different proteins. In some cases, CAI even showed a negative correlation with titer for specific molecules in their dataset.
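For readers unfamiliar with it, CAI is the geometric mean of each codon’s “relative adaptiveness”: how often the host uses that codon compared with its favorite synonymous codon. A minimal, self-contained sketch (the toy usage table covers alanine only; the dictionary format matches the earlier codon_usage() example):

```python
import math

def relative_adaptiveness(usage):
    """w = codon's host frequency / frequency of the best synonymous codon."""
    w = {}
    for codons in usage.values():
        best = max(codons.values())
        for codon, freq in codons.items():
            w[codon] = freq / best
    return w

def cai(coding_sequence, w):
    """Geometric mean of w over the codons in a gene."""
    logs = [math.log(w[coding_sequence[i:i + 3]])
            for i in range(0, len(coding_sequence) - 2, 3)
            if coding_sequence[i:i + 3] in w]
    return math.exp(sum(logs) / len(logs)) if logs else 0.0

# Toy host usage for alanine only, for illustration.
w = relative_adaptiveness({"A": {"GCT": 0.4, "GCC": 0.2, "GCA": 0.4}})
print(cai("GCTGCCGCA", w))   # geometric mean of 1.0, 0.5, 1.0 ≈ 0.79
```

A CAI of 1.0 means every position uses the host’s most frequent codon, which is exactly the shortcut whose link to titer the study calls into question.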
The paper also touches on mRNA stability, estimated using predicted RNA secondary structure folding energy. The model’s constructs tended to be among the more stable designs for each molecule, but the authors report no simple, strong link between predicted stability and final titers across the tested proteins.
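Folding energy of this kind is commonly estimated with packages such as ViennaRNA; the article does not say which tool the authors used, so treat this as a generic sketch with the ViennaRNA Python bindings as an assumed dependency.

```python
# Sketch: minimum free energy (MFE) of a toy transcript via ViennaRNA.
import RNA

mrna = "AUGGCUGCCGCAUAA"                  # toy mRNA, not from the study
structure, mfe = RNA.fold(mrna)
print(structure, f"{mfe:.2f} kcal/mol")   # more negative = more stable fold
```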
Limits, and the cost problem it targets
The team frames this as a way to reduce expensive uncertainty in process development. Love notes that predictive tools can shorten the time from “having an idea to getting it into production,” and argues that removing uncertainty saves time and money.
The paper also flags a key limitation: the model was trained for a single host organism, K. phaffii. The authors report that models trained on other organisms, including humans and cows, produced different predictions, which suggests codon optimization needs species-specific models. They also describe why they used a GRU-based encoder-decoder architecture rather than a Transformer architecture, given the size of the species-specific dataset they used.
The experimental validation in the study covers six proteins, and the authors emphasize the broader complexity of protein production. Codon optimization is only one lever among others such as cellular engineering, media design, and process optimization.
Research findings are available online in the journal PNAS.
The original story "AI breakthrough could dramatically lower the cost of drug development" was published by The Brighter Side of News.
Shy Cohen, Writer