Researchers at Rice University have developed a method for rapidly generating large protein datasets that can be used to train artificial intelligence models. The approach addresses one of the biggest challenges in AI-driven protein engineering: the lack of high-quality experimental data needed to teach models how protein mutations affect function. That gap matters because the number of possible amino acid sequences grows exponentially with protein length, so laboratory testing alone can cover only a tiny fraction of the variants.
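To give a rough sense of that scale, the following back-of-the-envelope sketch counts sequences and low-order mutants. It assumes the 20 standard amino acids and a hypothetical protein length of 100 residues (the article gives no length); the function names are illustrative only.

```python
from math import comb

NUM_AMINO_ACIDS = 20  # the 20 standard amino acids


def total_sequences(length: int) -> int:
    """Number of distinct amino acid sequences of a given length."""
    return NUM_AMINO_ACIDS ** length


def variants_with_k_mutations(length: int, k: int) -> int:
    """Number of variants that differ from a reference at exactly k positions.

    Choose k positions, then pick one of the 19 other residues at each.
    """
    return comb(length, k) * (NUM_AMINO_ACIDS - 1) ** k


length = 100  # hypothetical protein length, chosen only for illustration
print(f"all sequences of length {length}: {total_sequences(length):.3e}")
for k in (1, 2, 3):
    print(f"variants with exactly {k} mutation(s): {variants_with_k_mutations(length, k):,}")
```

Under these assumptions, double mutants alone number about 1.8 million and triple mutants exceed a billion, which is why even an experiment yielding 10 million data points needs a model to extrapolate beyond what was measured.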
The method, called Sequence Display, can produce more than 10 million data points in a single experiment. Scientists create many protein variants, attach a unique DNA barcode to each one, and then measure how active each variant is. The activity-linked barcodes are read out by next-generation sequencing, producing a rich dataset from which AI models can learn. A model trained on this data can then predict which mutations are most likely to improve a protein’s activity or function.
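The barcode-to-activity step can be sketched in a few lines. The article does not describe the actual readout, so this example assumes, purely for illustration, that activity is estimated by comparing each variant's barcode frequency before and after some activity-dependent step. The barcodes, variant names, and read lists are made up.

```python
import math
from collections import Counter


def count_barcodes(reads, barcode_to_variant):
    """Count sequencing reads per variant, ignoring unrecognized barcodes."""
    counts = Counter()
    for read in reads:
        variant = barcode_to_variant.get(read)
        if variant is not None:
            counts[variant] += 1
    return counts


def log2_enrichment(pre, post, pseudocount=1.0):
    """Log2 ratio of post- to pre-selection frequency for each variant.

    A pseudocount keeps the ratio finite when a variant has zero reads.
    """
    n_variants = len(pre)
    pre_total = sum(pre.values()) + pseudocount * n_variants
    post_total = sum(post.values()) + pseudocount * n_variants
    scores = {}
    for variant in pre:
        pre_freq = (pre[variant] + pseudocount) / pre_total
        post_freq = (post.get(variant, 0) + pseudocount) / post_total
        scores[variant] = math.log2(post_freq / pre_freq)
    return scores


# Hypothetical barcode table and reads (assumptions, not data from the study).
barcode_to_variant = {"ACGT": "WT", "TTAG": "A10V", "GGCA": "K25R"}
pre_reads = ["ACGT"] * 4 + ["TTAG"] * 4 + ["GGCA"] * 4
post_reads = ["ACGT"] * 4 + ["TTAG"] * 8 + ["GGCA"] * 1 + ["NNNN"]

pre_counts = count_barcodes(pre_reads, barcode_to_variant)
post_counts = count_barcodes(post_reads, barcode_to_variant)
scores = log2_enrichment(pre_counts, post_counts)

for variant, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{variant}: {score:+.2f}")
```

In this toy data the variant labeled A10V gains reads relative to the wild type and K25R loses them, so their scores fall on opposite sides of the wild-type score. Each (variant, score) pair is one data point of the kind a model could be trained on.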
As a proof of concept, the research team tested the technique on a compact CRISPR-Cas protein. Their goal was to find mutations that would allow the protein to target a wider range of DNA sequences. After being trained on the generated dataset, the AI model predicted mutations that significantly enhanced the protein’s performance. The researchers were also able to repeat the process on several other proteins, showing that the approach is broadly applicable.
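The train-then-predict loop described here can be illustrated with a deliberately simple stand-in. The article does not say what model the researchers used; this sketch fits a plain additive linear model by gradient descent and ranks unmeasured mutation combinations by predicted score. The mutation names and activity scores are invented for the example.

```python
from itertools import combinations


def encode(mutations, vocabulary):
    """Binary feature vector: a bias term plus one indicator per known mutation."""
    return [1.0] + [1.0 if m in mutations else 0.0 for m in vocabulary]


def predict(weights, features):
    """Predicted score is the weighted sum of the features."""
    return sum(w * x for w, x in zip(weights, features))


def fit_linear(rows, targets, learning_rate=0.2, epochs=5000):
    """Least-squares fit by full-batch gradient descent (pure Python)."""
    n, d = len(rows), len(rows[0])
    weights = [0.0] * d
    for _ in range(epochs):
        grad = [0.0] * d
        for x, y in zip(rows, targets):
            error = predict(weights, x) - y
            for j in range(d):
                grad[j] += error * x[j] / n
        weights = [w - learning_rate * g for w, g in zip(weights, grad)]
    return weights


# Hypothetical measured variants and activity scores (not data from the study).
vocabulary = ["A10V", "K25R", "S40T"]
measured = {
    frozenset(): 0.0,
    frozenset({"A10V"}): 1.0,
    frozenset({"K25R"}): -0.5,
    frozenset({"S40T"}): 0.4,
    frozenset({"A10V", "K25R"}): 0.5,
}

rows = [encode(v, vocabulary) for v in measured]
targets = list(measured.values())
weights = fit_linear(rows, targets)

# Score every combination of two or three mutations that was never measured.
candidates = [
    frozenset(c)
    for r in (2, 3)
    for c in combinations(vocabulary, r)
    if frozenset(c) not in measured
]
ranked = sorted(
    ((c, predict(weights, encode(c, vocabulary))) for c in candidates),
    key=lambda item: -item[1],
)
for variant, score in ranked:
    print("+".join(sorted(variant)), f"{score:.2f}")
```

An additive model like this cannot capture interactions between mutations, which is one reason large measured datasets and more expressive models are attractive. The point of the sketch is only the workflow: fit on measured variants, then use predictions to choose which untested variants to try next.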
This development highlights how AI and experimental biology can complement each other. Instead of replacing lab work, the AI depends on experimentally generated data to search a much larger mutation space more efficiently than experiments alone could. The researchers believe this framework could accelerate the discovery of improved research tools, advanced enzymes, and next-generation therapeutic proteins, with potential impact on fields such as medicine, biotechnology, and synthetic biology.