By adapting artificial intelligence models known as large language models, researchers have made great strides in their ability to predict a protein's structure from its amino acid sequence. For antibodies, however, this approach has been less successful, in part because of the hypervariability seen in this type of protein.
To overcome this limitation, MIT researchers have developed a computational technique that allows large language models to predict antibody structures more accurately. Their work could enable researchers to sift through millions of candidate antibodies to identify those that could be used to treat SARS-CoV-2 and other infectious diseases.
“Our method allows us to scale, whereas other approaches cannot, so that we can find a few needles in the haystack,” said Bonnie Berger, head of the Computation and Biology group at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL). “If we could help stop drug companies from entering clinical trials with the wrong thing, it would save a lot of money.”
The technique, which focuses on modeling the hypervariable regions of antibodies, also holds potential for analyzing individuals' entire antibody repertoires. This could be useful for studying the immune response of super responders to diseases such as HIV, to help determine why their antibodies are so effective at fending off the virus.
Bryan Bryson, an associate professor of biological engineering at MIT and a member of the Ragon Institute of MGH, MIT, and Harvard, is also a senior author of the paper, which appeared this week in the Proceedings of the National Academy of Sciences. Rohit Singh, a former CSAIL research scientist who is now an assistant professor of biostatistics and bioinformatics and of cell biology at Duke University, and Chiho Im ’22 are the lead authors of the paper. Researchers from Sanofi and ETH Zurich also contributed to the study.
Modeling hypervariability
A protein consists of a long chain of amino acids, which can fold into an enormous number of possible structures. In recent years, predicting these structures has become much easier to do with artificial intelligence programs such as AlphaFold. Many of these programs, including ESMFold and OmegaFold, are based on large language models, which were originally developed to analyze vast amounts of text, allowing them to learn to predict the next word in a sequence. The same approach can be applied to protein sequences, by learning which protein structures are most likely to form from different patterns of amino acids.
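The idea of treating amino acid sequences like text can be illustrated with a toy sketch. The bigram counter below is not how ESMFold or any real protein language model works — those are large neural networks — but it shows, at a vastly simplified scale, what "learning to predict the next token in a sequence" means when the tokens are amino acids. The training sequences are made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(sequences):
    """Count amino-acid transitions (which residue follows which)
    across a set of protein sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, prev):
    """Return the most frequently observed residue after `prev`."""
    if prev not in counts:
        return None
    return counts[prev].most_common(1)[0][0]

# Toy "training set" of short sequences (one-letter amino acid codes).
toy_sequences = ["ACDEFGAC", "ACDKLMAC", "ACDEFGHI"]
model = train_bigram(toy_sequences)
print(predict_next(model, "A"))  # 'C' — it follows 'A' in every example
print(predict_next(model, "D"))
```

A real model conditions on the entire preceding context rather than a single residue, which is what lets it absorb the long-range constraints that evolution imposes on most proteins.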
However, this technique does not always work well on antibodies, especially on a segment of the antibody known as the hypervariable region. Antibodies usually have a Y-shaped structure, and the hypervariable regions are located at the tips of the Y, where they detect and bind to foreign proteins, known as antigens. The bottom part of the Y provides structural support and helps the antibody interact with immune cells.
Hypervariable regions vary in length but usually contain fewer than 40 amino acids. It has been estimated that by varying the sequence of these amino acids, the human immune system can produce up to 100 billion different antibodies, ensuring that the body can respond to a huge variety of potential antigens. Because these sequences aren't constrained by evolution the way other protein sequences are, it is difficult for large language models to learn to predict their structures accurately.
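A quick back-of-the-envelope calculation shows why this region is so hard to model: a 40-residue stretch drawn from 20 amino acids spans a sequence space so large that even the roughly 100 billion antibodies cited above sample only a vanishing fraction of it, leaving a model few evolutionary regularities to lean on.

```python
# Scale check: possible 40-residue sequences vs. the ~10^11
# antibodies the immune system is estimated to produce.
space = 20 ** 40                # 20 amino acids, 40 positions
repertoire = 100_000_000_000    # ~100 billion (figure from the article)
print(f"sequence space:    {space:.3e}")
print(f"fraction realized: {repertoire / space:.3e}")
```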
“Part of the reason why language models can predict protein structure well is that evolution constrains these sequences in ways that the model can decipher,” Singh said. “It's similar to learning the rules of grammar by looking at the context of words in a sentence, allowing you to figure out what they mean.”
To model those hypervariable regions, the researchers created two modules that build on existing protein language models. One module was trained on hypervariable sequences from about 3,000 antibody structures in the Protein Data Bank (PDB), allowing it to learn which sequences tend to produce similar structures. The other module was trained on data linking about 3,700 antibody sequences to how strongly they bind to three different antigens.
The resulting computational model, known as AbMap, can predict antibody structures and binding strength based on their amino acid sequences. To demonstrate the model's usefulness, the researchers used it to predict antibody structures that would strongly neutralize the spike protein of the SARS-CoV-2 virus.
The researchers began with a set of antibodies that had been predicted to bind to this target, then generated millions of variants by changing the hypervariable regions. Their model was able to identify the antibody structures that would be most successful, much more accurately than traditional protein-structure models based on large language models.
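The mutate-and-score loop described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: `toy_score` is a placeholder where a model like AbMap's binding predictor would go, and the starting sequence is invented.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def mutate(hv_region, n_mutations=1, rng=random):
    """Return a copy of a hypervariable region with random point mutations."""
    seq = list(hv_region)
    for _ in range(n_mutations):
        pos = rng.randrange(len(seq))
        seq[pos] = rng.choice(AMINO_ACIDS)
    return "".join(seq)

def screen(hv_region, score_fn, n_variants=1000, top_k=5):
    """Generate variants of a hypervariable region and keep the
    highest-scoring ones. `score_fn` stands in for a learned
    binding-strength predictor."""
    variants = {mutate(hv_region) for _ in range(n_variants)}
    return sorted(variants, key=score_fn, reverse=True)[:top_k]

# Placeholder scorer: rewards glycine content. Purely illustrative —
# not a real binding model.
toy_score = lambda s: s.count("G")
best = screen("ACDEFGHIKL", toy_score)
print(best)
```

In practice the same loop would run over millions of variants, with the scoring step batched through the model.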
The researchers then took the additional step of clustering the antibodies into groups with similar structures. They chose antibodies from each of these clusters to test experimentally, working with researchers at Sanofi. Those experiments found that 82 percent of these antibodies had better binding strength than the original antibodies that went into the model.
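Picking one representative per structural cluster, rather than simply taking the top scorers, is what yields a diverse panel of candidates. A minimal sketch of that idea, using a simple greedy clustering over embedding vectors (the embeddings and distance threshold here are toy stand-ins for model-derived structural representations):

```python
def greedy_cluster(embeddings, threshold):
    """Assign each item to the first cluster whose representative lies
    within `threshold` (Euclidean distance); otherwise start a new
    cluster. Returns {item_id: representative_id}."""
    reps = []        # list of (representative_id, representative_embedding)
    assignment = {}
    for item_id, emb in embeddings.items():
        for rep_id, rep_emb in reps:
            dist = sum((a - b) ** 2 for a, b in zip(emb, rep_emb)) ** 0.5
            if dist <= threshold:
                assignment[item_id] = rep_id
                break
        else:
            reps.append((item_id, emb))
            assignment[item_id] = item_id
    return assignment

# Toy 2-D "structural embeddings" for three antibodies.
toy = {"ab1": (0.0, 0.0), "ab2": (0.1, 0.0), "ab3": (5.0, 5.0)}
print(greedy_cluster(toy, threshold=1.0))
# ab1 and ab2 share a cluster; ab3 forms its own
```

Testing one antibody from each cluster then covers the structural diversity of the pool with far fewer experiments.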
Identifying a variety of good candidates early in the development process could help pharmaceutical companies avoid spending a lot of money testing candidates that end up failing later, the researchers say.
“They don’t want to put all their eggs in one basket,” Singh said. “They don’t want to say, I’m going to take this one antibody and put it through preclinical trials, and then it turns out to be toxic. They’d rather have a set of good possibilities and move all of them through, so that if one goes wrong, they have some alternatives.”
Comparing antibodies
Using this technique, researchers could also try to answer some longstanding questions about why different people respond to infection differently. For example, why do some people develop much more severe forms of Covid, and why do some people who are exposed to HIV never become infected?
Scientists have been trying to answer those questions by performing single-cell RNA sequencing of individuals' immune cells and comparing them, a process known as antibody repertoire analysis. Previous work has shown that antibody repertoires from two different people may overlap by as little as 10 percent. However, sequencing doesn't offer as comprehensive a picture of antibody performance as structural information does, because two antibodies with different sequences may have similar structures and functions.
The new model can help solve this problem by quickly generating structures for all of the antibodies found in an individual. In this study, the researchers showed that when structure is taken into account, there is much more overlap between individuals than the 10 percent seen in sequence comparisons. They now plan to further investigate how these structures contribute to the body's overall immune response against particular pathogens.
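The difference between the two overlap measures can be sketched concretely. Below, exact-sequence overlap is compared against a structure-aware overlap that counts an antibody as shared if it has a close neighbor in an embedding space; `toy_embed` is an invented stand-in for a structure-aware model, and the sequences are made up.

```python
def sequence_overlap(rep_a, rep_b):
    """Fraction of repertoire A's sequences found verbatim in B."""
    return len(set(rep_a) & set(rep_b)) / len(set(rep_a))

def structural_overlap(rep_a, rep_b, embed, threshold):
    """Fraction of A's antibodies with a structural neighbor in B,
    judged by Euclidean distance in an embedding space."""
    hits = 0
    for a in rep_a:
        ea = embed(a)
        if any(sum((x - y) ** 2 for x, y in zip(ea, embed(b))) ** 0.5 <= threshold
               for b in rep_b):
            hits += 1
    return hits / len(rep_a)

# Toy embedding: crude composition features, so sequences that differ
# slightly still land close together.
toy_embed = lambda s: (s.count("G"), s.count("Y"), len(s))
a = ["GGYA", "GYAA", "AAAA"]
b = ["GGYA", "GYAG", "CCCC"]
print(sequence_overlap(a, b))             # 1/3: only "GGYA" matches exactly
print(structural_overlap(a, b, toy_embed, threshold=1.5))
```

Because structurally similar antibodies need not share a sequence, the structure-aware measure can only be greater than or equal to the exact-match one, which mirrors the study's finding that structural overlap between individuals far exceeds the 10 percent seen at the sequence level.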
“This is where a language model fits really nicely, because it has the scalability of sequence-based analysis, but it approaches the accuracy of structure-based analysis,” Singh said.
The research was funded by Sanofi and the Abdul Latif Jameel Clinic for Machine Learning in Health.