



PLAID is a multimodal generative model that simultaneously generates protein 1D sequences and 3D structures by learning the latent space of protein folding models.

The awarding of the 2024 Nobel Prize in Chemistry for AlphaFold2 marks an important moment for the role of AI in biology. What comes next, after the protein is folded?

In PLAID, we develop a method that learns to sample from the latent space of protein folding models to generate new proteins. It can accept compositional function and organism prompts, and it can be trained on sequence databases, which are 2-4 orders of magnitude larger than structural databases. Unlike many previous protein structure generative models, PLAID addresses the multimodal co-generation problem: simultaneously generating both discrete sequences and continuous all-atom structural coordinates.

From structure prediction to real-world drug design

Although recent works demonstrate the promise of diffusion models for generating proteins, previous models have limitations that make them impractical for real-world applications, for example:

  • All-atom generation: Many existing generative models produce only backbone atoms. To produce the all-atom structure and place the side-chain atoms, we need to know the sequence. This creates a multimodal generation problem that requires generating both discrete and continuous modalities.
  • Organism specificity: Protein biologics intended for human use need to be humanized to avoid being destroyed by the human immune system.
  • Control specifications: Discovering drugs and getting them to patients is a complicated process. How do we specify these complex constraints? For example, even with the biology tackled, you might decide that tablets are easier to transport than vials, adding a new constraint on solubility.

Generating “useful” proteins

Simply generating proteins is less useful than controlling the generation to obtain useful proteins. What might an interface for this look like?



For inspiration, consider how image generation can be controlled by composing textual prompts (e.g., Liu et al., 2022).

In PLAID, we mirror this interface for control specification. The ultimate goal is to control generation entirely through a textual interface, but here, as a proof of concept, we consider compositional constraints along two axes: function and organism:



Learning the function-structure-sequence connection. PLAID learns the tetrahedral cysteine-Fe2+/Fe3+ coordination pattern often found in metalloproteins, while maintaining high sequence-level diversity.
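To make the compositional interface concrete, here is a minimal sketch of how function and organism conditions might be composed inside a diffusion sampler, in the spirit of Liu et al., 2022. The `denoise` function and the prompt embeddings below are hypothetical stand-ins for illustration, not the actual PLAID API.

```python
# Minimal sketch (NumPy) of compositional conditioning in a diffusion sampler.
import numpy as np

def composed_noise_estimate(denoise, x_t, t, cond_embeddings, weights):
    """Combine per-condition noise estimates around an unconditional baseline:
    eps = eps_uncond + sum_i w_i * (eps_cond_i - eps_uncond)."""
    eps_uncond = denoise(x_t, t, None)
    eps = eps_uncond.copy()
    for cond, w in zip(cond_embeddings, weights):
        eps += w * (denoise(x_t, t, cond) - eps_uncond)
    return eps

# Dummy denoiser and prompts, just so the sketch runs end to end.
def dummy_denoise(x_t, t, cond):
    shift = 0.0 if cond is None else cond.mean()
    return 0.1 * x_t + shift

x_t = np.random.randn(256, 32)            # latent at timestep t (toy shape)
function_prompt = np.random.randn(8)      # e.g., a function-keyword embedding (hypothetical)
organism_prompt = np.random.randn(8)      # e.g., an organism embedding (hypothetical)

eps_hat = composed_noise_estimate(
    dummy_denoise, x_t, t=500,
    cond_embeddings=[function_prompt, organism_prompt],
    weights=[1.5, 1.5],
)
print(eps_hat.shape)
```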

Training using sequence-only training data

Another important aspect of PLAID is that we only need sequences to train the generative model! Generative models learn the data distribution defined by their training data, and sequence databases are much larger than structural ones, since sequences are much cheaper to obtain than experimental structures.



Learning from a larger and broader database. Obtaining protein sequences is much cheaper than experimentally characterizing structures, and sequence databases are 2-4 orders of magnitude larger than structural ones.

How does it work?

We are able to train a generative model that produces structure using only sequence data by learning a diffusion model over the latent space of a protein folding model. Then, during inference, after sampling from this latent space of valid proteins, we can use frozen weights from the protein folding model to decode structure. Here we use ESMFold, a successor to AlphaFold2 that replaces the retrieval step with a protein language model.



Our approach. During training, only sequences are needed to obtain embeddings; during inference, we can decode sequence and structure from the sampled embedding. ❄️ denotes frozen weights.
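The split between sequence-only training and two-modality decoding at inference can be summarized in a short sketch. Everything below is a toy stand-in: `embed_sequence`, `decode_structure`, and `decode_sequence` only mimic the role of the frozen ESMFold components, and the diffusion steps are placeholders rather than the actual training objective or sampler.

```python
# Toy sketch of the training/inference split; all components are stand-ins.
import numpy as np

rng = np.random.default_rng(0)

def embed_sequence(seq):          # frozen encoder: sequence -> per-residue latent
    return rng.standard_normal((len(seq), 16))

def decode_structure(latent):     # frozen structure head: latent -> all-atom coords
    return rng.standard_normal((latent.shape[0], 37, 3))

def decode_sequence(latent):      # frozen sequence head: latent -> amino acids
    return "".join("ACDEFGHIKLMNPQRSTVWY"[i % 20] for i in range(latent.shape[0]))

def denoiser(z_t, t):             # the only trainable part (trivial placeholder here)
    return np.zeros_like(z_t)

# --- Training: only sequences are required -------------------------------
for seq in ["MKTAYIAKQR", "GSHMLEDPVA"]:
    z0 = embed_sequence(seq)                 # clean latent from the frozen encoder
    t = rng.integers(1, 1000)
    noise = rng.standard_normal(z0.shape)
    z_t = z0 + (t / 1000.0) * noise          # toy forward (noising) process
    # ...fit `denoiser` to predict `noise` from (z_t, t)...

# --- Inference: sample a latent, then decode both modalities -------------
z = rng.standard_normal((10, 16))            # start from pure noise
for t in range(999, 0, -100):
    z = z - (1.0 / 1000.0) * denoiser(z, t)  # toy reverse (denoising) updates
coords = decode_structure(z)                 # continuous 3D all-atom structure
sequence = decode_sequence(z)                # discrete 1D sequence
print(coords.shape, sequence)
```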

In this way, we can use the structural understanding captured in the weights of pretrained protein folding models for the protein design task. This is analogous to how vision-language-action (VLA) models in robotics leverage the priors contained in vision-language models (VLMs) trained on internet-scale data to provide perception, reasoning, and understanding.

Compressing the latent space of protein folding models

One small wrinkle with directly applying this approach is that the latent space of ESMFold (and, in fact, of many transformer-based models) requires a lot of regularization. This space is also very large, so learning over this embedding ends up being akin to high-resolution image synthesis.

To address this, we also propose CHEAP (Compressed Hourglass Embedding Adaptations of Proteins), where we learn a compression model for the joint embedding of protein sequence and structure.
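As a rough illustration of the compression idea, the sketch below shows an hourglass-style autoencoder that shrinks a per-residue embedding along both the channel and length dimensions and is trained with a reconstruction loss. The dimensions, pooling choice, and layers are illustrative assumptions, not the published CHEAP architecture.

```python
# Illustrative hourglass-style compressor (PyTorch); shapes are assumptions.
import torch
import torch.nn as nn

class HourglassCompressor(nn.Module):
    def __init__(self, d_in=1024, d_latent=64, pool=2):
        super().__init__()
        self.down = nn.Linear(d_in, d_latent)                # channel compression
        self.pool = nn.AvgPool1d(kernel_size=pool)           # length compression
        self.unpool = nn.Upsample(scale_factor=pool, mode="nearest")
        self.up = nn.Linear(d_latent, d_in)

    def encode(self, x):                       # x: (batch, length, d_in)
        z = self.down(x).transpose(1, 2)       # (batch, d_latent, length)
        return self.pool(z).transpose(1, 2)    # (batch, length / pool, d_latent)

    def decode(self, z):
        z = self.unpool(z.transpose(1, 2)).transpose(1, 2)
        return self.up(z)

x = torch.randn(4, 128, 1024)                  # per-residue embeddings (stand-in)
model = HourglassCompressor()
z = model.encode(x)                            # compressed latent
x_hat = model.decode(z)
loss = nn.functional.mse_loss(x_hat, x)        # reconstruction objective
print(z.shape, loss.item())
```

In practice the compressed latent must remain decodable into both sequence and structure, which is what makes compressing the joint embedding non-trivial.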



Investigating the latent space. (a) When we visualize the mean value of each channel, some channels exhibit “massive activations”. (b) Comparing the top activations against the median value (gray), we find that this happens across many layers. (c) Massive activations have also been observed in other transformer-based models.

We found that this latent space is actually highly compressible. By doing a bit of mechanistic interpretability to better understand the base model we are working with, we were able to create an all-atom protein generation model.
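The kind of check described above can be illustrated in a few lines: compare the largest per-channel activation magnitudes against the median to flag “massive activations”. The embedding here is random with one injected outlier channel, purely to demonstrate the diagnostic; in practice it would come from the folding model's intermediate layers.

```python
# Small diagnostic sketch for spotting "massive activations" in an embedding.
import numpy as np

emb = np.random.randn(128, 1024)              # (length, channels), stand-in data
emb[:, 7] += 50.0                             # inject one outlier channel

channel_means = np.abs(emb).mean(axis=0)      # mean |activation| per channel
top3 = np.sort(channel_means)[-3:]            # largest per-channel magnitudes
median = np.median(channel_means)
print("top-3 / median ratio:", top3 / median) # large ratios flag massive activations
```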

What’s next?

Although we examine protein sequence and structure generation in this work, this approach can be adapted to multimodal generation for any set of modalities where there is a predictor from a more abundant modality to a less abundant one. As protein sequence-to-structure predictors begin to handle increasingly complex systems (for example, AlphaFold3 can also predict proteins in complex with nucleic acids and small-molecule ligands), it is easy to imagine multimodal generation over more complex systems using the same method. If you are interested in collaborating to extend our approach, or want to test it in the wet lab, please reach out!

If you find our papers useful in your research, please consider using the following BibTeX for PLAID and CHEAP:

@article{lu2024generating,
  title={Generating All-Atom Protein Structure from Sequence-Only Training Data},
  author={Lu, Amy X and Yan, Wilson and Robinson, Sarah A and Yang, Kevin K and Gligorijevic, Vladimir and Cho, Kyunghyun and Bonneau, Richard and Abbeel, Pieter and Frey, Nathan},
  journal={bioRxiv},
  pages={2024--12},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}
@article{lu2024tokenized,
  title={Tokenized and Continuous Embedding Compressions of Protein Sequence and Structure},
  author={Lu, Amy X and Yan, Wilson and Yang, Kevin K and Gligorijevic, Vladimir and Cho, Kyunghyun and Abbeel, Pieter and Bonneau, Richard and Frey, Nathan},
  journal={bioRxiv},
  pages={2024--08},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}

You can also check out our preprints (PLAID, CHEAP) and codebases (PLAID, CHEAP).

Some bonus protein fun!



Additional generations from PLAID prompted with function keywords.



Unconditional generations from PLAID.



Transmembrane proteins have hydrophobic residues at the core, where the protein is embedded in the fatty acid layer. These patterns are consistently observed when prompting PLAID with transmembrane protein keywords.



Additional examples of active-site recapitulation based on function keyword prompting.



Comparing samples between PLAID and all-atom baselines. PLAID samples show better diversity and capture the beta-strand pattern, which has been harder for protein generative models to learn.

Acknowledgements

Thanks to Nathan Frey for his detailed feedback on this article, and to co-authors across BAIR, Genentech, Microsoft Research, and New York University: Wilson Yan, Sarah A. Robinson, Simon Kelow, Kevin K. Yang, Vladimir Gligorijevic, Kyunghyun Cho, Richard Bonneau, Pieter Abbeel, and Nathan Frey.
