Each cell in your body contains the same genetic sequence, yet each cell expresses only a subset of those genes. These cell-specific gene expression patterns, which ensure that a brain cell differs from a skin cell, depend in part on the three-dimensional structure of the genetic material, which controls the accessibility of each gene.
Now, chemists at MIT have proposed a new way to use generative artificial intelligence to determine these 3D genomic structures. Their technique can predict thousands of structures in just a few minutes, making it much faster than existing experimental methods for analyzing structures.
Using this technique, researchers can more easily study how the 3D organization of the genome affects the gene expression patterns and functions of individual cells.
“Our goal is to try to predict three-dimensional genomic structure from the underlying DNA sequences,” said Bin Zhang, associate professor of chemistry and senior author of the study. “Now we can do that, which makes this technology comparable to cutting-edge experimental techniques, and it really opens up a lot of interesting opportunities.”
MIT graduate students Greg Schuette and Zhuohan Lao are the lead authors of the paper, which appears today in Science Advances.
From sequence to structure
Inside the cell nucleus, DNA and proteins form a complex called chromatin, which has multiple levels of organization that allow cells to pack 2 meters of DNA into a nucleus only about one-hundredth of a millimeter in diameter. Long strands of DNA wind around proteins called histones, producing a structure that resembles beads on a string.
Chemical tags called epigenetic modifications can be attached to DNA at specific locations. These tags, which vary by cell type, affect the folding of chromatin and the accessibility of nearby genes. These differences in chromatin conformation help determine which genes are expressed in different cell types or at different times within a given cell.
Over the past 20 years, scientists have developed experimental techniques for determining chromatin structure. One widely used technique, called Hi-C, works by linking together neighboring DNA strands in the nucleus. Researchers can then determine which segments lie close to each other by cutting the DNA into many small pieces and sequencing them.
This method can be applied to large cell populations to calculate the average structure of a section of chromatin, or to a single cell to determine the structure within that particular cell. However, Hi-C and similar techniques are labor-intensive, and it can take about a week to generate data from one cell.
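The difference between a single-cell structure and a population average can be illustrated with a small sketch. The coordinates, the distance cutoff, and the function names below are hypothetical and chosen only for illustration; real Hi-C pipelines infer contacts from sequencing reads rather than from known coordinates.

```python
import math

def contact_map(coords, cutoff):
    """Return a binary matrix: 1 where two distinct segments lie within `cutoff`."""
    n = len(coords)
    return [[1 if i != j and math.dist(coords[i], coords[j]) <= cutoff else 0
             for j in range(n)] for i in range(n)]

def average_contact_map(maps):
    """Average per-cell contact matrices into contact frequencies (0.0 to 1.0)."""
    n = len(maps[0])
    return [[sum(m[i][j] for m in maps) / len(maps) for j in range(n)]
            for i in range(n)]

# Two hypothetical single-cell conformations of the same 4-segment region.
cell_a = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]  # extended chain
cell_b = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]  # chain folded into a loop

single_cell_maps = [contact_map(c, cutoff=1.0) for c in (cell_a, cell_b)]
population_map = average_contact_map(single_cell_maps)

# Segments 0 and 3 touch only in the folded cell, so their frequency is 0.5.
print(population_map[0][3])
```

The averaged map reports how often two segments are in contact across cells, which is why a population measurement cannot by itself reveal the structure inside any one cell.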
To overcome these limitations, Zhang and his students developed a model that leverages recent advances in generative AI to create a fast, accurate method for predicting chromatin structures in single cells. The AI model they designed can quickly analyze DNA sequences and predict the chromatin structures that these sequences might produce in a cell.
“Deep learning is really good at pattern recognition,” Zhang said. “It allows us to analyze very long DNA fragments, thousands of base pairs, and figure out what important information is encoded in these DNA base pairs.”
ChromoGen, the model the researchers created, has two components. The first component is a deep learning model taught to “read” the genome. It analyzes the information encoded in the underlying DNA sequence and in chromatin accessibility data, the latter of which is widely available and specific to each cell type.
The second component is a generative AI model that predicts physically accurate chromatin conformations and has been trained on more than 11 million chromatin conformations. These data were generated from experiments on 16 human B lymphocyte cells using Dip-C (a variant of Hi-C).
When the two are integrated, the first component tells the generative model how the cell-type-specific environment affects the formation of different chromatin structures, and this scheme effectively captures sequence-structure relationships. For each sequence, the researchers use their model to generate many possible structures. That’s because DNA is a very disordered molecule, so a single DNA sequence can produce many different possible conformations.
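The flow of information between the two components can be sketched as follows. This is a toy illustration only, not the researchers' model: `encode`, `sample_conformation`, the two-number summary, and the rule linking accessibility to step length are all hypothetical stand-ins, used to show how one conditioning input plus fresh random noise yields many different structures.

```python
import math
import random

def encode(sequence, accessibility):
    """Toy stand-in for the first component: summarize the inputs as two numbers."""
    gc_fraction = sum(base in "GC" for base in sequence) / len(sequence)
    mean_access = sum(accessibility) / len(accessibility)
    return gc_fraction, mean_access

def sample_conformation(condition, n_beads, rng):
    """Toy stand-in for the second component: a 3D random walk.

    The step length depends on the condition. This is an arbitrary rule chosen
    only to show that the conditioning input changes the generated output.
    """
    _, mean_access = condition
    step = 1.0 + mean_access
    coords = [(0.0, 0.0, 0.0)]
    for _ in range(n_beads - 1):
        # Draw a uniformly random direction on the unit sphere.
        z = rng.uniform(-1.0, 1.0)
        phi = rng.uniform(0.0, 2.0 * math.pi)
        r = math.sqrt(1.0 - z * z)
        x0, y0, z0 = coords[-1]
        coords.append((x0 + step * r * math.cos(phi),
                       y0 + step * r * math.sin(phi),
                       z0 + step * z))
    return coords

rng = random.Random(0)  # fixed seed so the sketch is reproducible
condition = encode("ACGTGGCCAT", accessibility=[0.2, 0.8, 0.5])

# One sequence, one condition, many sampled conformations.
ensemble = [sample_conformation(condition, n_beads=10, rng=rng) for _ in range(100)]
```

Each call reuses the same condition but draws new random numbers, so the 100 structures differ from one another while sharing the same conditioning input.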
“One of the main complicating factors in predicting genome structure is that, no matter what part of the genome you are looking at, there is no single solution,” Schuette said.
Quick analysis
Once trained, the model can produce predictions much faster than Hi-C or other experimental techniques.
“While you might spend six months running experiments to get a few dozen structures in a given cell type, you can use our model to generate a thousand structures in a particular region in 20 minutes on just one GPU,” Schuette said.
After training their model, the researchers used it to generate structural predictions of more than 2,000 DNA sequences, and then compared them to experimentally determined structures for these sequences. They found that the model produced structures that were the same or very similar to those seen in the experimental data.
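One simple way to compare a predicted ensemble with an experimental one is to average the pairwise distances in each ensemble and correlate the two resulting maps. The sketch below assumes that approach for illustration; the article does not say which metrics the researchers used, and the toy coordinates and function names are hypothetical.

```python
import math

def mean_distance_map(ensemble):
    """Average the pairwise-distance matrix over a list of conformations."""
    n = len(ensemble[0])
    return [[sum(math.dist(c[i], c[j]) for c in ensemble) / len(ensemble)
             for j in range(n)] for i in range(n)]

def upper_triangle(matrix):
    """Flatten the entries above the diagonal, one per pair of segments."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical toy ensembles for the same 4-segment region.
experimental = [
    [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)],
    [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)],
]
predicted = [
    [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0.2, 0)],
    [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0.1, 1, 0)],
]

similarity = pearson(upper_triangle(mean_distance_map(experimental)),
                     upper_triangle(mean_distance_map(predicted)))
```

A correlation near 1 means the two ensembles agree on which segments tend to be near or far from each other, even if no individual structure matches exactly.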
“We usually look at hundreds or thousands of conformations for each sequence, which allows you to reasonably represent the diversity of structures that a particular region can have,” Zhang said. “If you repeat the experiment multiple times, in different cells, you will most likely end up in very different conformations. That’s what our model is trying to predict.”
The researchers also found that the model can make accurate predictions for data from cell types other than the one it was trained on. This suggests that the model could be useful for analyzing how chromatin structures differ between cell types and how these differences affect their function. The model can also be used to explore the different chromatin states that may exist within a single cell and how changes between those states affect gene expression.
“ChromoGen provides a new framework for AI-driven discovery of genome folding principles and demonstrates that generative AI can bridge genomic and epigenomic features with 3D genome structure, pointing to future studies of variation in genome structure and function across a broad range of biological contexts,” said a computational biology researcher at Carnegie Mellon University who was not involved in the study.
Another possible application is to explore how mutations in specific DNA sequences change chromatin conformation, which may elucidate how such mutations cause disease.
“I think we can solve a lot of interesting problems with this model,” Zhang said.
The researchers have made all of their data and the model available to others who wish to use them.
The study was funded by the National Institutes of Health.